[RFC,0/8] Introduce provisioning primitives for thinly provisioned storage

Message ID	20220915164826.1396245-1-sarthakkukreti@google.com (mailing list archive)
Headers	show Return-Path: <dm-devel-bounces@redhat.com> From: Sarthak Kukreti <sarthakkukreti@chromium.org> To: dm-devel@redhat.com, linux-block@vger.kernel.org, linux-ext4@vger.kernel.org, linux-kernel@vger.kernel.org, virtualization@lists.linux-foundation.org Date: Thu, 15 Sep 2022 09:48:18 -0700 Message-Id: <20220915164826.1396245-1-sarthakkukreti@google.com> MIME-Version: 1.0 X-Mimecast-Impersonation-Protect: Policy=CLT - Impersonation Protection Definition; Similar Internal Domain=false; Similar Monitored External Domain=false; Custom External Domain=false; Mimecast External Domain=false; Newly Observed Domain=false; Internal User Name=false; Custom Display Name List=false; Reply-to Address Mismatch=false; Targeted Threat Dictionary=false; Mimecast Threat Dictionary=false; Custom Threat Dictionary=false Subject: [dm-devel] [PATCH RFC 0/8] Introduce provisioning primitives for thinly provisioned storage Precedence: list Cc: Jens Axboe <axboe@kernel.dk>, Gwendal Grignou <gwendal@google.com>, Theodore Ts'o <tytso@mit.edu>, Sarthak Kukreti <sarthakkukreti@chromium.org>, "Michael S . Tsirkin" <mst@redhat.com>, Jason Wang <jasowang@redhat.com>, Bart Van Assche <bvanassche@google.com>, Mike Snitzer <snitzer@kernel.org>, Evan Green <evgreen@google.com>, Andreas Dilger <adilger.kernel@dilger.ca>, Daniil Lunev <dlunev@google.com>, Stefan Hajnoczi <stefanha@redhat.com>, Paolo Bonzini <pbonzini@redhat.com>, Alasdair Kergon <agk@redhat.com> Errors-To: dm-devel-bounces@redhat.com Sender: "dm-devel" <dm-devel-bounces@redhat.com> Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: base64
Series	Introduce provisioning primitives for thinly provisioned storage \| expand [RFC,0/8] Introduce provisioning primitives for thinly provisioned storage [RFC,1/8] block: Introduce provisioning primitives [RFC,2/8] dm: Add support for block provisioning [RFC,3/8] virtio_blk: Add support for provision requests [RFC,4/8] fs: Introduce FALLOC_FL_PROVISION [RFC,5/8] loop: Add support for provision requests [RFC,6/8] ext4: Add support for FALLOC_FL_PROVISION [RFC,7/8] ext4: Add mount option for provisioning blocks during allocations [RFC,8/8] ext4: Add a per-file provision override xattr

Sarthak Kukreti Sept. 15, 2022, 4:48 p.m. UTC

From: Sarthak Kukreti <sarthakkukreti@chromium.org>

Hi,

This patch series is an RFC of a mechanism to pass through provision requests on stacked thinly provisioned storage devices/filesystems.

The linux kernel provides several mechanisms to set up thinly provisioned block storage abstractions (eg. dm-thin, loop devices over sparse files), either directly as block devices or backing storage for filesystems. Currently, short of writing data to either the device or filesystem, there is no way for users to pre-allocate space for use in such storage setups. Consider the following use-cases:

1) Suspend-to-disk and resume from a dm-thin device: In order to ensure that the underlying thinpool metadata is not modified during the suspend mechanism, the dm-thin device needs to be fully provisioned.
2) If a filesystem uses a loop device over a sparse file, fallocate() on the filesystem will allocate blocks for files but the underlying sparse file will remain intact.
3) Another example is virtual machine using a sparse file/dm-thin as a storage device; by default, allocations within the VM boundaries will not affect the host.
4) Several storage standards support mechanisms for thin provisioning on real hardware devices. For example:
  a. The NVMe spec 1.0b section 2.1.1 loosely talks about thin provisioning: "When the THINP bit in the NSFEAT field of the Identify Namespace data structure is set to ‘1’, the controller ... shall track the number of allocated blocks in the Namespace Utilization field"
  b. The SCSi Block Commands reference - 4 section references "Thin provisioned logical units",
  c. UFS 3.0 spec section 13.3.3 references "Thin provisioning".

In all of the above situations, currently the only way for pre-allocating space is to issue writes (or use WRITE_ZEROES/WRITE_SAME). However, that does not scale well with larger pre-allocation sizes. 

This patchset introduces primitives to support block-level provisioning (note: the term 'provisioning' is used to prevent overloading the term 'allocations/pre-allocations') requests across filesystems and block devices. This allows fallocate() and file creation requests to reserve space across stacked layers of block devices and filesystems. Currently, the patchset covers a prototype on the device-mapper targets, loop device and ext4, but the same mechanism can be extended to other filesystems/block devices as well as extended for use with devices in 4 a-c.

Patch 1 introduces REQ_OP_PROVISION as a new request type. The provision request acts like the inverse of a discard request; instead of notifying lower layers that the block range will no longer be used, provision acts as a request to lower layers to provision disk space for the given block range. Real hardware storage devices will currently disable the provisioing capability but for the standards listed in 4a.-c., REQ_OP_PROVISION can be overloaded for use as the provisioing primitive for future devices.

Patch 2 implements REQ_OP_PROVISION handling for some of the device-mapper targets. This additionally adds support for pre-allocating space for thinly provisioned logical volumes via fallocate()

Patch 3 implements the handling for virtio-blk.

Patch 4 introduces an fallocate() mode (FALLOC_FL_PROVISION) that sends a provision request to the underlying block device (and beyond). This acts as the primary mechanism for file-level provisioing.

Patch 5 wires up the loop device handling of REQ_OP_PROVISION.

Patches 6-8 cover a prototype implementation for ext4, which includes wiring up the fallocate() implementation, introducing a filesystem level option (called 'provision') to control the default allocation behaviour and finally a file level override to retain current handling, even on filesystems mounted with 'provision'

Testing:
--------
- A backport of this patch series was tested on ChromiumOS using a 5.10 kernel.
- File on ext4 on a thin logical volume: fallocate(FALLOC_FL_PROVISION) : 4.6s, dd if=/dev/zero of=...: 6 mins.

TODOs:
------
1) The stacked block devices (dm-*, loop etc.) currently unconditionally pass through provision requests. Add support for provision, similar to how discard handling is set up (with options to disable, passdown or passthrough requests).
2) Blktests and Xfstests for validating provisioning.

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel

Stefan Hajnoczi Sept. 16, 2022, 6:09 a.m. UTC | #1

On Thu, Sep 15, 2022 at 09:48:18AM -0700, Sarthak Kukreti wrote:
> From: Sarthak Kukreti <sarthakkukreti@chromium.org>
> 
> Hi,
> 
> This patch series is an RFC of a mechanism to pass through provision requests on stacked thinly provisioned storage devices/filesystems.
> 
> The linux kernel provides several mechanisms to set up thinly provisioned block storage abstractions (eg. dm-thin, loop devices over sparse files), either directly as block devices or backing storage for filesystems. Currently, short of writing data to either the device or filesystem, there is no way for users to pre-allocate space for use in such storage setups. Consider the following use-cases:
> 
> 1) Suspend-to-disk and resume from a dm-thin device: In order to ensure that the underlying thinpool metadata is not modified during the suspend mechanism, the dm-thin device needs to be fully provisioned.
> 2) If a filesystem uses a loop device over a sparse file, fallocate() on the filesystem will allocate blocks for files but the underlying sparse file will remain intact.
> 3) Another example is virtual machine using a sparse file/dm-thin as a storage device; by default, allocations within the VM boundaries will not affect the host.
> 4) Several storage standards support mechanisms for thin provisioning on real hardware devices. For example:
>   a. The NVMe spec 1.0b section 2.1.1 loosely talks about thin provisioning: "When the THINP bit in the NSFEAT field of the Identify Namespace data structure is set to ‘1’, the controller ... shall track the number of allocated blocks in the Namespace Utilization field"
>   b. The SCSi Block Commands reference - 4 section references "Thin provisioned logical units",
>   c. UFS 3.0 spec section 13.3.3 references "Thin provisioning".

When REQ_OP_PROVISION is sent on an already-allocated range of blocks,
are those blocks zeroed? NVMe Write Zeroes with Deallocate=0 works this
way, for example. That behavior is counterintuitive since the operation
name suggests it just affects the logical block's provisioning state,
not the contents of the blocks.

> In all of the above situations, currently the only way for pre-allocating space is to issue writes (or use WRITE_ZEROES/WRITE_SAME). However, that does not scale well with larger pre-allocation sizes. 

What exactly is the issue with WRITE_ZEROES scalability? Are you
referring to cases where the device doesn't support an efficient
WRITE_ZEROES command and actually writes blocks filled with zeroes
instead of updating internal allocation metadata cheaply?

Stefan
--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel

Sarthak Kukreti Sept. 16, 2022, 6:48 p.m. UTC | #2

On Thu, Sep 15, 2022 at 11:10 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>
> On Thu, Sep 15, 2022 at 09:48:18AM -0700, Sarthak Kukreti wrote:
> > From: Sarthak Kukreti <sarthakkukreti@chromium.org>
> >
> > Hi,
> >
> > This patch series is an RFC of a mechanism to pass through provision requests on stacked thinly provisioned storage devices/filesystems.
> >
> > The linux kernel provides several mechanisms to set up thinly provisioned block storage abstractions (eg. dm-thin, loop devices over sparse files), either directly as block devices or backing storage for filesystems. Currently, short of writing data to either the device or filesystem, there is no way for users to pre-allocate space for use in such storage setups. Consider the following use-cases:
> >
> > 1) Suspend-to-disk and resume from a dm-thin device: In order to ensure that the underlying thinpool metadata is not modified during the suspend mechanism, the dm-thin device needs to be fully provisioned.
> > 2) If a filesystem uses a loop device over a sparse file, fallocate() on the filesystem will allocate blocks for files but the underlying sparse file will remain intact.
> > 3) Another example is virtual machine using a sparse file/dm-thin as a storage device; by default, allocations within the VM boundaries will not affect the host.
> > 4) Several storage standards support mechanisms for thin provisioning on real hardware devices. For example:
> >   a. The NVMe spec 1.0b section 2.1.1 loosely talks about thin provisioning: "When the THINP bit in the NSFEAT field of the Identify Namespace data structure is set to ‘1’, the controller ... shall track the number of allocated blocks in the Namespace Utilization field"
> >   b. The SCSi Block Commands reference - 4 section references "Thin provisioned logical units",
> >   c. UFS 3.0 spec section 13.3.3 references "Thin provisioning".
>
> When REQ_OP_PROVISION is sent on an already-allocated range of blocks,
> are those blocks zeroed? NVMe Write Zeroes with Deallocate=0 works this
> way, for example. That behavior is counterintuitive since the operation
> name suggests it just affects the logical block's provisioning state,
> not the contents of the blocks.
>
No, the blocks are not zeroed. The current implementation (in the dm
patch) is to indeed look at the provisioned state of the logical block
and provision if it is unmapped. if the block is already allocated,
REQ_OP_PROVISION should have no effect on the contents of the block.
Similarly, in the file semantics, sending an FALLOC_FL_PROVISION
requests for extents already mapped should not affect the contents in
the extents.

> > In all of the above situations, currently the only way for pre-allocating space is to issue writes (or use WRITE_ZEROES/WRITE_SAME). However, that does not scale well with larger pre-allocation sizes.
>
> What exactly is the issue with WRITE_ZEROES scalability? Are you
> referring to cases where the device doesn't support an efficient
> WRITE_ZEROES command and actually writes blocks filled with zeroes
> instead of updating internal allocation metadata cheaply?
>
Yes. On ChromiumOS, we regularly deal with storage devices that don't
support WRITE_ZEROES or that need to have it disabled, via a quirk,
due to a bug in the vendor's implementation. Using WRITE_ZEROES for
allocation makes the allocation path quite slow for such devices (not
to mention the effect on storage lifetime), so having a separate
provisioning construct is very appealing. Even for devices that do
support an efficient WRITE_ZEROES implementation but don't support
logical provisioning per-se, I suppose that the allocation path might
be a bit faster (the device driver's request queue would report
'max_provision_sectors'=0 and the request would be short circuited
there) although I haven't benchmarked the difference.

Sarthak

> Stefan

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel

Bart Van Assche Sept. 16, 2022, 8:01 p.m. UTC | #3

On 9/16/22 11:48, Sarthak Kukreti wrote:
> Yes. On ChromiumOS, we regularly deal with storage devices that don't
> support WRITE_ZEROES or that need to have it disabled, via a quirk,
> due to a bug in the vendor's implementation. Using WRITE_ZEROES for
> allocation makes the allocation path quite slow for such devices (not
> to mention the effect on storage lifetime), so having a separate
> provisioning construct is very appealing. Even for devices that do
> support an efficient WRITE_ZEROES implementation but don't support
> logical provisioning per-se, I suppose that the allocation path might
> be a bit faster (the device driver's request queue would report
> 'max_provision_sectors'=0 and the request would be short circuited
> there) although I haven't benchmarked the difference.

Some background information about why ChromiumOS uses thin provisioning 
instead of a single filesystem across the entire storage device would be 
welcome. Although UFS devices support thin provisioning I am not aware 
of any use cases in Android that would benefit from UFS thin 
provisioning support.

Thanks,

Bart.

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel

Sarthak Kukreti Sept. 16, 2022, 9:59 p.m. UTC | #4

On Fri, Sep 16, 2022 at 1:01 PM Bart Van Assche <bvanassche@acm.org> wrote:
>
> On 9/16/22 11:48, Sarthak Kukreti wrote:
> > Yes. On ChromiumOS, we regularly deal with storage devices that don't
> > support WRITE_ZEROES or that need to have it disabled, via a quirk,
> > due to a bug in the vendor's implementation. Using WRITE_ZEROES for
> > allocation makes the allocation path quite slow for such devices (not
> > to mention the effect on storage lifetime), so having a separate
> > provisioning construct is very appealing. Even for devices that do
> > support an efficient WRITE_ZEROES implementation but don't support
> > logical provisioning per-se, I suppose that the allocation path might
> > be a bit faster (the device driver's request queue would report
> > 'max_provision_sectors'=0 and the request would be short circuited
> > there) although I haven't benchmarked the difference.
>
> Some background information about why ChromiumOS uses thin provisioning
> instead of a single filesystem across the entire storage device would be
> welcome. Although UFS devices support thin provisioning I am not aware
> of any use cases in Android that would benefit from UFS thin
> provisioning support.
>
Sure (and I'd be happy to put this in the cover letter, if you prefer;
I didn't include it initially, since it seemed orthogonal to the
discussion of the patchset)!

On ChromiumOS, the primary driving force for using thin provisioning
is to have flexible, segmented block storage, both per-user and for
applications/virtual machines with several useful properties, for
example: block-level encrypted user storage, snapshot based A-B
updates for verified content, on-demand partitioning for short-lived
use cases. Several of the other planned use-cases (like verified
content retention over powerwash) require flexible on-demand block
storage that is decoupled from the primary filesystem(s) so that we
can have cryptographic erase for the user partitions and keep the
on-demand, dm-verity backed executables intact.

Best
Sarthak

> Thanks,
>
> Bart.

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel

Darrick J. Wong Sept. 17, 2022, 3:03 a.m. UTC | #5

On Thu, Sep 15, 2022 at 09:48:18AM -0700, Sarthak Kukreti wrote:
> From: Sarthak Kukreti <sarthakkukreti@chromium.org>
> 
> Hi,
> 
> This patch series is an RFC of a mechanism to pass through provision
> requests on stacked thinly provisioned storage devices/filesystems.

[Reflowed text]

> The linux kernel provides several mechanisms to set up thinly
> provisioned block storage abstractions (eg. dm-thin, loop devices over
> sparse files), either directly as block devices or backing storage for
> filesystems. Currently, short of writing data to either the device or
> filesystem, there is no way for users to pre-allocate space for use in
> such storage setups. Consider the following use-cases:
> 
> 1) Suspend-to-disk and resume from a dm-thin device: In order to
> ensure that the underlying thinpool metadata is not modified during
> the suspend mechanism, the dm-thin device needs to be fully
> provisioned.
> 2) If a filesystem uses a loop device over a sparse file, fallocate()
> on the filesystem will allocate blocks for files but the underlying
> sparse file will remain intact.
> 3) Another example is virtual machine using a sparse file/dm-thin as a
> storage device; by default, allocations within the VM boundaries will
> not affect the host.
> 4) Several storage standards support mechanisms for thin provisioning
> on real hardware devices. For example:
>   a. The NVMe spec 1.0b section 2.1.1 loosely talks about thin
>   provisioning: "When the THINP bit in the NSFEAT field of the
>   Identify Namespace data structure is set to ‘1’, the controller ...
>   shall track the number of allocated blocks in the Namespace
>   Utilization field"
>   b. The SCSi Block Commands reference - 4 section references "Thin
>   provisioned logical units",
>   c. UFS 3.0 spec section 13.3.3 references "Thin provisioning".
> 
> In all of the above situations, currently the only way for
> pre-allocating space is to issue writes (or use
> WRITE_ZEROES/WRITE_SAME). However, that does not scale well with
> larger pre-allocation sizes. 
> 
> This patchset introduces primitives to support block-level
> provisioning (note: the term 'provisioning' is used to prevent
> overloading the term 'allocations/pre-allocations') requests across
> filesystems and block devices. This allows fallocate() and file
> creation requests to reserve space across stacked layers of block
> devices and filesystems. Currently, the patchset covers a prototype on
> the device-mapper targets, loop device and ext4, but the same
> mechanism can be extended to other filesystems/block devices as well
> as extended for use with devices in 4 a-c.

If you call REQ_OP_PROVISION on an unmapped LBA range of a block device
and then try to read the provisioned blocks, what do you get?  Zeroes?
Random stale disk contents?

I think I saw elsewhere in the thread that any mapped LBAs within the
provisioning range are left alone (i.e. not zeroed) so I'll proceed on
that basis.

> Patch 1 introduces REQ_OP_PROVISION as a new request type. The
> provision request acts like the inverse of a discard request; instead
> of notifying lower layers that the block range will no longer be used,
> provision acts as a request to lower layers to provision disk space
> for the given block range. Real hardware storage devices will
> currently disable the provisioing capability but for the standards
> listed in 4a.-c., REQ_OP_PROVISION can be overloaded for use as the
> provisioing primitive for future devices.
> 
> Patch 2 implements REQ_OP_PROVISION handling for some of the
> device-mapper targets. This additionally adds support for
> pre-allocating space for thinly provisioned logical volumes via
> fallocate()
> 
> Patch 3 implements the handling for virtio-blk.
> 
> Patch 4 introduces an fallocate() mode (FALLOC_FL_PROVISION) that
> sends a provision request to the underlying block device (and beyond).
> This acts as the primary mechanism for file-level provisioing.

Personally, I think it's well within the definition of fallocate mode==0
(aka preallocate) for XFS to call REQ_OP_PROVISION on the blocks that it
preallocates?  XFS always sets the unwritten flag on the file mapping,
so it doesn't matter if the device provisions space without zeroing the
contents.

That said, if devices are really allowed to expose stale disk blocks
then for blkdev fallocate I think you could get away with reusin
FALLOC_FL_NO_HIDE_STALE instead of introducing a new fallocate flag.

> Patch 5 wires up the loop device handling of REQ_OP_PROVISION.
> 
> Patches 6-8 cover a prototype implementation for ext4, which includes
> wiring up the fallocate() implementation, introducing a filesystem
> level option (called 'provision') to control the default allocation
> behaviour and finally a file level override to retain current
> handling, even on filesystems mounted with 'provision'

Hmm, I'll have a look.

> Testing:
> --------
> - A backport of this patch series was tested on ChromiumOS using a
> 5.10 kernel.
> - File on ext4 on a thin logical volume:
> fallocate(FALLOC_FL_PROVISION) : 4.6s, dd if=/dev/zero of=...: 6 mins.
> 
> TODOs:
> ------
> 1) The stacked block devices (dm-*, loop etc.) currently
> unconditionally pass through provision requests. Add support for
> provision, similar to how discard handling is set up (with options to
> disable, passdown or passthrough requests).
> 2) Blktests and Xfstests for validating provisioning.

Yes....

--D

> --
> dm-devel mailing list
> dm-devel@redhat.com
> https://listman.redhat.com/mailman/listinfo/dm-devel

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel

Sarthak Kukreti Sept. 17, 2022, 7:46 p.m. UTC | #6

On Fri, Sep 16, 2022 at 8:03 PM Darrick J. Wong <djwong@kernel.org> wrote:
>
> On Thu, Sep 15, 2022 at 09:48:18AM -0700, Sarthak Kukreti wrote:
> > From: Sarthak Kukreti <sarthakkukreti@chromium.org>
> >
> > Hi,
> >
> > This patch series is an RFC of a mechanism to pass through provision
> > requests on stacked thinly provisioned storage devices/filesystems.
>
> [Reflowed text]
>
> > The linux kernel provides several mechanisms to set up thinly
> > provisioned block storage abstractions (eg. dm-thin, loop devices over
> > sparse files), either directly as block devices or backing storage for
> > filesystems. Currently, short of writing data to either the device or
> > filesystem, there is no way for users to pre-allocate space for use in
> > such storage setups. Consider the following use-cases:
> >
> > 1) Suspend-to-disk and resume from a dm-thin device: In order to
> > ensure that the underlying thinpool metadata is not modified during
> > the suspend mechanism, the dm-thin device needs to be fully
> > provisioned.
> > 2) If a filesystem uses a loop device over a sparse file, fallocate()
> > on the filesystem will allocate blocks for files but the underlying
> > sparse file will remain intact.
> > 3) Another example is virtual machine using a sparse file/dm-thin as a
> > storage device; by default, allocations within the VM boundaries will
> > not affect the host.
> > 4) Several storage standards support mechanisms for thin provisioning
> > on real hardware devices. For example:
> >   a. The NVMe spec 1.0b section 2.1.1 loosely talks about thin
> >   provisioning: "When the THINP bit in the NSFEAT field of the
> >   Identify Namespace data structure is set to ‘1’, the controller ...
> >   shall track the number of allocated blocks in the Namespace
> >   Utilization field"
> >   b. The SCSi Block Commands reference - 4 section references "Thin
> >   provisioned logical units",
> >   c. UFS 3.0 spec section 13.3.3 references "Thin provisioning".
> >
> > In all of the above situations, currently the only way for
> > pre-allocating space is to issue writes (or use
> > WRITE_ZEROES/WRITE_SAME). However, that does not scale well with
> > larger pre-allocation sizes.
> >
> > This patchset introduces primitives to support block-level
> > provisioning (note: the term 'provisioning' is used to prevent
> > overloading the term 'allocations/pre-allocations') requests across
> > filesystems and block devices. This allows fallocate() and file
> > creation requests to reserve space across stacked layers of block
> > devices and filesystems. Currently, the patchset covers a prototype on
> > the device-mapper targets, loop device and ext4, but the same
> > mechanism can be extended to other filesystems/block devices as well
> > as extended for use with devices in 4 a-c.
>
> If you call REQ_OP_PROVISION on an unmapped LBA range of a block device
> and then try to read the provisioned blocks, what do you get?  Zeroes?
> Random stale disk contents?
>
> I think I saw elsewhere in the thread that any mapped LBAs within the
> provisioning range are left alone (i.e. not zeroed) so I'll proceed on
> that basis.
>
For block devices, I'd say it's definitely possible to get stale data, depending
on the implementation of the allocation layer; for example, with dm-thinpool,
the default setting via using LVM2 tools is to zero out blocks on allocation.
But that's configurable and can be turned off to improve performance.

Similarly, for actual devices that end up supporting thin provisioning, unless
the specification absolutely mandates that an LBA contains zeroes post
allocation, some implementations will definitely miss out on that (probably
similar to the semantics of discard_zeroes_data today). I'm operating under
the assumption that it's possible to get stale data from LBAs allocated using
provision requests at the block layer and trying to see if we can create a
safe default operating model from that.

> > Patch 1 introduces REQ_OP_PROVISION as a new request type. The
> > provision request acts like the inverse of a discard request; instead
> > of notifying lower layers that the block range will no longer be used,
> > provision acts as a request to lower layers to provision disk space
> > for the given block range. Real hardware storage devices will
> > currently disable the provisioing capability but for the standards
> > listed in 4a.-c., REQ_OP_PROVISION can be overloaded for use as the
> > provisioing primitive for future devices.
> >
> > Patch 2 implements REQ_OP_PROVISION handling for some of the
> > device-mapper targets. This additionally adds support for
> > pre-allocating space for thinly provisioned logical volumes via
> > fallocate()
> >
> > Patch 3 implements the handling for virtio-blk.
> >
> > Patch 4 introduces an fallocate() mode (FALLOC_FL_PROVISION) that
> > sends a provision request to the underlying block device (and beyond).
> > This acts as the primary mechanism for file-level provisioing.
>
> Personally, I think it's well within the definition of fallocate mode==0
> (aka preallocate) for XFS to call REQ_OP_PROVISION on the blocks that it
> preallocates?  XFS always sets the unwritten flag on the file mapping,
> so it doesn't matter if the device provisions space without zeroing the
> contents.
>
> That said, if devices are really allowed to expose stale disk blocks
> then for blkdev fallocate I think you could get away with reusin
> FALLOC_FL_NO_HIDE_STALE instead of introducing a new fallocate flag.
>
For filesystems, I think it's reasonable to support the mode if and only if
the filesystem can guarantee that unwritten extents return zero. For instance,
in the current ext4 implementation, the provisioned extents are still marked as
unwritten, which means a read from the file would still show all zeroes (which
I think differs from the original FALLOC_FL_NO_HIDE implementation).

That might be one more reason to keep the mode separate from the regular
modes though; to drive home the point that it is only acceptable under
the above conditions and that there's more to it than just adding
blkdev_issue_provision(..) at the end of fs_fallocate().

Best
Sarthak

> > Patch 5 wires up the loop device handling of REQ_OP_PROVISION.
> >
> > Patches 6-8 cover a prototype implementation for ext4, which includes
> > wiring up the fallocate() implementation, introducing a filesystem
> > level option (called 'provision') to control the default allocation
> > behaviour and finally a file level override to retain current
> > handling, even on filesystems mounted with 'provision'
>
> Hmm, I'll have a look.
>
> > Testing:
> > --------
> > - A backport of this patch series was tested on ChromiumOS using a
> > 5.10 kernel.
> > - File on ext4 on a thin logical volume:
> > fallocate(FALLOC_FL_PROVISION) : 4.6s, dd if=/dev/zero of=...: 6 mins.
> >
> > TODOs:
> > ------
> > 1) The stacked block devices (dm-*, loop etc.) currently
> > unconditionally pass through provision requests. Add support for
> > provision, similar to how discard handling is set up (with options to
> > disable, passdown or passthrough requests).
> > 2) Blktests and Xfstests for validating provisioning.
>
> Yes....
>
> --D
>
> > --
> > dm-devel mailing list
> > dm-devel@redhat.com
> > https://listman.redhat.com/mailman/listinfo/dm-devel

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel

Stefan Hajnoczi Sept. 19, 2022, 4:36 p.m. UTC | #7

On Sat, Sep 17, 2022 at 12:46:33PM -0700, Sarthak Kukreti wrote:
> On Fri, Sep 16, 2022 at 8:03 PM Darrick J. Wong <djwong@kernel.org> wrote:
> >
> > On Thu, Sep 15, 2022 at 09:48:18AM -0700, Sarthak Kukreti wrote:
> > > From: Sarthak Kukreti <sarthakkukreti@chromium.org>
> > >
> > > Hi,
> > >
> > > This patch series is an RFC of a mechanism to pass through provision
> > > requests on stacked thinly provisioned storage devices/filesystems.
> >
> > [Reflowed text]
> >
> > > The linux kernel provides several mechanisms to set up thinly
> > > provisioned block storage abstractions (eg. dm-thin, loop devices over
> > > sparse files), either directly as block devices or backing storage for
> > > filesystems. Currently, short of writing data to either the device or
> > > filesystem, there is no way for users to pre-allocate space for use in
> > > such storage setups. Consider the following use-cases:
> > >
> > > 1) Suspend-to-disk and resume from a dm-thin device: In order to
> > > ensure that the underlying thinpool metadata is not modified during
> > > the suspend mechanism, the dm-thin device needs to be fully
> > > provisioned.
> > > 2) If a filesystem uses a loop device over a sparse file, fallocate()
> > > on the filesystem will allocate blocks for files but the underlying
> > > sparse file will remain intact.
> > > 3) Another example is virtual machine using a sparse file/dm-thin as a
> > > storage device; by default, allocations within the VM boundaries will
> > > not affect the host.
> > > 4) Several storage standards support mechanisms for thin provisioning
> > > on real hardware devices. For example:
> > >   a. The NVMe spec 1.0b section 2.1.1 loosely talks about thin
> > >   provisioning: "When the THINP bit in the NSFEAT field of the
> > >   Identify Namespace data structure is set to ‘1’, the controller ...
> > >   shall track the number of allocated blocks in the Namespace
> > >   Utilization field"
> > >   b. The SCSi Block Commands reference - 4 section references "Thin
> > >   provisioned logical units",
> > >   c. UFS 3.0 spec section 13.3.3 references "Thin provisioning".
> > >
> > > In all of the above situations, currently the only way for
> > > pre-allocating space is to issue writes (or use
> > > WRITE_ZEROES/WRITE_SAME). However, that does not scale well with
> > > larger pre-allocation sizes.
> > >
> > > This patchset introduces primitives to support block-level
> > > provisioning (note: the term 'provisioning' is used to prevent
> > > overloading the term 'allocations/pre-allocations') requests across
> > > filesystems and block devices. This allows fallocate() and file
> > > creation requests to reserve space across stacked layers of block
> > > devices and filesystems. Currently, the patchset covers a prototype on
> > > the device-mapper targets, loop device and ext4, but the same
> > > mechanism can be extended to other filesystems/block devices as well
> > > as extended for use with devices in 4 a-c.
> >
> > If you call REQ_OP_PROVISION on an unmapped LBA range of a block device
> > and then try to read the provisioned blocks, what do you get?  Zeroes?
> > Random stale disk contents?
> >
> > I think I saw elsewhere in the thread that any mapped LBAs within the
> > provisioning range are left alone (i.e. not zeroed) so I'll proceed on
> > that basis.
> >
> For block devices, I'd say it's definitely possible to get stale data, depending
> on the implementation of the allocation layer; for example, with dm-thinpool,
> the default setting via using LVM2 tools is to zero out blocks on allocation.
> But that's configurable and can be turned off to improve performance.
> 
> Similarly, for actual devices that end up supporting thin provisioning, unless
> the specification absolutely mandates that an LBA contains zeroes post
> allocation, some implementations will definitely miss out on that (probably
> similar to the semantics of discard_zeroes_data today). I'm operating under
> the assumption that it's possible to get stale data from LBAs allocated using
> provision requests at the block layer and trying to see if we can create a
> safe default operating model from that.

Please explain the semantics of REQ_OP_PROVISION in the
code/documentation in the next revision.

Thanks,
Stefan
--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel

Christoph Hellwig Sept. 20, 2022, 7:46 a.m. UTC | #8

On Fri, Sep 16, 2022 at 11:48:34AM -0700, Sarthak Kukreti wrote:
> Yes. On ChromiumOS, we regularly deal with storage devices that don't
> support WRITE_ZEROES or that need to have it disabled, via a quirk,
> due to a bug in the vendor's implementation.

So bloody punich the vendors for it.  Unlike most of the Linux community
your actually have purchasing power and you'd help everyone by making
use of that instead adding hacks to upstream.

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel

Daniil Lunev Sept. 20, 2022, 10:17 a.m. UTC | #9

> So bloody punich the vendors for it.  Unlike most of the Linux community
> your actually have purchasing power and you'd help everyone by making
> use of that instead adding hacks to upstream.

Hi Cristoph,
I just want to note that the primitive this patchset introduces would not
map
to WRITE ZERO command in NVMe, but to WRITE UNAVAILABLE in
NVME 2.0 spec, and to UNMAP ANCHORED in SCSI spec.

--Daniil
--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel

Christoph Hellwig Sept. 20, 2022, 11:30 a.m. UTC | #10

On Tue, Sep 20, 2022 at 08:17:10PM +1000, Daniil Lunev wrote:
> to WRITE ZERO command in NVMe, but to WRITE UNAVAILABLE in

There is no such thing as WRITE UNAVAILABLE in NVMe.

> NVME 2.0 spec, and to UNMAP ANCHORED in SCSI spec.

The SCSI anchored LBA state is quite complicated, and in addition
to UNMAP you can also create it using WRITE SAME, which is at least
partially useful, as it allows for sensible initialization pattern.
For the purpose of Linux that woud be 0.

That being siad you still haven't actually explained what problem
you're even trying to solve.

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel

Daniil Lunev Sept. 20, 2022, 9:48 p.m. UTC | #11

> There is no such thing as WRITE UNAVAILABLE in NVMe.
Apologize, that is WRITE UNCORRECTABLE. Chapter 3.2.7 of
NVM Express NVM Command Set Specification 1.0b

> That being siad you still haven't actually explained what problem
> you're even trying to solve.

The specific problem is the following:
* There is an thinpool over a physical device
* There are multiple logical volumes over the thin pool
* Each logical volume has an independent file system and an
  independent application running over it
* Each application is potentially allowed to consume the entirety
  of the disk space - there is no strict size limit for application
* Applications need to pre-allocate space sometime, for which
  they use fallocate. Once the operation succeeded, the application
  assumed the space is guaranteed to be there for it.
* Since filesystems on the volumes are independent, filesystem
  level enforcement of size constraints is impossible and the only
  common level is the thin pool, thus, each fallocate has to find its
  representation in thin pool one way or another - otherwise you
  may end up in the situation, where FS thinks it has allocated space
  but when it tries to actually write it, the thin pool is already
  exhausted.
* Hole-Punching fallocate will not reach the thin pool, so the only
  solution presently is zero-writing pre-allocate.
* Not all storage devices support zero-writing efficiently - apart
  from NVMe being or not being capable of doing efficient write
  zero - changing which is easier said than done, and would take
  years - there are also other types of storage devices that do not
  have WRITE ZERO capability in the first place or have it in a
  peculiar way. And adding custom WRITE ZERO to LVM would be
  arguably a much bigger hack.
* Thus, a provisioning block operation allows an interface specific
  operation that guarantees the presence of the block in the
  mapped space. LVM Thin-pool itself is the primary target for our
  use case but the argument is that this operation maps well to
  other interfaces which allow thinly provisioned units.

--Daniil
--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel

Mike Snitzer Sept. 21, 2022, 3:08 p.m. UTC | #12

On Tue, Sep 20 2022 at  5:48P -0400,
Daniil Lunev <dlunev@google.com> wrote:

> > There is no such thing as WRITE UNAVAILABLE in NVMe.
> Apologize, that is WRITE UNCORRECTABLE. Chapter 3.2.7 of
> NVM Express NVM Command Set Specification 1.0b
> 
> > That being siad you still haven't actually explained what problem
> > you're even trying to solve.
> 
> The specific problem is the following:
> * There is an thinpool over a physical device
> * There are multiple logical volumes over the thin pool
> * Each logical volume has an independent file system and an
>   independent application running over it
> * Each application is potentially allowed to consume the entirety
>   of the disk space - there is no strict size limit for application
> * Applications need to pre-allocate space sometime, for which
>   they use fallocate. Once the operation succeeded, the application
>   assumed the space is guaranteed to be there for it.
> * Since filesystems on the volumes are independent, filesystem
>   level enforcement of size constraints is impossible and the only
>   common level is the thin pool, thus, each fallocate has to find its
>   representation in thin pool one way or another - otherwise you
>   may end up in the situation, where FS thinks it has allocated space
>   but when it tries to actually write it, the thin pool is already
>   exhausted.
> * Hole-Punching fallocate will not reach the thin pool, so the only
>   solution presently is zero-writing pre-allocate.
> * Not all storage devices support zero-writing efficiently - apart
>   from NVMe being or not being capable of doing efficient write
>   zero - changing which is easier said than done, and would take
>   years - there are also other types of storage devices that do not
>   have WRITE ZERO capability in the first place or have it in a
>   peculiar way. And adding custom WRITE ZERO to LVM would be
>   arguably a much bigger hack.
> * Thus, a provisioning block operation allows an interface specific
>   operation that guarantees the presence of the block in the
>   mapped space. LVM Thin-pool itself is the primary target for our
>   use case but the argument is that this operation maps well to
>   other interfaces which allow thinly provisioned units.

Thanks for this overview. Should help level-set others.

Adding fallocate support has been a long-standing dm-thin TODO item
for me. I just never got around to it. So thanks to Sarthak, you and
anyone else who had a hand in developing this.

I had a look at the DM thin implementation and it looks pretty simple
(doesn't require a thin-metadata change, etc).  I'll look closer at
the broader implementation (block, etc) but I'm encouraged by what I'm
seeing.

Mike

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel

Christoph Hellwig Sept. 23, 2022, 8:51 a.m. UTC | #13

On Wed, Sep 21, 2022 at 07:48:50AM +1000, Daniil Lunev wrote:
> > There is no such thing as WRITE UNAVAILABLE in NVMe.
> Apologize, that is WRITE UNCORRECTABLE. Chapter 3.2.7 of
> NVM Express NVM Command Set Specification 1.0b

Write uncorrectable is a very different thing, and the equivalent of the
horribly misnamed SCSI WRITE LONG COMMAND.  It injects an unrecoverable
error, and does not provision anything.

> * Each application is potentially allowed to consume the entirety
>   of the disk space - there is no strict size limit for application
> * Applications need to pre-allocate space sometime, for which
>   they use fallocate. Once the operation succeeded, the application
>   assumed the space is guaranteed to be there for it.
> * Since filesystems on the volumes are independent, filesystem
>   level enforcement of size constraints is impossible and the only
>   common level is the thin pool, thus, each fallocate has to find its
>   representation in thin pool one way or another - otherwise you
>   may end up in the situation, where FS thinks it has allocated space
>   but when it tries to actually write it, the thin pool is already
>   exhausted.
> * Hole-Punching fallocate will not reach the thin pool, so the only
>   solution presently is zero-writing pre-allocate.

To me it sounds like you want a non-thin pool in dm-thin and/or
guaranted space reservations for it.

> * Thus, a provisioning block operation allows an interface specific
>   operation that guarantees the presence of the block in the
>   mapped space. LVM Thin-pool itself is the primary target for our
>   use case but the argument is that this operation maps well to
>   other interfaces which allow thinly provisioned units.

I think where you are trying to go here is badly mistaken.  With flash
(or hard drive SMR) there is no such thing as provisioning LBAs.  Every
write is out of place, and a one time space allocation does not help
you at all.  So fundamentally what you try to here just goes against
the actual physics of modern storage media.  While there are some
layers that keep up a pretence, trying to that an an exposed API
level is a really bad idea.

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel

Mike Snitzer Sept. 23, 2022, 2:08 p.m. UTC | #14

On Fri, Sep 23 2022 at  4:51P -0400,
Christoph Hellwig <hch@infradead.org> wrote:

> On Wed, Sep 21, 2022 at 07:48:50AM +1000, Daniil Lunev wrote:
> > > There is no such thing as WRITE UNAVAILABLE in NVMe.
> > Apologize, that is WRITE UNCORRECTABLE. Chapter 3.2.7 of
> > NVM Express NVM Command Set Specification 1.0b
> 
> Write uncorrectable is a very different thing, and the equivalent of the
> horribly misnamed SCSI WRITE LONG COMMAND.  It injects an unrecoverable
> error, and does not provision anything.
> 
> > * Each application is potentially allowed to consume the entirety
> >   of the disk space - there is no strict size limit for application
> > * Applications need to pre-allocate space sometime, for which
> >   they use fallocate. Once the operation succeeded, the application
> >   assumed the space is guaranteed to be there for it.
> > * Since filesystems on the volumes are independent, filesystem
> >   level enforcement of size constraints is impossible and the only
> >   common level is the thin pool, thus, each fallocate has to find its
> >   representation in thin pool one way or another - otherwise you
> >   may end up in the situation, where FS thinks it has allocated space
> >   but when it tries to actually write it, the thin pool is already
> >   exhausted.
> > * Hole-Punching fallocate will not reach the thin pool, so the only
> >   solution presently is zero-writing pre-allocate.
> 
> To me it sounds like you want a non-thin pool in dm-thin and/or
> guaranted space reservations for it.

What is implemented in this patchset: enablement for dm-thinp to
actually provide guarantees which fallocate requires.

Seems you're getting hung up on the finishing details in HW (details
which are _not_ the point of this patchset).

The proposed changes are in service to _Linux_ code. The patchset
implements the primitive from top (ext4) to bottom (dm-thinp, loop).
It stops short of implementing handling everywhere that'd need it
(e.g. in XFS, etc). But those changes can come as follow-on work once
the primitive is established top to bottom.

But you know all this ;)

> > * Thus, a provisioning block operation allows an interface specific
> >   operation that guarantees the presence of the block in the
> >   mapped space. LVM Thin-pool itself is the primary target for our
> >   use case but the argument is that this operation maps well to
> >   other interfaces which allow thinly provisioned units.
> 
> I think where you are trying to go here is badly mistaken.  With flash
> (or hard drive SMR) there is no such thing as provisioning LBAs.  Every
> write is out of place, and a one time space allocation does not help
> you at all.  So fundamentally what you try to here just goes against
> the actual physics of modern storage media.  While there are some
> layers that keep up a pretence, trying to that an an exposed API
> level is a really bad idea.

This doesn't need to be so feudal.  Reserving an LBA in physical HW
really isn't the point.

Fact remains: an operation that ensures space is actually reserved via
fallocate is long overdue (just because an FS did its job doesn't mean
underlying layers reflect that). And certainly useful, even if "only"
benefiting dm-thinp and the loop driver. Like other block primitives,
REQ_OP_PROVISION is filtered out by block core if the device doesn't
support it.

That said, I agree with Brian Foster that we need really solid
documentation and justification for why fallocate mode=0 cannot be
used (but the case has been made in this thread).

Also, I do see an issue with the implementation (relative to stacked
devices): dm_table_supports_provision() is too myopic about DM. It
needs to go a step further and verify that some layer in the stack
actually services REQ_OP_PROVISION. Will respond to DM patch too.

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel

Sarthak Kukreti Dec. 29, 2022, 8:17 a.m. UTC | #15

On Fri, Sep 23, 2022 at 7:08 AM Mike Snitzer <snitzer@redhat.com> wrote:
>
> On Fri, Sep 23 2022 at  4:51P -0400,
> Christoph Hellwig <hch@infradead.org> wrote:
>
> > On Wed, Sep 21, 2022 at 07:48:50AM +1000, Daniil Lunev wrote:
> > > > There is no such thing as WRITE UNAVAILABLE in NVMe.
> > > Apologize, that is WRITE UNCORRECTABLE. Chapter 3.2.7 of
> > > NVM Express NVM Command Set Specification 1.0b
> >
> > Write uncorrectable is a very different thing, and the equivalent of the
> > horribly misnamed SCSI WRITE LONG COMMAND.  It injects an unrecoverable
> > error, and does not provision anything.
> >
> > > * Each application is potentially allowed to consume the entirety
> > >   of the disk space - there is no strict size limit for application
> > > * Applications need to pre-allocate space sometime, for which
> > >   they use fallocate. Once the operation succeeded, the application
> > >   assumed the space is guaranteed to be there for it.
> > > * Since filesystems on the volumes are independent, filesystem
> > >   level enforcement of size constraints is impossible and the only
> > >   common level is the thin pool, thus, each fallocate has to find its
> > >   representation in thin pool one way or another - otherwise you
> > >   may end up in the situation, where FS thinks it has allocated space
> > >   but when it tries to actually write it, the thin pool is already
> > >   exhausted.
> > > * Hole-Punching fallocate will not reach the thin pool, so the only
> > >   solution presently is zero-writing pre-allocate.
> >
> > To me it sounds like you want a non-thin pool in dm-thin and/or
> > guaranted space reservations for it.
>
> What is implemented in this patchset: enablement for dm-thinp to
> actually provide guarantees which fallocate requires.
>
> Seems you're getting hung up on the finishing details in HW (details
> which are _not_ the point of this patchset).
>
> The proposed changes are in service to _Linux_ code. The patchset
> implements the primitive from top (ext4) to bottom (dm-thinp, loop).
> It stops short of implementing handling everywhere that'd need it
> (e.g. in XFS, etc). But those changes can come as follow-on work once
> the primitive is established top to bottom.
>
> But you know all this ;)
>
> > > * Thus, a provisioning block operation allows an interface specific
> > >   operation that guarantees the presence of the block in the
> > >   mapped space. LVM Thin-pool itself is the primary target for our
> > >   use case but the argument is that this operation maps well to
> > >   other interfaces which allow thinly provisioned units.
> >
> > I think where you are trying to go here is badly mistaken.  With flash
> > (or hard drive SMR) there is no such thing as provisioning LBAs.  Every
> > write is out of place, and a one time space allocation does not help
> > you at all.  So fundamentally what you try to here just goes against
> > the actual physics of modern storage media.  While there are some
> > layers that keep up a pretence, trying to that an an exposed API
> > level is a really bad idea.
>
> This doesn't need to be so feudal.  Reserving an LBA in physical HW
> really isn't the point.
>
> Fact remains: an operation that ensures space is actually reserved via
> fallocate is long overdue (just because an FS did its job doesn't mean
> underlying layers reflect that). And certainly useful, even if "only"
> benefiting dm-thinp and the loop driver. Like other block primitives,
> REQ_OP_PROVISION is filtered out by block core if the device doesn't
> support it.
>
> That said, I agree with Brian Foster that we need really solid
> documentation and justification for why fallocate mode=0 cannot be
> used (but the case has been made in this thread).
>
> Also, I do see an issue with the implementation (relative to stacked
> devices): dm_table_supports_provision() is too myopic about DM. It
> needs to go a step further and verify that some layer in the stack
> actually services REQ_OP_PROVISION. Will respond to DM patch too.
>
Thanks all for the suggestions and feedback! I just posted v2 (more
than a bit belatedly) on the various mailing lists with the relevant
fixes, documentation and some benchmarks on performance.

Best
Sarthak

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel

[RFC,0/8] Introduce provisioning primitives for thinly provisioned storage

Message

Comments