
dax: Allow block size > PAGE_SIZE

Message ID: 20241101-dax-page-size-v1-1-eedbd0c6b08f@asahilina.net

Commit Message

Asahi Lina Nov. 1, 2024, 12:22 p.m. UTC
For virtio-dax, the file/FS blocksize is irrelevant. FUSE always uses
large DAX blocks (2MiB), which will work with all host page sizes. Since
we are mapping files into the DAX window on the host, the underlying
block size of the filesystem and its block device (if any) are
meaningless.

For real devices with DAX, the only requirement should be that the FS
block size is *at least* as large as PAGE_SIZE, to ensure that at least
whole pages can be mapped out of the device contiguously.

Fixes warning when using virtio-dax on a 4K guest with a 16K host,
backed by tmpfs (which sets blksz == PAGE_SIZE on the host).

Signed-off-by: Asahi Lina <lina@asahilina.net>
---
 fs/dax.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)


---
base-commit: 81983758430957d9a5cb3333fe324fd70cf63e7e
change-id: 20241101-dax-page-size-83a1073b4e1b

Cheers,
~~ Lina
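As a sketch of the behaviour change described in the commit message, the scenario can be reduced to a few constants. This is illustrative userspace code, not kernel code; the constant names are made up for the example:

```c
#include <assert.h>

/* Illustrative constants: a 4K-page guest (PAGE_SHIFT == 12) using
 * virtio-dax backed by host tmpfs, which reports the host's 16K page
 * size as the block size, so the guest inode gets i_blkbits == 14. */
#define GUEST_PAGE_SHIFT   12
#define HOST_TMPFS_BLKBITS 14

/* Old check in dax_writeback_mapping_range(): any mismatch WARNs. */
static int old_check_warns(unsigned int i_blkbits)
{
    return i_blkbits != GUEST_PAGE_SHIFT;
}

/* Relaxed check from this patch: only blocksize < pagesize WARNs. */
static int new_check_warns(unsigned int i_blkbits)
{
    return i_blkbits < GUEST_PAGE_SHIFT;
}
```

With the old check the 16K-host/4K-guest combination trips the WARN; with the relaxed check only a block size smaller than the page size does.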

Comments

Jan Kara Nov. 4, 2024, 10:57 a.m. UTC | #1
On Fri 01-11-24 21:22:31, Asahi Lina wrote:
> For virtio-dax, the file/FS blocksize is irrelevant. FUSE always uses
> large DAX blocks (2MiB), which will work with all host page sizes. Since
> we are mapping files into the DAX window on the host, the underlying
> block size of the filesystem and its block device (if any) are
> meaningless.
> 
> For real devices with DAX, the only requirement should be that the FS
> block size is *at least* as large as PAGE_SIZE, to ensure that at least
> whole pages can be mapped out of the device contiguously.
> 
> Fixes warning when using virtio-dax on a 4K guest with a 16K host,
> backed by tmpfs (which sets blksz == PAGE_SIZE on the host).
> 
> Signed-off-by: Asahi Lina <lina@asahilina.net>
> ---
>  fs/dax.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

Well, I don't quite understand how just relaxing the check is enough. I
guess it may work with virtiofs (I don't know enough about virtiofs to
really tell either way) but for ordinary DAX filesystem it would be
seriously wrong if DAX was used with blocksize > pagesize as multiple
mapping entries could be pointing to the same PFN which is going to have
weird results. If virtiofs can actually map 4k subpages out of 16k page on
host (and generally perform 4k granular tracking etc.), it would seem more
appropriate if virtiofs actually exposed the filesystem 4k block size instead
of 16k blocksize? Or am I missing something?

								Honza

> diff --git a/fs/dax.c b/fs/dax.c
> index c62acd2812f8d4981aaba82acfeaf972f555362a..406fb75bdbe9d17a6e4bf3d4cb92683e90f05910 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -1032,7 +1032,7 @@ int dax_writeback_mapping_range(struct address_space *mapping,
>  	int ret = 0;
>  	unsigned int scanned = 0;
>  
> -	if (WARN_ON_ONCE(inode->i_blkbits != PAGE_SHIFT))
> +	if (WARN_ON_ONCE(inode->i_blkbits < PAGE_SHIFT))
>  		return -EIO;
>  
>  	if (mapping_empty(mapping) || wbc->sync_mode != WB_SYNC_ALL)
> 
> ---
> base-commit: 81983758430957d9a5cb3333fe324fd70cf63e7e
> change-id: 20241101-dax-page-size-83a1073b4e1b
> 
> Cheers,
> ~~ Lina
>
Asahi Lina Nov. 4, 2024, 3:31 p.m. UTC | #2
On 11/4/24 7:57 PM, Jan Kara wrote:
> On Fri 01-11-24 21:22:31, Asahi Lina wrote:
>> For virtio-dax, the file/FS blocksize is irrelevant. FUSE always uses
>> large DAX blocks (2MiB), which will work with all host page sizes. Since
>> we are mapping files into the DAX window on the host, the underlying
>> block size of the filesystem and its block device (if any) are
>> meaningless.
>>
>> For real devices with DAX, the only requirement should be that the FS
>> block size is *at least* as large as PAGE_SIZE, to ensure that at least
>> whole pages can be mapped out of the device contiguously.
>>
>> Fixes warning when using virtio-dax on a 4K guest with a 16K host,
>> backed by tmpfs (which sets blksz == PAGE_SIZE on the host).
>>
>> Signed-off-by: Asahi Lina <lina@asahilina.net>
>> ---
>>  fs/dax.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> Well, I don't quite understand how just relaxing the check is enough. I
> guess it may work with virtiofs (I don't know enough about virtiofs to
> really tell either way) but for ordinary DAX filesystem it would be
> seriously wrong if DAX was used with blocksize > pagesize as multiple
> mapping entries could be pointing to the same PFN which is going to have
> weird results.

Isn't that generally possible by just mapping the same file multiple
times? Why would that be an issue?

Of course having a block size smaller than the page size is never going
to work because you would not be able to map single blocks out of files
directly. But I don't see why a larger block size would cause any
issues. You'd just use several pages to map a single filesystem block.
For example, if the block size is 16K and the page size is 4K, then a
single file block would be DAX mapped as four contiguous 4K pages in
both physical and virtual memory.
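The arithmetic in that example can be sketched as follows (illustrative names, not kernel API):

```c
#include <assert.h>

/* A 16K filesystem block on a 4K-page system spans four physically
 * contiguous pages. */
#define FS_BLOCK_SHIFT 14   /* 16K filesystem block */
#define EX_PAGE_SHIFT  12   /* 4K page */

static unsigned long pages_per_fs_block(void)
{
    return 1UL << (FS_BLOCK_SHIFT - EX_PAGE_SHIFT);
}

/* pfn of the n-th page of a block whose first page is first_pfn;
 * because the pages are contiguous, it is just an offset. */
static unsigned long fs_block_page_pfn(unsigned long first_pfn, unsigned int n)
{
    return first_pfn + n;
}
```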

> If virtiofs can actually map 4k subpages out of 16k page on
> host (and generally perform 4k granular tracking etc.), it would seem more
> appropriate if virtiofs actually exposed the filesystem 4k block size instead
> of 16k blocksize? Or am I missing something?

virtiofs itself on the guest does 2MiB mappings into the SHM region, and
then the guest is free to map blocks out of those mappings. So as long
as the guest page size is less than 2MiB, it doesn't matter, since all
files will be aligned in physical memory to that block size. It behaves
as if the filesystem block size is 2MiB from the point of view of the
guest regardless of the actual block size. For example, if the host page
size is 16K, the guest will request a 2MiB mapping of a file, which the
VMM will satisfy by mmapping 128 16K pages from its page cache (at
arbitrary physical memory addresses) into guest "physical" memory as one
contiguous block. Then the guest will see the whole 2MiB mapping as
contiguous, even though it isn't in physical RAM, and it can use any
page granularity it wants (that is supported by the architecture) to map
it to a userland process.
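The window arithmetic described above can be sketched like this (illustrative helper, not virtiofs code):

```c
#include <assert.h>

/* virtiofs maps files into the DAX window in 2MiB chunks; the VMM
 * satisfies each chunk with however many host pages that takes. */
#define DAX_CHUNK_SHIFT 21   /* 2MiB mapping granularity */

static unsigned long host_pages_per_chunk(unsigned int host_page_shift)
{
    return 1UL << (DAX_CHUNK_SHIFT - host_page_shift);
}
```

A 16K host (shift 14) backs each 2MiB chunk with 128 pages; a 4K host with 512.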

> 
> 								Honza
> 
>> diff --git a/fs/dax.c b/fs/dax.c
>> index c62acd2812f8d4981aaba82acfeaf972f555362a..406fb75bdbe9d17a6e4bf3d4cb92683e90f05910 100644
>> --- a/fs/dax.c
>> +++ b/fs/dax.c
>> @@ -1032,7 +1032,7 @@ int dax_writeback_mapping_range(struct address_space *mapping,
>>  	int ret = 0;
>>  	unsigned int scanned = 0;
>>  
>> -	if (WARN_ON_ONCE(inode->i_blkbits != PAGE_SHIFT))
>> +	if (WARN_ON_ONCE(inode->i_blkbits < PAGE_SHIFT))
>>  		return -EIO;
>>  
>>  	if (mapping_empty(mapping) || wbc->sync_mode != WB_SYNC_ALL)
>>
>> ---
>> base-commit: 81983758430957d9a5cb3333fe324fd70cf63e7e
>> change-id: 20241101-dax-page-size-83a1073b4e1b
>>
>> Cheers,
>> ~~ Lina
>>

~~ Lina
Dave Chinner Nov. 4, 2024, 10:16 p.m. UTC | #3
On Tue, Nov 05, 2024 at 12:31:22AM +0900, Asahi Lina wrote:
> 
> 
> On 11/4/24 7:57 PM, Jan Kara wrote:
> > On Fri 01-11-24 21:22:31, Asahi Lina wrote:
> >> For virtio-dax, the file/FS blocksize is irrelevant. FUSE always uses
> >> large DAX blocks (2MiB), which will work with all host page sizes. Since
> >> we are mapping files into the DAX window on the host, the underlying
> >> block size of the filesystem and its block device (if any) are
> >> meaningless.
> >>
> >> For real devices with DAX, the only requirement should be that the FS
> >> block size is *at least* as large as PAGE_SIZE, to ensure that at least
> >> whole pages can be mapped out of the device contiguously.
> >>
> >> Fixes warning when using virtio-dax on a 4K guest with a 16K host,
> >> backed by tmpfs (which sets blksz == PAGE_SIZE on the host).
> >>
> >> Signed-off-by: Asahi Lina <lina@asahilina.net>
> >> ---
> >>  fs/dax.c | 2 +-
> >>  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > Well, I don't quite understand how just relaxing the check is enough. I
> > guess it may work with virtiofs (I don't know enough about virtiofs to
> > really tell either way) but for ordinary DAX filesystem it would be
> > seriously wrong if DAX was used with blocksize > pagesize as multiple
> > mapping entries could be pointing to the same PFN which is going to have
> > weird results.
> 
> Isn't that generally possible by just mapping the same file multiple
> times? Why would that be an issue?

I think what Jan is talking about is having multiple inode->i_mapping
entries point to the same pfn, not multiple vm mapped regions
pointing at the same file offset....

> Of course having a block size smaller than the page size is never going
> to work because you would not be able to map single blocks out of files
> directly. But I don't see why a larger block size would cause any
> issues. You'd just use several pages to map a single filesystem block.

If only it were that simple.....

> For example, if the block size is 16K and the page size is 4K, then a
> single file block would be DAX mapped as four contiguous 4K pages in
> both physical and virtual memory.

Up until 6.12, filesystems on linux did not support block size >
page size. This was a constraint of the page cache implementation
being based around the xarray indexing being tightly tied to
PAGE_SIZE granularity indexing. Folios and large folio support
provided the infrastructure to allow indexing to increase to order-N
based index granularity. It's only taken 20 years to get a solution
to this problem merged, but it's finally there now.

Unfortunately, the DAX infrastructure is independent of the page
cache but is also tightly tied to PAGE_SIZE based inode->i_mapping
index granularity. In a way, this is even more fundamental than the
page cache issues we had to solve. That's because we don't have
folios with their own locks and size tracking. In DAX, we use the
inode->i_mapping xarray entry for a given file offset to -serialise
access to the backing pfn- via lock bits held in the xarray entry.
We also encode the size of the dax entry in bits held in the xarray
entry.

The filesystem needs to track dirty state with filesystem block
granularity. Operations on filesystem blocks (e.g. partial writes,
page faults) need to be co-ordinated across the entire filesystem
block. This means we have to be able to lock a single filesystem
block whilst we are doing instantiation, sub-block zeroing, etc.

Large folio support in the page cache provided this "single tracking
object for a > PAGE_SIZE range" support needed to allow fsb >
page_size in filesystems. The large folio spans the entire
filesystem block, providing a single serialisation and state
tracking for all the page cache operations needing to be done on
that filesystem block.

The DAX infrastructure needs the same changes for fsb > page size
support. We have a limited number bits we can use for DAX entry
state:

/*
 * DAX pagecache entries use XArray value entries so they can't be mistaken
 * for pages.  We use one bit for locking, one bit for the entry size (PMD)
 * and two more to tell us if the entry is a zero page or an empty entry that
 * is just used for locking.  In total four special bits.
 *
 * If the PMD bit isn't set the entry has size PAGE_SIZE, and if the ZERO_PAGE
 * and EMPTY bits aren't set the entry is a normal DAX entry with a filesystem
 * block allocation.
 */
#define DAX_SHIFT       (4)
#define DAX_LOCKED      (1UL << 0)
#define DAX_PMD         (1UL << 1)
#define DAX_ZERO_PAGE   (1UL << 2)
#define DAX_EMPTY       (1UL << 3)

I *think* that we have at most PAGE_SHIFT worth of bits we can
use because we only store the pfn part of the pfn_t in the dax
entry. There are PAGE_SHIFT high bits in the pfn_t that hold
pfn state that we mask out.

Hence I think we can easily steal another 3 bits for storing an
order - orders 0-4 are needed (3 bits) for up to 64kB on 4kB
PAGE_SIZE - so I think this is a solvable problem. There's a lot
more to it than "just use several pages to map to a single
filesystem block", though.....
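A hypothetical sketch of the entry layout Dave suggests (not merged code; all names here are invented for illustration): keep the four existing special bits and steal three more for an order field, so one xarray entry can describe 2^order pages.

```c
#include <assert.h>

#define SKETCH_DAX_SHIFT  (7)          /* was 4; 4 flag bits + 3 order bits */
#define SKETCH_DAX_LOCKED (1UL << 0)
#define SKETCH_DAX_ORDER  (7UL << 4)   /* orders 0-4: 4K..64K on 4K pages */

static unsigned long sketch_make_entry(unsigned long pfn, unsigned int order)
{
    return (pfn << SKETCH_DAX_SHIFT) | ((unsigned long)order << 4);
}

static unsigned int sketch_entry_order(unsigned long entry)
{
    return (unsigned int)((entry & SKETCH_DAX_ORDER) >> 4);
}

static unsigned long sketch_entry_pfn(unsigned long entry)
{
    return entry >> SKETCH_DAX_SHIFT;
}
```

This only demonstrates the bit budget; the real work, as Dave notes, is the serialisation and state tracking built on top of such entries.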

> > If virtiofs can actually map 4k subpages out of 16k page on
> > host (and generally perform 4k granular tracking etc.), it would seem more
> > appropriate if virtiofs actually exposed the filesystem 4k block size instead
> > of 16k blocksize? Or am I missing something?
> 
> virtiofs itself on the guest does 2MiB mappings into the SHM region, and
> then the guest is free to map blocks out of those mappings. So as long
> as the guest page size is less than 2MiB, it doesn't matter, since all
> files will be aligned in physical memory to that block size. It behaves
> as if the filesystem block size is 2MiB from the point of view of the
> guest regardless of the actual block size. For example, if the host page
> size is 16K, the guest will request a 2MiB mapping of a file, which the
> VMM will satisfy by mmapping 128 16K pages from its page cache (at
> arbitrary physical memory addresses) into guest "physical" memory as one
> contiguous block. Then the guest will see the whole 2MiB mapping as
> contiguous, even though it isn't in physical RAM, and it can use any
> page granularity it wants (that is supported by the architecture) to map
> it to a userland process.

Clearly I'm missing something important because, from this
description, I honestly don't know which mapping is actually using
DAX.

Can you draw out the virtiofs stack from userspace in the guest down
to storage in the host so dumb people like myself know exactly where
what is being directly accessed and how?

-Dave.
Asahi Lina Nov. 6, 2024, 10:55 a.m. UTC | #4
On 11/5/24 7:16 AM, Dave Chinner wrote:
> On Tue, Nov 05, 2024 at 12:31:22AM +0900, Asahi Lina wrote:
>>
>>
>> On 11/4/24 7:57 PM, Jan Kara wrote:
>>> On Fri 01-11-24 21:22:31, Asahi Lina wrote:
>>>> For virtio-dax, the file/FS blocksize is irrelevant. FUSE always uses
>>>> large DAX blocks (2MiB), which will work with all host page sizes. Since
>>>> we are mapping files into the DAX window on the host, the underlying
>>>> block size of the filesystem and its block device (if any) are
>>>> meaningless.
>>>>
>>>> For real devices with DAX, the only requirement should be that the FS
>>>> block size is *at least* as large as PAGE_SIZE, to ensure that at least
>>>> whole pages can be mapped out of the device contiguously.
>>>>
>>>> Fixes warning when using virtio-dax on a 4K guest with a 16K host,
>>>> backed by tmpfs (which sets blksz == PAGE_SIZE on the host).
>>>>
>>>> Signed-off-by: Asahi Lina <lina@asahilina.net>
>>>> ---
>>>>  fs/dax.c | 2 +-
>>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> Well, I don't quite understand how just relaxing the check is enough. I
>>> guess it may work with virtiofs (I don't know enough about virtiofs to
>>> really tell either way) but for ordinary DAX filesystem it would be
>>> seriously wrong if DAX was used with blocksize > pagesize as multiple
>>> mapping entries could be pointing to the same PFN which is going to have
>>> weird results.
>>
>> Isn't that generally possible by just mapping the same file multiple
>> times? Why would that be an issue?
> 
> I think what Jan is talking about is having multiple inode->i_mapping
> entries point to the same pfn, not multiple vm mapped regions
> pointing at the same file offset....
> 
>> Of course having a block size smaller than the page size is never going
>> to work because you would not be able to map single blocks out of files
>> directly. But I don't see why a larger block size would cause any
>> issues. You'd just use several pages to map a single filesystem block.
> 
> If only it were that simple.....
> 
>> For example, if the block size is 16K and the page size is 4K, then a
>> single file block would be DAX mapped as four contiguous 4K pages in
>> both physical and virtual memory.
> 
> Up until 6.12, filesystems on linux did not support block size >
> page size. This was a constraint of the page cache implementation
> being based around the xarray indexing being tightly tied to
> PAGE_SIZE granularity indexing. Folios and large folio support
> provided the infrastructure to allow indexing to increase to order-N
> based index granularity. It's only taken 20 years to get a solution
> to this problem merged, but it's finally there now.

Right, but I thought that was already enforced at the filesystem level.
I don't understand why the actual DAX infrastructure would care...

Some FSes do already support *smaller* block size than page size (e.g.
btrfs), but obviously that case is never going to work with DAX.

> Unfortunately, the DAX infrastructure is independent of the page
> cache but is also tightly tied to PAGE_SIZE based inode->i_mapping
> index granularity. In a way, this is even more fundamental than the
> page cache issues we had to solve. That's because we don't have
> folios with their own locks and size tracking. In DAX, we use the
> inode->i_mapping xarray entry for a given file offset to -serialise
> access to the backing pfn- via lock bits held in the xarray entry.
> We also encode the size of the dax entry in bits held in the xarray
> entry.
> 
> The filesystem needs to track dirty state with filesystem block
> granularity. Operations on filesystem blocks (e.g. partial writes,
> page faults) need to be co-ordinated across the entire filesystem
> block. This means we have to be able to lock a single filesystem
> block whilst we are doing instantiation, sub-block zeroing, etc.

Ah, so it's about locking? I had a feeling that might be the case...

> Large folio support in the page cache provided this "single tracking
> object for a > PAGE_SIZE range" support needed to allow fsb >
> page_size in filesystems. The large folio spans the entire
> filesystem block, providing a single serialisation and state
> tracking for all the page cache operations needing to be done on
> that filesystem block.
> 
> The DAX infrastructure needs the same changes for fsb > page size
> support. We have a limited number bits we can use for DAX entry
> state:
> 
> /*
>  * DAX pagecache entries use XArray value entries so they can't be mistaken
>  * for pages.  We use one bit for locking, one bit for the entry size (PMD)
>  * and two more to tell us if the entry is a zero page or an empty entry that
>  * is just used for locking.  In total four special bits.
>  *
>  * If the PMD bit isn't set the entry has size PAGE_SIZE, and if the ZERO_PAGE
>  * and EMPTY bits aren't set the entry is a normal DAX entry with a filesystem
>  * block allocation.
>  */
> #define DAX_SHIFT       (4)
> #define DAX_LOCKED      (1UL << 0)
> #define DAX_PMD         (1UL << 1)
> #define DAX_ZERO_PAGE   (1UL << 2)
> #define DAX_EMPTY       (1UL << 3)
> 
> I *think* that we have at most PAGE_SHIFT worth of bits we can
> use because we only store the pfn part of the pfn_t in the dax
> entry. There are PAGE_SHIFT high bits in the pfn_t that hold
> pfn state that we mask out.
> 
> Hence I think we can easily steal another 3 bits for storing an
> order - orders 0-4 are needed (3 bits) for up to 64kB on 4kB
> PAGE_SIZE - so I think this is a solvable problem. There's a lot
> more to it than "just use several pages to map to a single
> filesystem block", though.....

Honestly, this is all quite over my head... my use case is virtiofs,
which I think is quite different to running a filesystem on bare-metal
DAX. It's starting to sound like we should perhaps just gate off the
check for virtiofs only?

> 
>>> If virtiofs can actually map 4k subpages out of 16k page on
>>> host (and generally perform 4k granular tracking etc.), it would seem more
>>> appropriate if virtiofs actually exposed the filesystem 4k block size instead
>>> of 16k blocksize? Or am I missing something?
>>
>> virtiofs itself on the guest does 2MiB mappings into the SHM region, and
>> then the guest is free to map blocks out of those mappings. So as long
>> as the guest page size is less than 2MiB, it doesn't matter, since all
>> files will be aligned in physical memory to that block size. It behaves
>> as if the filesystem block size is 2MiB from the point of view of the
>> guest regardless of the actual block size. For example, if the host page
>> size is 16K, the guest will request a 2MiB mapping of a file, which the
>> VMM will satisfy by mmapping 128 16K pages from its page cache (at
>> arbitrary physical memory addresses) into guest "physical" memory as one
>> contiguous block. Then the guest will see the whole 2MiB mapping as
>> contiguous, even though it isn't in physical RAM, and it can use any
>> page granularity it wants (that is supported by the architecture) to map
>> it to a userland process.
> 
> Clearly I'm missing something important because, from this
> description, I honestly don't know which mapping is actually using
> DAX.
> 
> Can you draw out the virtiofs stack from userspace in the guest down
> to storage in the host so dumb people like myself know exactly where
> what is being directly accessed and how?

I'm not familiar with all of the details, but essentially virtiofs is
FUSE backed by a virtio device instead of userspace, plus the extra DAX
mapping stuff. Since it's not a real filesystem backed by a block
device, it has no significant concept of block size itself. i_blkbits
comes from the st_blksize of the inode stat, which in our case is passed
through from the underlying filesystem backing the virtiofs in the host
(but it could be anything, nothing says virtiofs has to be backed by a
real kernel FS in the host).

So as a baseline, virtiofs is just FUSE and block size doesn't matter
since all the non-mmap filesystem APIs shouldn't care about block size
(other than for optimization reasons and issues with torn racy writes).
The guest should be able to pretend the block size is 4K for FS/VM
purposes even if it's 16K in the host, and track everything in the page
cache and DAX infrastructure in terms of 4K blocks. As far as I know
there is no operation in plain FUSE that actually cares about the block
size itself.

So then there's DAX/mmap. When DAX is enabled, FUSE can issue
FUSE_SETUPMAPPING and FUSE_REMOVEMAPPING opcodes. These request that a
range of a file be mapped into the device memory region used by
virtiofs. When the VMM receives those, it will use mmap to map the
backing file into the guest's virtio device memory window, and then the
guest can use DAX to directly access those pages and allow userspace
processes to mmap them directly. This means that mmaps are coherent
between processes on the guest and the host (or in another guest), which
is the main reason we're doing this.

If you look at fs/fuse/dax.c, you'll see that FUSE_DAX_SHIFT is 21. This
means that the FUSE code only ever issues
FUSE_SETUPMAPPING/FUSE_REMOVEMAPPING opcodes with offsets/lengths at
2MiB granularity within files. So, regardless of the underlying
filesystem block size in the host (if there is one at all), the guest
will always see aligned 2MiB blocks of files available in its virtio
device region, similar to the hypothetical case of an actual
block-backed DAX filesystem with a 2MiB allocation block size.
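The alignment that falls out of FUSE_DAX_SHIFT can be sketched as follows. FUSE_DAX_SHIFT really is 21 in fs/fuse/dax.c; the helper below is only an illustrative sketch, not the kernel's code:

```c
#include <assert.h>

#define FUSE_DAX_SHIFT 21
#define FUSE_DAX_SZ    (1UL << FUSE_DAX_SHIFT)   /* 2MiB */

/* Start of the 2MiB mapping chunk containing a given file offset. */
static unsigned long long fuse_dax_chunk_start(unsigned long long off)
{
    return off & ~(unsigned long long)(FUSE_DAX_SZ - 1);
}
```

So every FUSE_SETUPMAPPING/FUSE_REMOVEMAPPING range the guest issues starts on a 2MiB boundary within the file.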

We could cap st_blksize in the VMM to 4K, I guess? I don't know if that
would make more sense than removing the kernel check. On one hand, that
might result in less optimized I/O if userspace then does 4K writes. On
the other hand, if we report st_blksize as 16K to userspace then I guess
it could assume concurrent 16K writes cannot be torn, which is not the
case if the guest is using 4K pages and page cache blocks (at least not
until all the folio stuff is worked out for blocks > page size in both
the page cache and DAX layers).

This WARN still feels like the wrong thing, though. Right now it is the
only thing in DAX code complaining on a page size/block size mismatch
(at least for virtiofs). If this is so important, I feel like there
should be a higher level check elsewhere, like something happening at
mount time or on file open. It should actually cause the operations to
fail cleanly.

~~ Lina
Jan Kara Nov. 6, 2024, 12:12 p.m. UTC | #5
On Wed 06-11-24 19:55:23, Asahi Lina wrote:
> On 11/5/24 7:16 AM, Dave Chinner wrote:
> > On Tue, Nov 05, 2024 at 12:31:22AM +0900, Asahi Lina wrote:
> > Unfortunately, the DAX infrastructure is independent of the page
> > cache but is also tightly tied to PAGE_SIZE based inode->i_mapping
> > index granularity. In a way, this is even more fundamental than the
> > page cache issues we had to solve. That's because we don't have
> > folios with their own locks and size tracking. In DAX, we use the
> > inode->i_mapping xarray entry for a given file offset to -serialise
> > access to the backing pfn- via lock bits held in the xarray entry.
> > We also encode the size of the dax entry in bits held in the xarray
> > entry.
> > 
> > The filesystem needs to track dirty state with filesystem block
> > granularity. Operations on filesystem blocks (e.g. partial writes,
> > page faults) need to be co-ordinated across the entire filesystem
> > block. This means we have to be able to lock a single filesystem
> > block whilst we are doing instantiation, sub-block zeroing, etc.
> 
> Ah, so it's about locking? I had a feeling that might be the case...

Yes. About locking and general state tracking.

> > Large folio support in the page cache provided this "single tracking
> > object for a > PAGE_SIZE range" support needed to allow fsb >
> > page_size in filesystems. The large folio spans the entire
> > filesystem block, providing a single serialisation and state
> > tracking for all the page cache operations needing to be done on
> > that filesystem block.
> > 
> > The DAX infrastructure needs the same changes for fsb > page size
> > support. We have a limited number bits we can use for DAX entry
> > state. We have a limited number of bits we can use for DAX entry
> > 
> > /*
> >  * DAX pagecache entries use XArray value entries so they can't be mistaken
> >  * for pages.  We use one bit for locking, one bit for the entry size (PMD)
> >  * and two more to tell us if the entry is a zero page or an empty entry that
> >  * is just used for locking.  In total four special bits.
> >  *
> >  * If the PMD bit isn't set the entry has size PAGE_SIZE, and if the ZERO_PAGE
> >  * and EMPTY bits aren't set the entry is a normal DAX entry with a filesystem
> >  * block allocation.
> >  */
> > #define DAX_SHIFT       (4)
> > #define DAX_LOCKED      (1UL << 0)
> > #define DAX_PMD         (1UL << 1)
> > #define DAX_ZERO_PAGE   (1UL << 2)
> > #define DAX_EMPTY       (1UL << 3)
> > 
> > I *think* that we have at most PAGE_SHIFT worth of bits we can
> > use because we only store the pfn part of the pfn_t in the dax
> > entry. There are PAGE_SHIFT high bits in the pfn_t that hold
> > pfn state that we mask out.
> > 
> > Hence I think we can easily steal another 3 bits for storing an
> > order - orders 0-4 are needed (3 bits) for up to 64kB on 4kB
> > PAGE_SIZE - so I think this is a solvable problem. There's a lot
> > more to it than "just use several pages to map to a single
> > filesystem block", though.....
> 
> Honestly, this is all quite over my head... my use case is virtiofs,
> which I think is quite different to running a filesystem on bare-metal
> DAX. It's starting to sound like we should perhaps just gate off the
> check for virtiofs only?

Yes, my point was that the solution had better be virtiofs-specific,
because for anybody else blocksize > PAGE_SIZE is going to fail
spectacularly.

> >>> If virtiofs can actually map 4k subpages out of 16k page on
> >>> host (and generally perform 4k granular tracking etc.), it would seem more
> >>> appropriate if virtiofs actually exposed the filesystem 4k block size instead
> >>> of 16k blocksize? Or am I missing something?
> >>
> >> virtiofs itself on the guest does 2MiB mappings into the SHM region, and
> >> then the guest is free to map blocks out of those mappings. So as long
> >> as the guest page size is less than 2MiB, it doesn't matter, since all
> >> files will be aligned in physical memory to that block size. It behaves
> >> as if the filesystem block size is 2MiB from the point of view of the
> >> guest regardless of the actual block size. For example, if the host page
> >> size is 16K, the guest will request a 2MiB mapping of a file, which the
> >> VMM will satisfy by mmapping 128 16K pages from its page cache (at
> >> arbitrary physical memory addresses) into guest "physical" memory as one
> >> contiguous block. Then the guest will see the whole 2MiB mapping as
> >> contiguous, even though it isn't in physical RAM, and it can use any
> >> page granularity it wants (that is supported by the architecture) to map
> >> it to a userland process.
> > 
> > Clearly I'm missing something important because, from this
> > description, I honestly don't know which mapping is actually using
> > DAX.
> > 
> > Can you draw out the virtofs stack from userspace in the guest down
> > to storage in the host so dumb people like myself know exactly where
> > what is being directly accessed and how?
> 
> I'm not familiar with all of the details, but essentially virtiofs is
> FUSE backed by a virtio device instead of userspace, plus the extra DAX
> mapping stuff. Since it's not a real filesystem backed by a block
> device, it has no significant concept of block size itself. i_blkbits
> comes from the st_blksize of the inode stat, which in our case is passed
> through from the underlying filesystem backing the virtiofs in the host
> (but it could be anything, nothing says virtiofs has to be backed by a
> real kernel FS in the host).
> 
> So as a baseline, virtiofs is just FUSE and block size doesn't matter
> since all the non-mmap filesystem APIs shouldn't care about block size
> (other than for optimization reasons and issues with torn racy writes).
> The guest should be able to pretend the block size is 4K for FS/VM
> purposes even if it's 16K in the host, and track everything in the page
> cache and DAX infrastructure in terms of 4K blocks. As far as I know
> there is no operation in plain FUSE that actually cares about the block
> size itself.
> 
> So then there's DAX/mmap. When DAX is enabled, FUSE can issue
> FUSE_SETUPMAPPING and FUSE_REMOVEMAPPING opcodes. These request that a
> range of a file be mapped into the device memory region used by
> virtiofs. When the VMM receives those, it will use mmap to map the
> backing file into the guest's virtio device memory window, and then the
> guest can use DAX to directly access those pages and allow userspace
> processes to mmap them directly. This means that mmaps are coherent
> between processes on the guest and the host (or in another guest), which
> is the main reason we're doing this.
> 
> If you look at fs/fuse/dax.c, you'll see that FUSE_DAX_SHIFT is 21. This
> means that the FUSE code only ever issues
> FUSE_SETUPMAPPING/FUSE_REMOVEMAPPING opcodes with offsets/lengths at
> 2MiB granularity within files. So, regardless of the underlying
> filesystem block size in the host (if there is one at all), the guest
> will always see aligned 2MiB blocks of files available in its virtio
> device region, similar to the hypothetical case of an actual
> block-backed DAX filesystem with a 2MiB allocation block size.

OK, I've read fs/fuse/dax.c and I agree that virtiofs works as if its
block size were 2MB: it not only maps space with this granularity but
also allocates from the underlying storage, tracks extent state, etc.
with this granularity.
 
> We could cap st_blksize in the VMM to 4K, I guess? I don't know if that
> would make more sense than removing the kernel check. On one hand, that
> might result in less optimized I/O if userspace then does 4K writes. On
> the other hand, if we report st_blksize as 16K to userspace then I guess
> it could assume concurrent 16K writes cannot be torn, which is not the
> case if the guest is using 4K pages and page cache blocks (at least not
> until all the folio stuff is worked out for blocks > page size in both
> the page cache and DAX layers).

Yes, I think capping sb->s_blocksize at PAGE_SIZE for virtiofs will be the
cleanest solution for now.
 
> This WARN still feels like the wrong thing, though. Right now it is the
> only thing in DAX code complaining on a page size/block size mismatch
> (at least for virtiofs). If this is so important, I feel like there
> should be a higher level check elsewhere, like something happening at
> mount time or on file open. It should actually cause the operations to
> fail cleanly.

That's a fair point. Currently filesystems supporting DAX check for this in
their mount code because there isn't really any DAX code that would get
called during mount and would have enough information to perform the check.
I'm not sure adding a new call just for this check makes a lot of sense.
But if you have some good place in mind, please tell me.

								Honza
Dan Williams Nov. 6, 2024, 7:59 p.m. UTC | #6
Jan Kara wrote:
[..]
> > This WARN still feels like the wrong thing, though. Right now it is the
> > only thing in DAX code complaining on a page size/block size mismatch
> > (at least for virtiofs). If this is so important, I feel like there
> > should be a higher level check elsewhere, like something happening at
> > mount time or on file open. It should actually cause the operations to
> > fail cleanly.
> 
> That's a fair point. Currently filesystems supporting DAX check for this in
> their mount code because there isn't really any DAX code that would get
> called during mount and would have enough information to perform the check.
> I'm not sure adding a new call just for this check makes a lot of sense.
> But if you have some good place in mind, please tell me.

Isn't the reason that dax_writeback_mapping_range() is the only thing
checking ->i_blkbits that 'struct writeback_control' does writeback
in terms of page-index ranges?

All other dax entry points are filesystem controlled that know the
block-to-pfn-to-mapping relationship.

Recall that dax_writeback_mapping_range() is historically for pmem
persistence guarantees to make sure that applications write through CPU
cache to media.

Presumably there are no cache coherency concerns with fuse and dax
writes from the guest side are not a risk of being stranded in CPU
cache. Host side filesystem writeback will take care of them when / if
the guest triggers a storage device cache flush, not a guest page cache
writeback.

So, I think the simplest fix here is potentially:

diff --git a/fs/fuse/dax.c b/fs/fuse/dax.c
index 12ef91d170bb..15cf7bb20b5e 100644
--- a/fs/fuse/dax.c
+++ b/fs/fuse/dax.c
@@ -777,11 +777,8 @@ ssize_t fuse_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
 static int fuse_dax_writepages(struct address_space *mapping,
 			       struct writeback_control *wbc)
 {
-
-	struct inode *inode = mapping->host;
-	struct fuse_conn *fc = get_fuse_conn(inode);
-
-	return dax_writeback_mapping_range(mapping, fc->dax->dev, wbc);
+	/* nothing to flush, fuse cache coherency is managed by the host */
+	return 0;
 }
 
 static vm_fault_t __fuse_dax_fault(struct vm_fault *vmf, unsigned int order,
Jan Kara Nov. 7, 2024, 10:01 a.m. UTC | #7
On Wed 06-11-24 11:59:44, Dan Williams wrote:
> Jan Kara wrote:
> [..]
> 
> Isn't the reason that dax_writeback_mapping_range() is the only thing
> checking ->i_blkbits that 'struct writeback_control' does writeback
> in terms of page-index ranges?

To be fair, I don't remember why we've put the assertion specifically into
dax_writeback_mapping_range(). But as Dave explained there's much more to
this blocksize == pagesize limitation in DAX than just doing writeback in
terms of page-index ranges. The whole DAX entry tracking in xarray would
have to be modified to properly support other entry sizes than just PTE &
PMD sizes because otherwise the entry locking just doesn't provide the
guarantees that are expected from filesystems (e.g. you could have parallel
modifications happening to a single fs block in pagesize < blocksize case).

> All other dax entry points are filesystem controlled that know the
> block-to-pfn-to-mapping relationship.
> 
> Recall that dax_writeback_mapping_range() is historically for pmem
> persistence guarantees to make sure that applications write through CPU
> cache to media.

Correct.

> Presumably there are no cache coherency concerns with fuse and dax
> writes from the guest side are not a risk of being stranded in CPU
> cache. Host side filesystem writeback will take care of them when / if
> the guest triggers a storage device cache flush, not a guest page cache
> writeback.

I'm not so sure. When you call fsync(2) in the guest on virtiofs file, it
should provide persistency guarantees on the file contents even in case of
*host* power failure. So if the guest is directly mapping host's page cache
pages through virtiofs, filemap_fdatawrite() call in the guest must result
in fsync(2) on the host to persist those pages. And as far as I vaguely
remember that happens by KVM catching the arch_wb_cache_pmem() calls and
issuing fsync(2) on the host. But I could be totally wrong here.

								Honza
Asahi Lina Nov. 7, 2024, 4:09 p.m. UTC | #8
On 11/7/24 7:01 PM, Jan Kara wrote:
> On Wed 06-11-24 11:59:44, Dan Williams wrote:
>> [..]
>> Presumably there are no cache coherency concerns with fuse and dax
>> writes from the guest side are not a risk of being stranded in CPU
>> cache. Host side filesystem writeback will take care of them when / if
>> the guest triggers a storage device cache flush, not a guest page cache
>> writeback.
> 
> I'm not so sure. When you call fsync(2) in the guest on virtiofs file, it
> should provide persistency guarantees on the file contents even in case of
> *host* power failure. So if the guest is directly mapping host's page cache
> pages through virtiofs, filemap_fdatawrite() call in the guest must result
> in fsync(2) on the host to persist those pages. And as far as I vaguely
> remember that happens by KVM catching the arch_wb_cache_pmem() calls and
> issuing fsync(2) on the host. But I could be totally wrong here.

I don't think that's how it actually works, at least on arm64.
arch_wb_cache_pmem() calls dcache_clean_pop() which is either dc cvap or
dc cvac. Those are trapped by HCR_EL2<TPC>, and that is never set by KVM.

There was some discussion of this here:
https://lore.kernel.org/all/20190702055937.3ffpwph7anvohmxu@US-160370MP2.local/

But I'm not sure that all really made sense then.

msync() and fsync() should already provide persistence. Those end up
calling vfs_fsync_range(), which becomes a FUSE fsync(), which fsyncs
(or fdatasyncs) the whole file. What I'm not so sure is whether there
are any other codepaths that also need to provide those guarantees which
*don't* end up calling fsync on the VFS. For example, the manpages kind
of imply munmap() syncs, though as far as I can tell that's not actually
the case. If there are missing sync paths, then I think those might just
be broken right now...

~~ Lina
Matthew Wilcox Nov. 7, 2024, 8:52 p.m. UTC | #9
On Tue, Nov 05, 2024 at 09:16:40AM +1100, Dave Chinner wrote:
> The DAX infrastructure needs the same changes for fsb > page size
> support. We have a limited number of bits we can use for DAX entry
> state:
> 
> /*
>  * DAX pagecache entries use XArray value entries so they can't be mistaken
>  * for pages.  We use one bit for locking, one bit for the entry size (PMD)
>  * and two more to tell us if the entry is a zero page or an empty entry that
>  * is just used for locking.  In total four special bits.
>  *
>  * If the PMD bit isn't set the entry has size PAGE_SIZE, and if the ZERO_PAGE
>  * and EMPTY bits aren't set the entry is a normal DAX entry with a filesystem
>  * block allocation.
>  */
> #define DAX_SHIFT       (4)
> #define DAX_LOCKED      (1UL << 0)
> #define DAX_PMD         (1UL << 1)
> #define DAX_ZERO_PAGE   (1UL << 2)
> #define DAX_EMPTY       (1UL << 3)
> 
> I *think* that we have at most PAGE_SHIFT worth of bits we can
> use because we only store the pfn part of the pfn_t in the dax
> entry. There are PAGE_SHIFT high bits in the pfn_t that hold
> pfn state that we mask out.

We're a lot more constrained than that on 32-bit.  We support up to 40
bits of physical address on arm32 (well, the hardware supports it ...
Linux is not very good with that amount of physical space).  Assuming a
PAGE_SHIFT of 12, we've got 3 bits (yes, the current DAX doesn't support
the 40th bit on arm32).  Fortunately, we don't need more than that.

There is a set of encodings which doesn't seem to have a name (perhaps
I should name it after myself) that can encode any power-of-two range
that is naturally aligned, using just one extra bit.  I've documented it here:

https://kernelnewbies.org/MatthewWilcox/NaturallyAlignedOrder

So we can just recycle the DAX_PMD bit as bit 0 of the encoding.
We can also reclaim DAX_EMPTY by using the "No object" encoding as
DAX_EMPTY.  So that gives us a bit back.

ie the functions I'd actually have in dax.c would be:

#define DAX_LOCKED	1
#define DAX_ZERO_PAGE	2

unsigned int dax_entry_order(void *entry)
{
	return ffsl(xa_to_value(entry) >> 2) - 1;
}

unsigned long dax_to_pfn(void *entry)
{
	unsigned long v = xa_to_value(entry) >> 2;
	return (v & (v - 1)) / 2;
}

void *dax_make_entry(pfn_t pfn, unsigned int order, unsigned long flags)
{
	VM_BUG_ON((pfn_t_to_pfn(pfn) & ((1UL << order) - 1)) != 0);
	flags |= (4UL << order) | (pfn_t_to_pfn(pfn) * 8);
	return xa_mk_value(flags);
}

bool dax_is_empty_entry(void *entry)
{
	return (xa_to_value(entry) >> 2) == 0;
}
Dan Williams Nov. 7, 2024, 9:22 p.m. UTC | #10
Jan Kara wrote:
> On Wed 06-11-24 11:59:44, Dan Williams wrote:
> > Jan Kara wrote:
> > [..]
> > 
> > Isn't the reason that dax_writeback_mapping_range() is the only thing
> > checking ->i_blkbits that 'struct writeback_control' does writeback
> > in terms of page-index ranges?
> 
> To be fair, I don't remember why we've put the assertion specifically into
> dax_writeback_mapping_range(). But as Dave explained there's much more to
> this blocksize == pagesize limitation in DAX than just doing writeback in
> terms of page-index ranges. The whole DAX entry tracking in xarray would
> have to be modified to properly support other entry sizes than just PTE &
> PMD sizes because otherwise the entry locking just doesn't provide the
> guarantees that are expected from filesystems (e.g. you could have parallel
> modifications happening to a single fs block in pagesize < blocksize case).

Oh, yes, agree with that, was just observing that if "i_blkbits !=
PAGE_SHIFT" then at a minimum the range_start and range_end values from
writeback_control would need to be checked for alignment to the block
boundary.

> > All other dax entry points are filesystem controlled that know the
> > block-to-pfn-to-mapping relationship.
> > 
> > Recall that dax_writeback_mapping_range() is historically for pmem
> > persistence guarantees to make sure that applications write through CPU
> > cache to media.
> 
> Correct.
> 
> > Presumably there are no cache coherency concerns with fuse and dax
> > writes from the guest side are not a risk of being stranded in CPU
> > cache. Host side filesystem writeback will take care of them when / if
> > the guest triggers a storage device cache flush, not a guest page cache
> > writeback.
> 
> I'm not so sure. When you call fsync(2) in the guest on virtiofs file, it
> should provide persistency guarantees on the file contents even in case of
> *host* power failure.

It should, yes, but not necessarily through
dax_writeback_mapping_range().

> So if the guest is directly mapping host's page cache pages through
> virtiofs, filemap_fdatawrite() call in the guest must result in
> fsync(2) on the host to persist those pages. And as far as I vaguely
> remember that happens by KVM catching the arch_wb_cache_pmem() calls
> and issuing fsync(2) on the host. But I could be totally wrong here.

While I imagine you could invent some scheme to trap
dax_flush()/arch_wb_cache_pmem() as the signal to trigger page writeback
on the host, my expectation is that should be handled by the
REQ_{PREFLUSH,FUA} to the backing device that follows a page-cache
writeback event. This is the approach taken by virtio_pmem.

Now, if virtio_fs does not have a block_device to receive those requests
then I can see why trapping arch_wb_cache_pmem() is attempted, but a
backing device signal to flush the host conceptually makes more sense to
me because dax, on the guest side, explicitly means there are no
software buffers to write-back. The host just needs a coarse signal that
if it is buffering any pages on behalf of the guest, it now needs to
flush them.
Dan Williams Nov. 7, 2024, 9:24 p.m. UTC | #11
Asahi Lina wrote:
[..]
> I don't think that's how it actually works, at least on arm64.
> arch_wb_cache_pmem() calls dcache_clean_pop() which is either dc cvap or
> dc cvac. Those are trapped by HCR_EL2<TPC>, and that is never set by KVM.
> 
> There was some discussion of this here:
> https://lore.kernel.org/all/20190702055937.3ffpwph7anvohmxu@US-160370MP2.local/
> 
> But I'm not sure that all really made sense then.
> 
> msync() and fsync() should already provide persistence. Those end up
> calling vfs_fsync_range(), which becomes a FUSE fsync(), which fsyncs
> (or fdatasyncs) the whole file. What I'm not so sure is whether there
> are any other codepaths that also need to provide those guarantees which
> *don't* end up calling fsync on the VFS. For example, the manpages kind
> of imply munmap() syncs, though as far as I can tell that's not actually
> the case. If there are missing sync paths, then I think those might just
> be broken right now...

IIRC, from the pmem persistence discussions, if userspace fails to call
*sync then there is no obligation to flush on munmap() or close(). Some
filesystems layer on those guarantees, but the behavior is
implementation specific.
Asahi Lina Nov. 7, 2024, 10:34 p.m. UTC | #12
On 11/8/24 6:24 AM, Dan Williams wrote:
> Asahi Lina wrote:
> [..]
> 
> IIRC, from the pmem persistence discussions, if userspace fails to call
> *sync then there is no obligation to flush on munmap() or close(). Some
> filesystems layer on those guarantees, but the behavior is
> implementation specific.

I think your patch should be fine, then, since there's nothing to do
for writepages(). The syncing is handled via fsync() for FUSE/virtiofs,
and I don't think dax_writeback_mapping_range() is actually doing
anything under KVM anyway.

~~ Lina
Jan Kara Nov. 8, 2024, 12:16 p.m. UTC | #13
On Fri 08-11-24 01:09:54, Asahi Lina wrote:
> On 11/7/24 7:01 PM, Jan Kara wrote:
> > On Wed 06-11-24 11:59:44, Dan Williams wrote:
> >> [..]
> >> Presumably there are no cache coherency concerns with fuse and dax
> >> writes from the guest side are not a risk of being stranded in CPU
> >> cache. Host side filesystem writeback will take care of them when / if
> >> the guest triggers a storage device cache flush, not a guest page cache
> >> writeback.
> > 
> > I'm not so sure. When you call fsync(2) in the guest on virtiofs file, it
> > should provide persistency guarantees on the file contents even in case of
> > *host* power failure. So if the guest is directly mapping host's page cache
> > pages through virtiofs, filemap_fdatawrite() call in the guest must result
> > in fsync(2) on the host to persist those pages. And as far as I vaguely
> > remember that happens by KVM catching the arch_wb_cache_pmem() calls and
> > issuing fsync(2) on the host. But I could be totally wrong here.
> 
> I don't think that's how it actually works, at least on arm64.
> arch_wb_cache_pmem() calls dcache_clean_pop() which is either dc cvap or
> dc cvac. Those are trapped by HCR_EL2<TPC>, and that is never set by KVM.
> 
> There was some discussion of this here:
> https://lore.kernel.org/all/20190702055937.3ffpwph7anvohmxu@US-160370MP2.local/

I see. Thanks for correcting me.

> But I'm not sure that all really made sense then.
> 
> msync() and fsync() should already provide persistence. Those end up
> calling vfs_fsync_range(), which becomes a FUSE fsync(), which fsyncs
> (or fdatasyncs) the whole file. What I'm not so sure is whether there
> are any other codepaths that also need to provide those guarantees which
> *don't* end up calling fsync on the VFS. For example, the manpages kind
> of imply munmap() syncs, though as far as I can tell that's not actually
> the case. If there are missing sync paths, then I think those might just
> be broken right now...

munmap(2) is not an issue because it has no persistency guarantees in
case of power failure attached to it. Thinking about it some more, I agree
that just dropping dax_writeback_mapping_range() from virtiofs should be
safe. The modifications are going to be persisted by the host eventually
(so writeback as such isn't needed) and all crash-safe guarantees revolve
around calls like fsync(2), sync(2), syncfs(2), which get passed through
by fuse and hopefully acted upon on the host. I'm quite confident about this
because even standard filesystems such as ext4 flush disk caches only in
response to operations like these (plus some in journalling code, but that's
a separate story).

								Honza
Asahi Lina Nov. 12, 2024, 9:49 a.m. UTC | #14
On 11/8/24 9:16 PM, Jan Kara wrote:
> On Fri 08-11-24 01:09:54, Asahi Lina wrote:
>> [..]
>> msync() and fsync() should already provide persistence. Those end up
>> calling vfs_fsync_range(), which becomes a FUSE fsync(), which fsyncs
>> (or fdatasyncs) the whole file. What I'm not so sure is whether there
>> are any other codepaths that also need to provide those guarantees which
>> *don't* end up calling fsync on the VFS. For example, the manpages kind
>> of imply munmap() syncs, though as far as I can tell that's not actually
>> the case. If there are missing sync paths, then I think those might just
>> be broken right now...
> 
> munmap(2) is not an issue because it has no persistency guarantees in
> case of power failure attached to it. Thinking about it some more, I agree
> that just dropping dax_writeback_mapping_range() from virtiofs should be
> safe. The modifications are going to be persisted by the host eventually
> (so writeback as such isn't needed) and all crash-safe guarantees revolve
> around calls like fsync(2), sync(2), syncfs(2), which get passed through
> by fuse and hopefully acted upon on the host. I'm quite confident about this
> because even standard filesystems such as ext4 flush disk caches only in
> response to operations like these (plus some in journalling code, but that's
> a separate story).
> 
> 								Honza

I think we should go with that then. Should I send it as Suggested-by:
Dan or do you want to send it?

~~ Lina
Jan Kara Nov. 12, 2024, 2:34 p.m. UTC | #15
On Tue 12-11-24 18:49:46, Asahi Lina wrote:
> [..]
> 
> I think we should go with that then. Should I send it as Suggested-by:
> Dan or do you want to send it?

I say go ahead and send it with Dan's suggested-by :)

								Honza
Dan Williams Nov. 12, 2024, 7:32 p.m. UTC | #16
Jan Kara wrote:
[..]
> > I think we should go with that then. Should I send it as Suggested-by:
> > Dan or do you want to send it?
> 
> I say go ahead and send it with Dan's suggested-by :)

Yeah, no concerns from me, and I can ack it when it comes out.
Patch

diff --git a/fs/dax.c b/fs/dax.c
index c62acd2812f8d4981aaba82acfeaf972f555362a..406fb75bdbe9d17a6e4bf3d4cb92683e90f05910 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1032,7 +1032,7 @@ int dax_writeback_mapping_range(struct address_space *mapping,
 	int ret = 0;
 	unsigned int scanned = 0;
 
-	if (WARN_ON_ONCE(inode->i_blkbits != PAGE_SHIFT))
+	if (WARN_ON_ONCE(inode->i_blkbits < PAGE_SHIFT))
 		return -EIO;
 
 	if (mapping_empty(mapping) || wbc->sync_mode != WB_SYNC_ALL)