diff mbox series

[RFC] ext4: Add support for ext4_map_blocks_atomic()

Message ID 3a417188e5abe3048afac3d31ebbf11588b6d68d.1709927824.git.ritesh.list@gmail.com (mailing list archive)
State New, archived
Headers show
Series [RFC] ext4: Add support for ext4_map_blocks_atomic() | expand

Commit Message

Ritesh Harjani (IBM) March 8, 2024, 8:25 p.m. UTC
Currently ext4 exposes [fsawu_min, fsawu_max] size as
[blocksize, clustersize] (given the hw block device constraints are
larger than FS atomic write units).

That means a user should be allowed to -
1. pwrite 0 4k /mnt/test/f1
2. pwrite 0 16k /mnt/test/f1

w/o this patch the second atomic write will fail. Since current
ext4_map_blocks() will just return the already allocated extent length
to the iomap (which is less than the user requested write length).

So add ext4_map_blocks_atomic() function which can allocate full
requested length for doing an atomic write before returning to iomap.
With this we have - 

1. touch /mnt1/test/f2
2. chattr +W /mnt1/test/f2
3. xfs_io -dc "pwrite -b 4k -A -V 1 0 4k" /mnt1/test/f2
	wrote 4096/4096 bytes at offset 0
	4 KiB, 1 ops; 0.0320 sec (124.630 KiB/sec and 31.1575 ops/sec)
4. filefrag -v /mnt1/test/f2
	Filesystem type is: ef53
	File size of /mnt1/test/f2 is 4096 (1 block of 4096 bytes)
	 ext:     logical_offset:        physical_offset: length:   expected: flags:
	   0:        0..       0:       9728..      9728:      1:             last,eof
	/mnt1/test/f2: 1 extent found
5. xfs_io -dc "pwrite -b 16k -A -V 1 0 16k" /mnt1/test/f2
	wrote 16384/16384 bytes at offset 0
	16 KiB, 1 ops; 0.0337 sec (474.637 KiB/sec and 29.6648 ops/sec)
6. filefrag -v /mnt1/test/f2
	Filesystem type is: ef53
	File size of /mnt1/test/f2 is 16384 (4 blocks of 4096 bytes)
	 ext:     logical_offset:        physical_offset: length:   expected: flags:
	   0:        0..       3:       9728..      9731:      4:             last,eof
	/mnt1/test/f2: 1 extent found

Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
---

Please note, that this is very minimal tested. But it serves as a PoC of what
can be done within ext4 to allow the usecase which John pointed out.

This also shows that every filesystem can have a different ways of doing aligned
allocations to support atomic writes. So lifting extent size hints to iomap
perhaps might become very XFS centric? Althouh as long as other filesystems are 
not forced to follow that, I don't think it should be a problem.


 fs/ext4/ext4.h  |  2 ++
 fs/ext4/inode.c | 40 +++++++++++++++++++++++++++++++++++++---
 2 files changed, 39 insertions(+), 3 deletions(-)

Comments

Ritesh Harjani (IBM) March 9, 2024, 2:37 a.m. UTC | #1
"Ritesh Harjani (IBM)" <ritesh.list@gmail.com> writes:

> +int ext4_map_blocks_atomic(handle_t *handle, struct inode *inode,
> +			   struct ext4_map_blocks *map, int flags)
> +{
> +	unsigned int mapped_len = 0, m_len = map->m_len;
> +	ext4_lblk_t m_lblk = map->m_lblk;
> +	int ret;
> +
> +	WARN_ON(!(flags & EXT4_GET_BLOCKS_CREATE));
> +
> +	do {
> +		ret = ext4_map_blocks(handle, inode, map, flags);
> +		if (ret < 0)
> +			return ret;
> +		mapped_len += map->m_len;
> +		map->m_lblk += map->m_len;
> +		map->m_len = m_len - mapped_len;
> +	} while (mapped_len < m_len);
> +
> +	map->m_lblk = m_lblk;
> +	map->m_len = mapped_len;
> +	return mapped_len;

ouch! 
1. I need to make sure map.m_pblk is updated properly.
2. I need to make sure above call only happens with bigalloc.

Sorry about that. Generally not a good idea to send something that late
at night.
But I guess this can be fixed easily. so hopefully the algorithm should
still remain, more or less the same for ext4_map_blocks_atomic().

-ritesh
John Garry March 13, 2024, 6:40 p.m. UTC | #2
On 08/03/2024 20:25, Ritesh Harjani (IBM) wrote:

Hi Ritesh,

> Currently ext4 exposes [fsawu_min, fsawu_max] size as
> [blocksize, clustersize] (given the hw block device constraints are
> larger than FS atomic write units).
> 
> That means a user should be allowed to -
> 1. pwrite 0 4k /mnt/test/f1
> 2. pwrite 0 16k /mnt/test/f1
> 

Previously you have mentioned 2 or 3 methods in which ext4 could support 
atomic writes. To avoid doubt, is this patch for the "Add intelligence 
in multi-block allocator of ext4 to provide aligned allocations (this 
option won't require any formatting)" method mentioned at 
https://lore.kernel.org/linux-fsdevel/8734tb0xx7.fsf@doe.com/

and same as method 3 at 
https://lore.kernel.org/linux-fsdevel/cover.1709356594.git.ritesh.list@gmail.com/? 


Thanks,
John
Ritesh Harjani (IBM) March 14, 2024, 3:52 p.m. UTC | #3
John Garry <john.g.garry@oracle.com> writes:

> On 08/03/2024 20:25, Ritesh Harjani (IBM) wrote:
>
> Hi Ritesh,
>
>> Currently ext4 exposes [fsawu_min, fsawu_max] size as
>> [blocksize, clustersize] (given the hw block device constraints are
>> larger than FS atomic write units).
>> 
>> That means a user should be allowed to -
>> 1. pwrite 0 4k /mnt/test/f1
>> 2. pwrite 0 16k /mnt/test/f1
>> 
>
> Previously you have mentioned 2 or 3 methods in which ext4 could support 
> atomic writes. To avoid doubt, is this patch for the "Add intelligence 
> in multi-block allocator of ext4 to provide aligned allocations (this 
> option won't require any formatting)" method mentioned at 
> https://lore.kernel.org/linux-fsdevel/8734tb0xx7.fsf@doe.com/
>
> and same as method 3 at 
> https://lore.kernel.org/linux-fsdevel/cover.1709356594.git.ritesh.list@gmail.com/? 

Hi John,

No. So this particular patch to add ext4_map_blocks_atomic() method is
only to support the usecase which you listed should work for a good user
behaviour. This is because, with bigalloc we advertizes fsawu_min and
fsawu_max as [blocksize, clustersize]
i.e. 

That means a user should be allowed to -
1. pwrite 0 4k /mnt/test/f1
followed by 
2. pwrite 0 16k /mnt/test/f1


So earlier we were failing the second 16k write at an offset where there
is already an existing extent smaller that 16k (that was because of the
assumption that the most of the users won't do such a thing).

But for a more general usecase, it is not difficult to support the
second 16k write in such a way for atomic writes with bigalloc,
so this patch just adds that support to this series.     

-ritesh 


>
>
> Thanks,
> John
John Garry March 18, 2024, 8:22 a.m. UTC | #4
On 14/03/2024 15:52, Ritesh Harjani (IBM) wrote:
>> and same as method 3 at
>> https://urldefense.com/v3/__https://lore.kernel.org/linux-fsdevel/cover.1709356594.git.ritesh.list@gmail.com/?__;!!ACWV5N9M2RV99hQ!Pb-HbBdm2OWUIGDFfG1OkemtRSy2LyHsc5s6WiyTtGHW4uGWV6sMkoVjmknmBydf_i6TF_CDqp7dR0Y-CGY8EIc$   
> Hi John,
> 
> No. So this particular patch to add ext4_map_blocks_atomic() method is
> only to support the usecase which you listed should work for a good user
> behaviour. This is because, with bigalloc we advertizes fsawu_min and
> fsawu_max as [blocksize, clustersize]
> i.e.
> 
> That means a user should be allowed to -
> 1. pwrite 0 4k /mnt/test/f1
> followed by
> 2. pwrite 0 16k /mnt/test/f1
> 
> 
> So earlier we were failing the second 16k write at an offset where there
> is already an existing extent smaller that 16k (that was because of the
> assumption that the most of the users won't do such a thing).
> 
> But for a more general usecase, it is not difficult to support the
> second 16k write in such a way for atomic writes with bigalloc,
> so this patch just adds that support to this series.

Is there some reason for which the generic iomap solution in 
https://lore.kernel.org/linux-xfs/20240304130428.13026-1-john.g.garry@oracle.com/ 
won't work? That is, you would just need to set iomap->extent_shift 
appropriately. I will note that we gate this feature on XFS based on 
forcealign enabled for the inode - I am not sure if you would want this 
always for bigalloc.

Thanks,
John
diff mbox series

Patch

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 529ca32b9813..1e9adc5d6569 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -3702,6 +3702,8 @@  extern int ext4_convert_unwritten_io_end_vec(handle_t *handle,
 					     ext4_io_end_t *io_end);
 extern int ext4_map_blocks(handle_t *handle, struct inode *inode,
 			   struct ext4_map_blocks *map, int flags);
+extern int ext4_map_blocks_atomic(handle_t *handle, struct inode *inode,
+				  struct ext4_map_blocks *map, int flags);
 extern int ext4_ext_calc_credits_for_single_extent(struct inode *inode,
 						   int num,
 						   struct ext4_ext_path *path);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index ea009ca9085d..db273c7faf36 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -453,6 +453,29 @@  static void ext4_map_blocks_es_recheck(handle_t *handle,
 }
 #endif /* ES_AGGRESSIVE_TEST */
 
+int ext4_map_blocks_atomic(handle_t *handle, struct inode *inode,
+			   struct ext4_map_blocks *map, int flags)
+{
+	unsigned int mapped_len = 0, m_len = map->m_len;
+	ext4_lblk_t m_lblk = map->m_lblk;
+	int ret;
+
+	WARN_ON(!(flags & EXT4_GET_BLOCKS_CREATE));
+
+	do {
+		ret = ext4_map_blocks(handle, inode, map, flags);
+		if (ret < 0)
+			return ret;
+		mapped_len += map->m_len;
+		map->m_lblk += map->m_len;
+		map->m_len = m_len - mapped_len;
+	} while (mapped_len < m_len);
+
+	map->m_lblk = m_lblk;
+	map->m_len = mapped_len;
+	return mapped_len;
+}
+
 /*
  * The ext4_map_blocks() function tries to look up the requested blocks,
  * and returns if the blocks are already mapped.
@@ -3315,7 +3338,10 @@  static int ext4_iomap_alloc(struct inode *inode, struct ext4_map_blocks *map,
 	else if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
 		m_flags = EXT4_GET_BLOCKS_IO_CREATE_EXT;
 
-	ret = ext4_map_blocks(handle, inode, map, m_flags);
+	if (flags & IOMAP_ATOMIC)
+		ret = ext4_map_blocks_atomic(handle, inode, map, m_flags);
+	else
+		ret = ext4_map_blocks(handle, inode, map, m_flags);
 
 	/*
 	 * We cannot fill holes in indirect tree based inodes as that could
@@ -3339,6 +3365,7 @@  static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
 	int ret;
 	struct ext4_map_blocks map;
 	u8 blkbits = inode->i_blkbits;
+	unsigned int orig_len;
 
 	if ((offset >> blkbits) > EXT4_MAX_LOGICAL_BLOCK)
 		return -EINVAL;
@@ -3352,6 +3379,7 @@  static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
 	map.m_lblk = offset >> blkbits;
 	map.m_len = min_t(loff_t, (offset + length - 1) >> blkbits,
 			  EXT4_MAX_LOGICAL_BLOCK) - map.m_lblk + 1;
+	orig_len = map.m_len;
 
 	if (flags & IOMAP_WRITE) {
 		/*
@@ -3362,9 +3390,15 @@  static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
 		 */
 		if (offset + length <= i_size_read(inode)) {
 			ret = ext4_map_blocks(NULL, inode, &map, 0);
-			if (ret > 0 && (map.m_flags & EXT4_MAP_MAPPED))
-				goto out;
+			if (map.m_flags & EXT4_MAP_MAPPED) {
+				if ((flags & IOMAP_ATOMIC && ret >= orig_len) ||
+				   (!(flags & IOMAP_ATOMIC) && ret > 0))
+					goto out;
+
+			}
 		}
+		WARN_ON(map.m_lblk != offset >> blkbits);
+		map.m_len = orig_len;
 		ret = ext4_iomap_alloc(inode, &map, flags);
 	} else {
 		ret = ext4_map_blocks(NULL, inode, &map, 0);