diff mbox series

[RFC,2/2] libfs: Improve behavior when directory offset values wrap

Message ID 20241117213206.1636438-3-cel@kernel.org (mailing list archive)
State New
Headers show
Series Improve simple directory offset wrap behavior | expand

Commit Message

Chuck Lever Nov. 17, 2024, 9:32 p.m. UTC
From: Chuck Lever <chuck.lever@oracle.com>

The fix in commit 64a7ce76fb90 ("libfs: fix infinite directory reads
for offset dir") introduced a fence in offset_iterate_dir() to stop
the loop from returning child entries created after the directory
was opened. This comparison relies on the strong ordering of
DIR_OFFSET_MIN <= largest child offset <= next_offset to terminate
the directory iteration.

However, because simple_offset_add() uses mtree_alloc_cyclic() to
select each next new directory offset, ctx->next_offset is not
always the highest unused offset. Once mtree_alloc_cyclic() allows
a new offset value to wrap, ctx->next_offset will be set to a value
less than the actual largest child offset.

The result is that readdir(3) no longer shows any entries in the
directory because their offsets are above ctx->next_offset, which is
now a small value. This situation is persistent, and the directory
cannot be removed unless all current children are already known and
can be explicitly removed by name first.

In the current Maple tree implementation, there is no practical way
that 63-bit offset values can ever wrap, so this issue is cleverly
avoided. But the ordering dependency is not documented via comments
or code, making the mechanism somewhat brittle. And it makes the
continued use of mtree_alloc_cyclic() somewhat confusing.

Further, if commit 64a7ce76fb90 ("libfs: fix infinite directory
reads for offset dir") were to be backported to a kernel that still
uses xarray to manage simple directory offsets, the directory offset
value range is limited to 32-bits, which is small enough to allow a
wrap after a few weeks of constant creation of entries in one
directory.

Therefore, replace the use of ctx->next_offset for fencing new
children from appearing in readdir results.

A jiffies timestamp marks the end of each opendir epoch. Entries
created after this timestamp will not be visible to the file
descriptor. I chose jiffies so that the dentry->d_time field can be
re-used for storing the entry creation time.

The new mechanism has its own corner cases. For instance, I think
if jiffies wraps twice while a directory is open, some children
might become invisible. On 32-bit systems, the jiffies value wraps
every 49 days. Double-wrapping is not a risk on systems with 64-bit
jiffies. Unlike with the next_offset-based mechanism, re-opening the
directory will make invisible children re-appear.

Reported-by: Yu Kuai <yukuai3@huawei.com>
Closes: https://lore.kernel.org/stable/20241111005242.34654-1-cel@kernel.org/T/#m1c448e5bd4aae3632a09468affcfe1d1594c6a59
Fixes: 64a7ce76fb90 ("libfs: fix infinite directory reads for offset dir")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 fs/libfs.c | 36 +++++++++++++++++-------------------
 1 file changed, 17 insertions(+), 19 deletions(-)

Comments

Jeff Layton Nov. 18, 2024, 8 p.m. UTC | #1
On Sun, 2024-11-17 at 16:32 -0500, cel@kernel.org wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
> 
> The fix in commit 64a7ce76fb90 ("libfs: fix infinite directory reads
> for offset dir") introduced a fence in offset_iterate_dir() to stop
> the loop from returning child entries created after the directory
> was opened. This comparison relies on the strong ordering of
> DIR_OFFSET_MIN <= largest child offset <= next_offset to terminate
> the directory iteration.
> 
> However, because simple_offset_add() uses mtree_alloc_cyclic() to
> select each next new directory offset, ctx->next_offset is not
> always the highest unused offset. Once mtree_alloc_cyclic() allows
> a new offset value to wrap, ctx->next_offset will be set to a value
> less than the actual largest child offset.
> 
> The result is that readdir(3) no longer shows any entries in the
> directory because their offsets are above ctx->next_offset, which is
> now a small value. This situation is persistent, and the directory
> cannot be removed unless all current children are already known and
> can be explicitly removed by name first.
> 
> In the current Maple tree implementation, there is no practical way
> that 63-bit offset values can ever wrap, so this issue is cleverly
> avoided. But the ordering dependency is not documented via comments
> or code, making the mechanism somewhat brittle. And it makes the
> continued use of mtree_alloc_cyclic() somewhat confusing.
> 
> Further, if commit 64a7ce76fb90 ("libfs: fix infinite directory
> reads for offset dir") were to be backported to a kernel that still
> uses xarray to manage simple directory offsets, the directory offset
> value range is limited to 32-bits, which is small enough to allow a
> wrap after a few weeks of constant creation of entries in one
> directory.
> 
> Therefore, replace the use of ctx->next_offset for fencing new
> children from appearing in readdir results.
> 
> A jiffies timestamp marks the end of each opendir epoch. Entries
> created after this timestamp will not be visible to the file
> descriptor. I chose jiffies so that the dentry->d_time field can be
> re-used for storing the entry creation time.
> 
> The new mechanism has its own corner cases. For instance, I think
> if jiffies wraps twice while a directory is open, some children
> might become invisible. On 32-bit systems, the jiffies value wraps
> every 49 days. Double-wrapping is not a risk on systems with 64-bit
> jiffies. Unlike with the next_offset-based mechanism, re-opening the
> directory will make invisible children re-appear.
> 
> Reported-by: Yu Kuai <yukuai3@huawei.com>
> Closes: https://lore.kernel.org/stable/20241111005242.34654-1-cel@kernel.org/T/#m1c448e5bd4aae3632a09468affcfe1d1594c6a59
> Fixes: 64a7ce76fb90 ("libfs: fix infinite directory reads for offset dir")
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
>  fs/libfs.c | 36 +++++++++++++++++-------------------
>  1 file changed, 17 insertions(+), 19 deletions(-)
> 
> diff --git a/fs/libfs.c b/fs/libfs.c
> index bf67954b525b..862a603fd454 100644
> --- a/fs/libfs.c
> +++ b/fs/libfs.c
> @@ -294,6 +294,7 @@ int simple_offset_add(struct offset_ctx *octx, struct dentry *dentry)
>  		return ret;
>  
>  	offset_set(dentry, offset);
> +	WRITE_ONCE(dentry->d_time, jiffies);
>  	return 0;
>  }
>  
> @@ -454,9 +455,7 @@ void simple_offset_destroy(struct offset_ctx *octx)
>  
>  static int offset_dir_open(struct inode *inode, struct file *file)
>  {
> -	struct offset_ctx *ctx = inode->i_op->get_offset_ctx(inode);
> -
> -	file->private_data = (void *)ctx->next_offset;
> +	file->private_data = (void *)jiffies;
>  	return 0;
>  }
>  
> @@ -473,9 +472,6 @@ static int offset_dir_open(struct inode *inode, struct file *file)
>   */
>  static loff_t offset_dir_llseek(struct file *file, loff_t offset, int whence)
>  {
> -	struct inode *inode = file->f_inode;
> -	struct offset_ctx *ctx = inode->i_op->get_offset_ctx(inode);
> -
>  	switch (whence) {
>  	case SEEK_CUR:
>  		offset += file->f_pos;
> @@ -490,7 +486,8 @@ static loff_t offset_dir_llseek(struct file *file, loff_t offset, int whence)
>  
>  	/* In this case, ->private_data is protected by f_pos_lock */
>  	if (!offset)
> -		file->private_data = (void *)ctx->next_offset;
> +		/* Make newer child entries visible */
> +		file->private_data = (void *)jiffies;
>  	return vfs_setpos(file, offset, LONG_MAX);
>  }
>  
> @@ -521,7 +518,8 @@ static bool offset_dir_emit(struct dir_context *ctx, struct dentry *dentry)
>  			  inode->i_ino, fs_umode_to_dtype(inode->i_mode));
>  }
>  
> -static void offset_iterate_dir(struct inode *inode, struct dir_context *ctx, long last_index)
> +static void offset_iterate_dir(struct inode *inode, struct dir_context *ctx,
> +			       unsigned long fence)
>  {
>  	struct offset_ctx *octx = inode->i_op->get_offset_ctx(inode);
>  	struct dentry *dentry;
> @@ -531,14 +529,15 @@ static void offset_iterate_dir(struct inode *inode, struct dir_context *ctx, lon
>  		if (!dentry)
>  			return;
>  
> -		if (dentry2offset(dentry) >= last_index) {
> -			dput(dentry);
> -			return;
> -		}
> -
> -		if (!offset_dir_emit(ctx, dentry)) {
> -			dput(dentry);
> -			return;
> +		/*
> +		 * Output only child entries created during or before
> +		 * the current opendir epoch.
> +		 */
> +		if (time_before_eq(dentry->d_time, fence)) {
> +			if (!offset_dir_emit(ctx, dentry)) {
> +				dput(dentry);
> +				return;
> +			}
>  		}
>  
>  		ctx->pos = dentry2offset(dentry) + 1;
> @@ -569,15 +568,14 @@ static void offset_iterate_dir(struct inode *inode, struct dir_context *ctx, lon
>   */
>  static int offset_readdir(struct file *file, struct dir_context *ctx)
>  {
> +	unsigned long fence = (unsigned long)file->private_data;
>  	struct dentry *dir = file->f_path.dentry;
> -	long last_index = (long)file->private_data;
>  
>  	lockdep_assert_held(&d_inode(dir)->i_rwsem);
>  
>  	if (!dir_emit_dots(file, ctx))
>  		return 0;
> -
> -	offset_iterate_dir(d_inode(dir), ctx, last_index);
> +	offset_iterate_dir(d_inode(dir), ctx, fence);
>  	return 0;
>  }
>  

Using timestamps instead of directory ordering does seem less brittle,
and the choice to use jiffies makes sense given that d_time is also an
unsigned long.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Chuck Lever III Nov. 18, 2024, 8:58 p.m. UTC | #2
On Mon, Nov 18, 2024 at 03:00:56PM -0500, Jeff Layton wrote:
> On Sun, 2024-11-17 at 16:32 -0500, cel@kernel.org wrote:
> > From: Chuck Lever <chuck.lever@oracle.com>
> > 
> > The fix in commit 64a7ce76fb90 ("libfs: fix infinite directory reads
> > for offset dir") introduced a fence in offset_iterate_dir() to stop
> > the loop from returning child entries created after the directory
> > was opened. This comparison relies on the strong ordering of
> > DIR_OFFSET_MIN <= largest child offset <= next_offset to terminate
> > the directory iteration.
> > 
> > However, because simple_offset_add() uses mtree_alloc_cyclic() to
> > select each next new directory offset, ctx->next_offset is not
> > always the highest unused offset. Once mtree_alloc_cyclic() allows
> > a new offset value to wrap, ctx->next_offset will be set to a value
> > less than the actual largest child offset.
> > 
> > The result is that readdir(3) no longer shows any entries in the
> > directory because their offsets are above ctx->next_offset, which is
> > now a small value. This situation is persistent, and the directory
> > cannot be removed unless all current children are already known and
> > can be explicitly removed by name first.
> > 
> > In the current Maple tree implementation, there is no practical way
> > that 63-bit offset values can ever wrap, so this issue is cleverly
> > avoided. But the ordering dependency is not documented via comments
> > or code, making the mechanism somewhat brittle. And it makes the
> > continued use of mtree_alloc_cyclic() somewhat confusing.
> > 
> > Further, if commit 64a7ce76fb90 ("libfs: fix infinite directory
> > reads for offset dir") were to be backported to a kernel that still
> > uses xarray to manage simple directory offsets, the directory offset
> > value range is limited to 32-bits, which is small enough to allow a
> > wrap after a few weeks of constant creation of entries in one
> > directory.
> > 
> > Therefore, replace the use of ctx->next_offset for fencing new
> > children from appearing in readdir results.
> > 
> > A jiffies timestamp marks the end of each opendir epoch. Entries
> > created after this timestamp will not be visible to the file
> > descriptor. I chose jiffies so that the dentry->d_time field can be
> > re-used for storing the entry creation time.
> > 
> > The new mechanism has its own corner cases. For instance, I think
> > if jiffies wraps twice while a directory is open, some children
> > might become invisible. On 32-bit systems, the jiffies value wraps
> > every 49 days. Double-wrapping is not a risk on systems with 64-bit
> > jiffies. Unlike with the next_offset-based mechanism, re-opening the
> > directory will make invisible children re-appear.
> > 
> > Reported-by: Yu Kuai <yukuai3@huawei.com>
> > Closes: https://lore.kernel.org/stable/20241111005242.34654-1-cel@kernel.org/T/#m1c448e5bd4aae3632a09468affcfe1d1594c6a59
> > Fixes: 64a7ce76fb90 ("libfs: fix infinite directory reads for offset dir")
> > Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> > ---
> >  fs/libfs.c | 36 +++++++++++++++++-------------------
> >  1 file changed, 17 insertions(+), 19 deletions(-)
> > 
> > diff --git a/fs/libfs.c b/fs/libfs.c
> > index bf67954b525b..862a603fd454 100644
> > --- a/fs/libfs.c
> > +++ b/fs/libfs.c
> > @@ -294,6 +294,7 @@ int simple_offset_add(struct offset_ctx *octx, struct dentry *dentry)
> >  		return ret;
> >  
> >  	offset_set(dentry, offset);
> > +	WRITE_ONCE(dentry->d_time, jiffies);
> >  	return 0;
> >  }
> >  
> > @@ -454,9 +455,7 @@ void simple_offset_destroy(struct offset_ctx *octx)
> >  
> >  static int offset_dir_open(struct inode *inode, struct file *file)
> >  {
> > -	struct offset_ctx *ctx = inode->i_op->get_offset_ctx(inode);
> > -
> > -	file->private_data = (void *)ctx->next_offset;
> > +	file->private_data = (void *)jiffies;
> >  	return 0;
> >  }
> >  
> > @@ -473,9 +472,6 @@ static int offset_dir_open(struct inode *inode, struct file *file)
> >   */
> >  static loff_t offset_dir_llseek(struct file *file, loff_t offset, int whence)
> >  {
> > -	struct inode *inode = file->f_inode;
> > -	struct offset_ctx *ctx = inode->i_op->get_offset_ctx(inode);
> > -
> >  	switch (whence) {
> >  	case SEEK_CUR:
> >  		offset += file->f_pos;
> > @@ -490,7 +486,8 @@ static loff_t offset_dir_llseek(struct file *file, loff_t offset, int whence)
> >  
> >  	/* In this case, ->private_data is protected by f_pos_lock */
> >  	if (!offset)
> > -		file->private_data = (void *)ctx->next_offset;
> > +		/* Make newer child entries visible */
> > +		file->private_data = (void *)jiffies;
> >  	return vfs_setpos(file, offset, LONG_MAX);
> >  }
> >  
> > @@ -521,7 +518,8 @@ static bool offset_dir_emit(struct dir_context *ctx, struct dentry *dentry)
> >  			  inode->i_ino, fs_umode_to_dtype(inode->i_mode));
> >  }
> >  
> > -static void offset_iterate_dir(struct inode *inode, struct dir_context *ctx, long last_index)
> > +static void offset_iterate_dir(struct inode *inode, struct dir_context *ctx,
> > +			       unsigned long fence)
> >  {
> >  	struct offset_ctx *octx = inode->i_op->get_offset_ctx(inode);
> >  	struct dentry *dentry;
> > @@ -531,14 +529,15 @@ static void offset_iterate_dir(struct inode *inode, struct dir_context *ctx, lon
> >  		if (!dentry)
> >  			return;
> >  
> > -		if (dentry2offset(dentry) >= last_index) {
> > -			dput(dentry);
> > -			return;
> > -		}
> > -
> > -		if (!offset_dir_emit(ctx, dentry)) {
> > -			dput(dentry);
> > -			return;
> > +		/*
> > +		 * Output only child entries created during or before
> > +		 * the current opendir epoch.
> > +		 */
> > +		if (time_before_eq(dentry->d_time, fence)) {
> > +			if (!offset_dir_emit(ctx, dentry)) {
> > +				dput(dentry);
> > +				return;
> > +			}
> >  		}
> >  
> >  		ctx->pos = dentry2offset(dentry) + 1;
> > @@ -569,15 +568,14 @@ static void offset_iterate_dir(struct inode *inode, struct dir_context *ctx, lon
> >   */
> >  static int offset_readdir(struct file *file, struct dir_context *ctx)
> >  {
> > +	unsigned long fence = (unsigned long)file->private_data;
> >  	struct dentry *dir = file->f_path.dentry;
> > -	long last_index = (long)file->private_data;
> >  
> >  	lockdep_assert_held(&d_inode(dir)->i_rwsem);
> >  
> >  	if (!dir_emit_dots(file, ctx))
> >  		return 0;
> > -
> > -	offset_iterate_dir(d_inode(dir), ctx, last_index);
> > +	offset_iterate_dir(d_inode(dir), ctx, fence);
> >  	return 0;
> >  }
> >  
> 
> Using timestamps instead of directory ordering does seem less brittle,
> and the choice to use jiffies makes sense given that d_time is also an
> unsigned long.
> 
> Reviewed-by: Jeff Layton <jlayton@kernel.org>

Precisely. The goal was to re-use as much code as possible to avoid
perturbing the current size of "struct dentry".

That said, I'm not overjoyed with using jiffies, given it has
similar wrapping issues as ctx->next_offset on 32-bit systems. The
consequences of an offset value wrap are less severe, though, since
that can no longer make children entries disappear permanently.

I've been trying to imagine a solution that does not depend on the
range of an integer value and has solidly deterministic behavior in
the face of multiple child entry creations during one timer tick.

We could partially re-use the legacy cursor/list mechanism.

* When a child entry is created, it is added at the end of the
  parent's d_children list.
* When a child entry is unlinked, it is removed from the parent's
  d_children list.

This includes creation and removal of entries due to a rename.


* When a directory is opened, mark the current end of the d_children
  list with a cursor dentry. New entries would then be added to this
  directory following this cursor dentry in the directory's
  d_children list.
* When a directory is closed, its cursor dentry is removed from the
  d_children list and freed.

Each cursor dentry would need to refer to an opendir instance
(using, say, a pointer to the "struct file" for that open) so that
multiple cursors in the same directory can reside in its d_chilren
list and won't interfere with each other. Re-use the cursor dentry's
d_fsdata field for that.


* offset_readdir gets its starting entry using the mtree/xarray to
  map ctx->pos to a dentry.
* offset_readdir continues iterating by following the .next pointer
  in the current dentry's d_child field.
* offset_readdir returns EOD when it hits the cursor dentry matching
  this opendir instance.


I think all of these operations could be O(1), but it might require
some additional locking.
Christian Brauner Nov. 20, 2024, 8:59 a.m. UTC | #3
On Mon, Nov 18, 2024 at 03:58:09PM -0500, Chuck Lever wrote:
> On Mon, Nov 18, 2024 at 03:00:56PM -0500, Jeff Layton wrote:
> > On Sun, 2024-11-17 at 16:32 -0500, cel@kernel.org wrote:
> > > From: Chuck Lever <chuck.lever@oracle.com>
> > > 
> > > The fix in commit 64a7ce76fb90 ("libfs: fix infinite directory reads
> > > for offset dir") introduced a fence in offset_iterate_dir() to stop
> > > the loop from returning child entries created after the directory
> > > was opened. This comparison relies on the strong ordering of
> > > DIR_OFFSET_MIN <= largest child offset <= next_offset to terminate
> > > the directory iteration.
> > > 
> > > However, because simple_offset_add() uses mtree_alloc_cyclic() to
> > > select each next new directory offset, ctx->next_offset is not
> > > always the highest unused offset. Once mtree_alloc_cyclic() allows
> > > a new offset value to wrap, ctx->next_offset will be set to a value
> > > less than the actual largest child offset.
> > > 
> > > The result is that readdir(3) no longer shows any entries in the
> > > directory because their offsets are above ctx->next_offset, which is
> > > now a small value. This situation is persistent, and the directory
> > > cannot be removed unless all current children are already known and
> > > can be explicitly removed by name first.
> > > 
> > > In the current Maple tree implementation, there is no practical way
> > > that 63-bit offset values can ever wrap, so this issue is cleverly
> > > avoided. But the ordering dependency is not documented via comments
> > > or code, making the mechanism somewhat brittle. And it makes the
> > > continued use of mtree_alloc_cyclic() somewhat confusing.
> > > 
> > > Further, if commit 64a7ce76fb90 ("libfs: fix infinite directory
> > > reads for offset dir") were to be backported to a kernel that still
> > > uses xarray to manage simple directory offsets, the directory offset
> > > value range is limited to 32-bits, which is small enough to allow a
> > > wrap after a few weeks of constant creation of entries in one
> > > directory.
> > > 
> > > Therefore, replace the use of ctx->next_offset for fencing new
> > > children from appearing in readdir results.
> > > 
> > > A jiffies timestamp marks the end of each opendir epoch. Entries
> > > created after this timestamp will not be visible to the file
> > > descriptor. I chose jiffies so that the dentry->d_time field can be
> > > re-used for storing the entry creation time.
> > > 
> > > The new mechanism has its own corner cases. For instance, I think
> > > if jiffies wraps twice while a directory is open, some children
> > > might become invisible. On 32-bit systems, the jiffies value wraps
> > > every 49 days. Double-wrapping is not a risk on systems with 64-bit
> > > jiffies. Unlike with the next_offset-based mechanism, re-opening the
> > > directory will make invisible children re-appear.
> > > 
> > > Reported-by: Yu Kuai <yukuai3@huawei.com>
> > > Closes: https://lore.kernel.org/stable/20241111005242.34654-1-cel@kernel.org/T/#m1c448e5bd4aae3632a09468affcfe1d1594c6a59
> > > Fixes: 64a7ce76fb90 ("libfs: fix infinite directory reads for offset dir")
> > > Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> > > ---
> > >  fs/libfs.c | 36 +++++++++++++++++-------------------
> > >  1 file changed, 17 insertions(+), 19 deletions(-)
> > > 
> > > diff --git a/fs/libfs.c b/fs/libfs.c
> > > index bf67954b525b..862a603fd454 100644
> > > --- a/fs/libfs.c
> > > +++ b/fs/libfs.c
> > > @@ -294,6 +294,7 @@ int simple_offset_add(struct offset_ctx *octx, struct dentry *dentry)
> > >  		return ret;
> > >  
> > >  	offset_set(dentry, offset);
> > > +	WRITE_ONCE(dentry->d_time, jiffies);
> > >  	return 0;
> > >  }
> > >  
> > > @@ -454,9 +455,7 @@ void simple_offset_destroy(struct offset_ctx *octx)
> > >  
> > >  static int offset_dir_open(struct inode *inode, struct file *file)
> > >  {
> > > -	struct offset_ctx *ctx = inode->i_op->get_offset_ctx(inode);
> > > -
> > > -	file->private_data = (void *)ctx->next_offset;
> > > +	file->private_data = (void *)jiffies;
> > >  	return 0;
> > >  }
> > >  
> > > @@ -473,9 +472,6 @@ static int offset_dir_open(struct inode *inode, struct file *file)
> > >   */
> > >  static loff_t offset_dir_llseek(struct file *file, loff_t offset, int whence)
> > >  {
> > > -	struct inode *inode = file->f_inode;
> > > -	struct offset_ctx *ctx = inode->i_op->get_offset_ctx(inode);
> > > -
> > >  	switch (whence) {
> > >  	case SEEK_CUR:
> > >  		offset += file->f_pos;
> > > @@ -490,7 +486,8 @@ static loff_t offset_dir_llseek(struct file *file, loff_t offset, int whence)
> > >  
> > >  	/* In this case, ->private_data is protected by f_pos_lock */
> > >  	if (!offset)
> > > -		file->private_data = (void *)ctx->next_offset;
> > > +		/* Make newer child entries visible */
> > > +		file->private_data = (void *)jiffies;
> > >  	return vfs_setpos(file, offset, LONG_MAX);
> > >  }
> > >  
> > > @@ -521,7 +518,8 @@ static bool offset_dir_emit(struct dir_context *ctx, struct dentry *dentry)
> > >  			  inode->i_ino, fs_umode_to_dtype(inode->i_mode));
> > >  }
> > >  
> > > -static void offset_iterate_dir(struct inode *inode, struct dir_context *ctx, long last_index)
> > > +static void offset_iterate_dir(struct inode *inode, struct dir_context *ctx,
> > > +			       unsigned long fence)
> > >  {
> > >  	struct offset_ctx *octx = inode->i_op->get_offset_ctx(inode);
> > >  	struct dentry *dentry;
> > > @@ -531,14 +529,15 @@ static void offset_iterate_dir(struct inode *inode, struct dir_context *ctx, lon
> > >  		if (!dentry)
> > >  			return;
> > >  
> > > -		if (dentry2offset(dentry) >= last_index) {
> > > -			dput(dentry);
> > > -			return;
> > > -		}
> > > -
> > > -		if (!offset_dir_emit(ctx, dentry)) {
> > > -			dput(dentry);
> > > -			return;
> > > +		/*
> > > +		 * Output only child entries created during or before
> > > +		 * the current opendir epoch.
> > > +		 */
> > > +		if (time_before_eq(dentry->d_time, fence)) {
> > > +			if (!offset_dir_emit(ctx, dentry)) {
> > > +				dput(dentry);
> > > +				return;
> > > +			}
> > >  		}
> > >  
> > >  		ctx->pos = dentry2offset(dentry) + 1;
> > > @@ -569,15 +568,14 @@ static void offset_iterate_dir(struct inode *inode, struct dir_context *ctx, lon
> > >   */
> > >  static int offset_readdir(struct file *file, struct dir_context *ctx)
> > >  {
> > > +	unsigned long fence = (unsigned long)file->private_data;
> > >  	struct dentry *dir = file->f_path.dentry;
> > > -	long last_index = (long)file->private_data;
> > >  
> > >  	lockdep_assert_held(&d_inode(dir)->i_rwsem);
> > >  
> > >  	if (!dir_emit_dots(file, ctx))
> > >  		return 0;
> > > -
> > > -	offset_iterate_dir(d_inode(dir), ctx, last_index);
> > > +	offset_iterate_dir(d_inode(dir), ctx, fence);
> > >  	return 0;
> > >  }
> > >  
> > 
> > Using timestamps instead of directory ordering does seem less brittle,
> > and the choice to use jiffies makes sense given that d_time is also an
> > unsigned long.
> > 
> > Reviewed-by: Jeff Layton <jlayton@kernel.org>
> 
> Precisely. The goal was to re-use as much code as possible to avoid
> perturbing the current size of "struct dentry".
> 
> That said, I'm not overjoyed with using jiffies, given it has
> similar wrapping issues as ctx->next_offset on 32-bit systems. The
> consequences of an offset value wrap are less severe, though, since
> that can no longer make children entries disappear permanently.
> 
> I've been trying to imagine a solution that does not depend on the
> range of an integer value and has solidly deterministic behavior in
> the face of multiple child entry creations during one timer tick.
> 
> We could partially re-use the legacy cursor/list mechanism.
> 
> * When a child entry is created, it is added at the end of the
>   parent's d_children list.
> * When a child entry is unlinked, it is removed from the parent's
>   d_children list.
> 
> This includes creation and removal of entries due to a rename.
> 
> 
> * When a directory is opened, mark the current end of the d_children
>   list with a cursor dentry. New entries would then be added to this
>   directory following this cursor dentry in the directory's
>   d_children list.
> * When a directory is closed, its cursor dentry is removed from the
>   d_children list and freed.
> 
> Each cursor dentry would need to refer to an opendir instance
> (using, say, a pointer to the "struct file" for that open) so that
> multiple cursors in the same directory can reside in its d_chilren
> list and won't interfere with each other. Re-use the cursor dentry's
> d_fsdata field for that.
> 
> 
> * offset_readdir gets its starting entry using the mtree/xarray to
>   map ctx->pos to a dentry.
> * offset_readdir continues iterating by following the .next pointer
>   in the current dentry's d_child field.
> * offset_readdir returns EOD when it hits the cursor dentry matching
>   this opendir instance.
> 
> 
> I think all of these operations could be O(1), but it might require
> some additional locking.

This would be a bigger refactor of the whole stable offset logic. So
even if we end up doing that I think we should use the jiffies solution
for now.
Chuck Lever III Nov. 20, 2024, 3:05 p.m. UTC | #4
On Wed, Nov 20, 2024 at 09:59:54AM +0100, Christian Brauner wrote:
> On Mon, Nov 18, 2024 at 03:58:09PM -0500, Chuck Lever wrote:
> > On Mon, Nov 18, 2024 at 03:00:56PM -0500, Jeff Layton wrote:
> > > On Sun, 2024-11-17 at 16:32 -0500, cel@kernel.org wrote:
> > > > From: Chuck Lever <chuck.lever@oracle.com>
> > > > 
> > > > The fix in commit 64a7ce76fb90 ("libfs: fix infinite directory reads
> > > > for offset dir") introduced a fence in offset_iterate_dir() to stop
> > > > the loop from returning child entries created after the directory
> > > > was opened. This comparison relies on the strong ordering of
> > > > DIR_OFFSET_MIN <= largest child offset <= next_offset to terminate
> > > > the directory iteration.
> > > > 
> > > > However, because simple_offset_add() uses mtree_alloc_cyclic() to
> > > > select each next new directory offset, ctx->next_offset is not
> > > > always the highest unused offset. Once mtree_alloc_cyclic() allows
> > > > a new offset value to wrap, ctx->next_offset will be set to a value
> > > > less than the actual largest child offset.
> > > > 
> > > > The result is that readdir(3) no longer shows any entries in the
> > > > directory because their offsets are above ctx->next_offset, which is
> > > > now a small value. This situation is persistent, and the directory
> > > > cannot be removed unless all current children are already known and
> > > > can be explicitly removed by name first.
> > > > 
> > > > In the current Maple tree implementation, there is no practical way
> > > > that 63-bit offset values can ever wrap, so this issue is cleverly
> > > > avoided. But the ordering dependency is not documented via comments
> > > > or code, making the mechanism somewhat brittle. And it makes the
> > > > continued use of mtree_alloc_cyclic() somewhat confusing.
> > > > 
> > > > Further, if commit 64a7ce76fb90 ("libfs: fix infinite directory
> > > > reads for offset dir") were to be backported to a kernel that still
> > > > uses xarray to manage simple directory offsets, the directory offset
> > > > value range is limited to 32-bits, which is small enough to allow a
> > > > wrap after a few weeks of constant creation of entries in one
> > > > directory.
> > > > 
> > > > Therefore, replace the use of ctx->next_offset for fencing new
> > > > children from appearing in readdir results.
> > > > 
> > > > A jiffies timestamp marks the end of each opendir epoch. Entries
> > > > created after this timestamp will not be visible to the file
> > > > descriptor. I chose jiffies so that the dentry->d_time field can be
> > > > re-used for storing the entry creation time.
> > > > 
> > > > The new mechanism has its own corner cases. For instance, I think
> > > > if jiffies wraps twice while a directory is open, some children
> > > > might become invisible. On 32-bit systems, the jiffies value wraps
> > > > every 49 days. Double-wrapping is not a risk on systems with 64-bit
> > > > jiffies. Unlike with the next_offset-based mechanism, re-opening the
> > > > directory will make invisible children re-appear.
> > > > 
> > > > Reported-by: Yu Kuai <yukuai3@huawei.com>
> > > > Closes: https://lore.kernel.org/stable/20241111005242.34654-1-cel@kernel.org/T/#m1c448e5bd4aae3632a09468affcfe1d1594c6a59
> > > > Fixes: 64a7ce76fb90 ("libfs: fix infinite directory reads for offset dir")
> > > > Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> > > > ---
> > > >  fs/libfs.c | 36 +++++++++++++++++-------------------
> > > >  1 file changed, 17 insertions(+), 19 deletions(-)
> > > > 
> > > > diff --git a/fs/libfs.c b/fs/libfs.c
> > > > index bf67954b525b..862a603fd454 100644
> > > > --- a/fs/libfs.c
> > > > +++ b/fs/libfs.c
> > > > @@ -294,6 +294,7 @@ int simple_offset_add(struct offset_ctx *octx, struct dentry *dentry)
> > > >  		return ret;
> > > >  
> > > >  	offset_set(dentry, offset);
> > > > +	WRITE_ONCE(dentry->d_time, jiffies);
> > > >  	return 0;
> > > >  }
> > > >  
> > > > @@ -454,9 +455,7 @@ void simple_offset_destroy(struct offset_ctx *octx)
> > > >  
> > > >  static int offset_dir_open(struct inode *inode, struct file *file)
> > > >  {
> > > > -	struct offset_ctx *ctx = inode->i_op->get_offset_ctx(inode);
> > > > -
> > > > -	file->private_data = (void *)ctx->next_offset;
> > > > +	file->private_data = (void *)jiffies;
> > > >  	return 0;
> > > >  }
> > > >  
> > > > @@ -473,9 +472,6 @@ static int offset_dir_open(struct inode *inode, struct file *file)
> > > >   */
> > > >  static loff_t offset_dir_llseek(struct file *file, loff_t offset, int whence)
> > > >  {
> > > > -	struct inode *inode = file->f_inode;
> > > > -	struct offset_ctx *ctx = inode->i_op->get_offset_ctx(inode);
> > > > -
> > > >  	switch (whence) {
> > > >  	case SEEK_CUR:
> > > >  		offset += file->f_pos;
> > > > @@ -490,7 +486,8 @@ static loff_t offset_dir_llseek(struct file *file, loff_t offset, int whence)
> > > >  
> > > >  	/* In this case, ->private_data is protected by f_pos_lock */
> > > >  	if (!offset)
> > > > -		file->private_data = (void *)ctx->next_offset;
> > > > +		/* Make newer child entries visible */
> > > > +		file->private_data = (void *)jiffies;
> > > >  	return vfs_setpos(file, offset, LONG_MAX);
> > > >  }
> > > >  
> > > > @@ -521,7 +518,8 @@ static bool offset_dir_emit(struct dir_context *ctx, struct dentry *dentry)
> > > >  			  inode->i_ino, fs_umode_to_dtype(inode->i_mode));
> > > >  }
> > > >  
> > > > -static void offset_iterate_dir(struct inode *inode, struct dir_context *ctx, long last_index)
> > > > +static void offset_iterate_dir(struct inode *inode, struct dir_context *ctx,
> > > > +			       unsigned long fence)
> > > >  {
> > > >  	struct offset_ctx *octx = inode->i_op->get_offset_ctx(inode);
> > > >  	struct dentry *dentry;
> > > > @@ -531,14 +529,15 @@ static void offset_iterate_dir(struct inode *inode, struct dir_context *ctx, lon
> > > >  		if (!dentry)
> > > >  			return;
> > > >  
> > > > -		if (dentry2offset(dentry) >= last_index) {
> > > > -			dput(dentry);
> > > > -			return;
> > > > -		}
> > > > -
> > > > -		if (!offset_dir_emit(ctx, dentry)) {
> > > > -			dput(dentry);
> > > > -			return;
> > > > +		/*
> > > > +		 * Output only child entries created during or before
> > > > +		 * the current opendir epoch.
> > > > +		 */
> > > > +		if (time_before_eq(dentry->d_time, fence)) {
> > > > +			if (!offset_dir_emit(ctx, dentry)) {
> > > > +				dput(dentry);
> > > > +				return;
> > > > +			}
> > > >  		}
> > > >  
> > > >  		ctx->pos = dentry2offset(dentry) + 1;
> > > > @@ -569,15 +568,14 @@ static void offset_iterate_dir(struct inode *inode, struct dir_context *ctx, lon
> > > >   */
> > > >  static int offset_readdir(struct file *file, struct dir_context *ctx)
> > > >  {
> > > > +	unsigned long fence = (unsigned long)file->private_data;
> > > >  	struct dentry *dir = file->f_path.dentry;
> > > > -	long last_index = (long)file->private_data;
> > > >  
> > > >  	lockdep_assert_held(&d_inode(dir)->i_rwsem);
> > > >  
> > > >  	if (!dir_emit_dots(file, ctx))
> > > >  		return 0;
> > > > -
> > > > -	offset_iterate_dir(d_inode(dir), ctx, last_index);
> > > > +	offset_iterate_dir(d_inode(dir), ctx, fence);
> > > >  	return 0;
> > > >  }
> > > >  
> > > 
> > > Using timestamps instead of directory ordering does seem less brittle,
> > > and the choice to use jiffies makes sense given that d_time is also an
> > > unsigned long.
> > > 
> > > Reviewed-by: Jeff Layton <jlayton@kernel.org>
> > 
> > Precisely. The goal was to re-use as much code as possible to avoid
> > perturbing the current size of "struct dentry".
> > 
> > That said, I'm not overjoyed with using jiffies, given it has
> > similar wrapping issues as ctx->next_offset on 32-bit systems. The
> > consequences of an offset value wrap are less severe, though, since
> > that can no longer make children entries disappear permanently.
> > 
> > I've been trying to imagine a solution that does not depend on the
> > range of an integer value and has solidly deterministic behavior in
> > the face of multiple child entry creations during one timer tick.
> > 
> > We could partially re-use the legacy cursor/list mechanism.
> > 
> > * When a child entry is created, it is added at the end of the
> >   parent's d_children list.
> > * When a child entry is unlinked, it is removed from the parent's
> >   d_children list.
> > 
> > This includes creation and removal of entries due to a rename.
> > 
> > 
> > * When a directory is opened, mark the current end of the d_children
> >   list with a cursor dentry. New entries would then be added to this
> >   directory following this cursor dentry in the directory's
> >   d_children list.
> > * When a directory is closed, its cursor dentry is removed from the
> >   d_children list and freed.
> > 
> > Each cursor dentry would need to refer to an opendir instance
> > (using, say, a pointer to the "struct file" for that open) so that
> > multiple cursors in the same directory can reside in its d_chilren
> > list and won't interfere with each other. Re-use the cursor dentry's
> > d_fsdata field for that.
> > 
> > 
> > * offset_readdir gets its starting entry using the mtree/xarray to
> >   map ctx->pos to a dentry.
> > * offset_readdir continues iterating by following the .next pointer
> >   in the current dentry's d_child field.
> > * offset_readdir returns EOD when it hits the cursor dentry matching
> >   this opendir instance.
> > 
> > 
> > I think all of these operations could be O(1), but it might require
> > some additional locking.
> 
> This would be a bigger refactor of the whole stable offset logic. So
> even if we end up doing that I think we should use the jiffies solution
> for now.

How should I mark patches so they can be posted for discussion and
never applied? This series is marked RFC.

I am actually half-way through implementing the approach described
here. It is not as big a re-write as you might think, and addresses
some fundamental misunderstandings in the offset_iterate_dir() code.
Christian Brauner Nov. 21, 2024, 8:34 a.m. UTC | #5
On Wed, Nov 20, 2024 at 10:05:42AM -0500, Chuck Lever wrote:
> On Wed, Nov 20, 2024 at 09:59:54AM +0100, Christian Brauner wrote:
> > On Mon, Nov 18, 2024 at 03:58:09PM -0500, Chuck Lever wrote:
> > > On Mon, Nov 18, 2024 at 03:00:56PM -0500, Jeff Layton wrote:
> > > > On Sun, 2024-11-17 at 16:32 -0500, cel@kernel.org wrote:
> > > > > From: Chuck Lever <chuck.lever@oracle.com>
> > > > > 
> > > > > The fix in commit 64a7ce76fb90 ("libfs: fix infinite directory reads
> > > > > for offset dir") introduced a fence in offset_iterate_dir() to stop
> > > > > the loop from returning child entries created after the directory
> > > > > was opened. This comparison relies on the strong ordering of
> > > > > DIR_OFFSET_MIN <= largest child offset <= next_offset to terminate
> > > > > the directory iteration.
> > > > > 
> > > > > However, because simple_offset_add() uses mtree_alloc_cyclic() to
> > > > > select each next new directory offset, ctx->next_offset is not
> > > > > always the highest unused offset. Once mtree_alloc_cyclic() allows
> > > > > a new offset value to wrap, ctx->next_offset will be set to a value
> > > > > less than the actual largest child offset.
> > > > > 
> > > > > The result is that readdir(3) no longer shows any entries in the
> > > > > directory because their offsets are above ctx->next_offset, which is
> > > > > now a small value. This situation is persistent, and the directory
> > > > > cannot be removed unless all current children are already known and
> > > > > can be explicitly removed by name first.
> > > > > 
> > > > > In the current Maple tree implementation, there is no practical way
> > > > > that 63-bit offset values can ever wrap, so this issue is cleverly
> > > > > avoided. But the ordering dependency is not documented via comments
> > > > > or code, making the mechanism somewhat brittle. And it makes the
> > > > > continued use of mtree_alloc_cyclic() somewhat confusing.
> > > > > 
> > > > > Further, if commit 64a7ce76fb90 ("libfs: fix infinite directory
> > > > > reads for offset dir") were to be backported to a kernel that still
> > > > > uses xarray to manage simple directory offsets, the directory offset
> > > > > value range is limited to 32-bits, which is small enough to allow a
> > > > > wrap after a few weeks of constant creation of entries in one
> > > > > directory.
> > > > > 
> > > > > Therefore, replace the use of ctx->next_offset for fencing new
> > > > > children from appearing in readdir results.
> > > > > 
> > > > > A jiffies timestamp marks the end of each opendir epoch. Entries
> > > > > created after this timestamp will not be visible to the file
> > > > > descriptor. I chose jiffies so that the dentry->d_time field can be
> > > > > re-used for storing the entry creation time.
> > > > > 
> > > > > The new mechanism has its own corner cases. For instance, I think
> > > > > if jiffies wraps twice while a directory is open, some children
> > > > > might become invisible. On 32-bit systems, the jiffies value wraps
> > > > > every 49 days. Double-wrapping is not a risk on systems with 64-bit
> > > > > jiffies. Unlike with the next_offset-based mechanism, re-opening the
> > > > > directory will make invisible children re-appear.
> > > > > 
> > > > > Reported-by: Yu Kuai <yukuai3@huawei.com>
> > > > > Closes: https://lore.kernel.org/stable/20241111005242.34654-1-cel@kernel.org/T/#m1c448e5bd4aae3632a09468affcfe1d1594c6a59
> > > > > Fixes: 64a7ce76fb90 ("libfs: fix infinite directory reads for offset dir")
> > > > > Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> > > > > ---
> > > > >  fs/libfs.c | 36 +++++++++++++++++-------------------
> > > > >  1 file changed, 17 insertions(+), 19 deletions(-)
> > > > > 
> > > > > diff --git a/fs/libfs.c b/fs/libfs.c
> > > > > index bf67954b525b..862a603fd454 100644
> > > > > --- a/fs/libfs.c
> > > > > +++ b/fs/libfs.c
> > > > > @@ -294,6 +294,7 @@ int simple_offset_add(struct offset_ctx *octx, struct dentry *dentry)
> > > > >  		return ret;
> > > > >  
> > > > >  	offset_set(dentry, offset);
> > > > > +	WRITE_ONCE(dentry->d_time, jiffies);
> > > > >  	return 0;
> > > > >  }
> > > > >  
> > > > > @@ -454,9 +455,7 @@ void simple_offset_destroy(struct offset_ctx *octx)
> > > > >  
> > > > >  static int offset_dir_open(struct inode *inode, struct file *file)
> > > > >  {
> > > > > -	struct offset_ctx *ctx = inode->i_op->get_offset_ctx(inode);
> > > > > -
> > > > > -	file->private_data = (void *)ctx->next_offset;
> > > > > +	file->private_data = (void *)jiffies;
> > > > >  	return 0;
> > > > >  }
> > > > >  
> > > > > @@ -473,9 +472,6 @@ static int offset_dir_open(struct inode *inode, struct file *file)
> > > > >   */
> > > > >  static loff_t offset_dir_llseek(struct file *file, loff_t offset, int whence)
> > > > >  {
> > > > > -	struct inode *inode = file->f_inode;
> > > > > -	struct offset_ctx *ctx = inode->i_op->get_offset_ctx(inode);
> > > > > -
> > > > >  	switch (whence) {
> > > > >  	case SEEK_CUR:
> > > > >  		offset += file->f_pos;
> > > > > @@ -490,7 +486,8 @@ static loff_t offset_dir_llseek(struct file *file, loff_t offset, int whence)
> > > > >  
> > > > >  	/* In this case, ->private_data is protected by f_pos_lock */
> > > > >  	if (!offset)
> > > > > -		file->private_data = (void *)ctx->next_offset;
> > > > > +		/* Make newer child entries visible */
> > > > > +		file->private_data = (void *)jiffies;
> > > > >  	return vfs_setpos(file, offset, LONG_MAX);
> > > > >  }
> > > > >  
> > > > > @@ -521,7 +518,8 @@ static bool offset_dir_emit(struct dir_context *ctx, struct dentry *dentry)
> > > > >  			  inode->i_ino, fs_umode_to_dtype(inode->i_mode));
> > > > >  }
> > > > >  
> > > > > -static void offset_iterate_dir(struct inode *inode, struct dir_context *ctx, long last_index)
> > > > > +static void offset_iterate_dir(struct inode *inode, struct dir_context *ctx,
> > > > > +			       unsigned long fence)
> > > > >  {
> > > > >  	struct offset_ctx *octx = inode->i_op->get_offset_ctx(inode);
> > > > >  	struct dentry *dentry;
> > > > > @@ -531,14 +529,15 @@ static void offset_iterate_dir(struct inode *inode, struct dir_context *ctx, lon
> > > > >  		if (!dentry)
> > > > >  			return;
> > > > >  
> > > > > -		if (dentry2offset(dentry) >= last_index) {
> > > > > -			dput(dentry);
> > > > > -			return;
> > > > > -		}
> > > > > -
> > > > > -		if (!offset_dir_emit(ctx, dentry)) {
> > > > > -			dput(dentry);
> > > > > -			return;
> > > > > +		/*
> > > > > +		 * Output only child entries created during or before
> > > > > +		 * the current opendir epoch.
> > > > > +		 */
> > > > > +		if (time_before_eq(dentry->d_time, fence)) {
> > > > > +			if (!offset_dir_emit(ctx, dentry)) {
> > > > > +				dput(dentry);
> > > > > +				return;
> > > > > +			}
> > > > >  		}
> > > > >  
> > > > >  		ctx->pos = dentry2offset(dentry) + 1;
> > > > > @@ -569,15 +568,14 @@ static void offset_iterate_dir(struct inode *inode, struct dir_context *ctx, lon
> > > > >   */
> > > > >  static int offset_readdir(struct file *file, struct dir_context *ctx)
> > > > >  {
> > > > > +	unsigned long fence = (unsigned long)file->private_data;
> > > > >  	struct dentry *dir = file->f_path.dentry;
> > > > > -	long last_index = (long)file->private_data;
> > > > >  
> > > > >  	lockdep_assert_held(&d_inode(dir)->i_rwsem);
> > > > >  
> > > > >  	if (!dir_emit_dots(file, ctx))
> > > > >  		return 0;
> > > > > -
> > > > > -	offset_iterate_dir(d_inode(dir), ctx, last_index);
> > > > > +	offset_iterate_dir(d_inode(dir), ctx, fence);
> > > > >  	return 0;
> > > > >  }
> > > > >  
> > > > 
> > > > Using timestamps instead of directory ordering does seem less brittle,
> > > > and the choice to use jiffies makes sense given that d_time is also an
> > > > unsigned long.
> > > > 
> > > > Reviewed-by: Jeff Layton <jlayton@kernel.org>
> > > 
> > > Precisely. The goal was to re-use as much code as possible to avoid
> > > perturbing the current size of "struct dentry".
> > > 
> > > That said, I'm not overjoyed with using jiffies, given it has
> > > similar wrapping issues as ctx->next_offset on 32-bit systems. The
> > > consequences of an offset value wrap are less severe, though, since
> > > that can no longer make children entries disappear permanently.
> > > 
> > > I've been trying to imagine a solution that does not depend on the
> > > range of an integer value and has solidly deterministic behavior in
> > > the face of multiple child entry creations during one timer tick.
> > > 
> > > We could partially re-use the legacy cursor/list mechanism.
> > > 
> > > * When a child entry is created, it is added at the end of the
> > >   parent's d_children list.
> > > * When a child entry is unlinked, it is removed from the parent's
> > >   d_children list.
> > > 
> > > This includes creation and removal of entries due to a rename.
> > > 
> > > 
> > > * When a directory is opened, mark the current end of the d_children
> > >   list with a cursor dentry. New entries would then be added to this
> > >   directory following this cursor dentry in the directory's
> > >   d_children list.
> > > * When a directory is closed, its cursor dentry is removed from the
> > >   d_children list and freed.
> > > 
> > > Each cursor dentry would need to refer to an opendir instance
> > > (using, say, a pointer to the "struct file" for that open) so that
> > > multiple cursors in the same directory can reside in its d_chilren
> > > list and won't interfere with each other. Re-use the cursor dentry's
> > > d_fsdata field for that.
> > > 
> > > 
> > > * offset_readdir gets its starting entry using the mtree/xarray to
> > >   map ctx->pos to a dentry.
> > > * offset_readdir continues iterating by following the .next pointer
> > >   in the current dentry's d_child field.
> > > * offset_readdir returns EOD when it hits the cursor dentry matching
> > >   this opendir instance.
> > > 
> > > 
> > > I think all of these operations could be O(1), but it might require
> > > some additional locking.
> > 
> > This would be a bigger refactor of the whole stable offset logic. So
> > even if we end up doing that I think we should use the jiffies solution
> > for now.
> 
> How should I mark patches so they can be posted for discussion and
> never applied? This series is marked RFC.

There's no reason to not have it tested. Generally I don't apply RFCs
but this code has caused various issues over multiple kernel releases so
I like to test this early.

> 
> I am actually half-way through implementing the approach described
> here. It is not as big a re-write as you might think, and addresses
> some fundamental misunderstandings in the offset_iterate_dir() code.

Ok, great then let's see it.
Chuck Lever III Nov. 21, 2024, 2:54 p.m. UTC | #6
> On Nov 21, 2024, at 3:34 AM, Christian Brauner <brauner@kernel.org> wrote:
> 
> On Wed, Nov 20, 2024 at 10:05:42AM -0500, Chuck Lever wrote:
>> On Wed, Nov 20, 2024 at 09:59:54AM +0100, Christian Brauner wrote:
>>> On Mon, Nov 18, 2024 at 03:58:09PM -0500, Chuck Lever wrote:
>>>> 
>>>> I've been trying to imagine a solution that does not depend on the
>>>> range of an integer value and has solidly deterministic behavior in
>>>> the face of multiple child entry creations during one timer tick.
>>>> 
>>>> We could partially re-use the legacy cursor/list mechanism.
>>>> 
>>>> * When a child entry is created, it is added at the end of the
>>>>  parent's d_children list.
>>>> * When a child entry is unlinked, it is removed from the parent's
>>>>  d_children list.
>>>> 
>>>> This includes creation and removal of entries due to a rename.
>>>> 
>>>> 
>>>> * When a directory is opened, mark the current end of the d_children
>>>>  list with a cursor dentry. New entries would then be added to this
>>>>  directory following this cursor dentry in the directory's
>>>>  d_children list.
>>>> * When a directory is closed, its cursor dentry is removed from the
>>>>  d_children list and freed.
>>>> 
>>>> Each cursor dentry would need to refer to an opendir instance
>>>> (using, say, a pointer to the "struct file" for that open) so that
>>>> multiple cursors in the same directory can reside in its d_chilren
>>>> list and won't interfere with each other. Re-use the cursor dentry's
>>>> d_fsdata field for that.
>>>> 
>>>> 
>>>> * offset_readdir gets its starting entry using the mtree/xarray to
>>>>  map ctx->pos to a dentry.
>>>> * offset_readdir continues iterating by following the .next pointer
>>>>  in the current dentry's d_child field.
>>>> * offset_readdir returns EOD when it hits the cursor dentry matching
>>>>  this opendir instance.
>>>> 
>>>> 
>>>> I think all of these operations could be O(1), but it might require
>>>> some additional locking.
>>> 
>>> This would be a bigger refactor of the whole stable offset logic. So
>>> even if we end up doing that I think we should use the jiffies solution
>>> for now.
>> 
>> How should I mark patches so they can be posted for discussion and
>> never applied? This series is marked RFC.
> 
> There's no reason to not have it tested.

1/2 is reasonable to apply.

2/2 is work-in-progress. So, fair enough, if you are applying
for testing purposes.


> Generally I don't apply RFCs
> but this code has caused various issues over multiple kernel releases so
> I like to test this early.

The biggest problem has been that I haven't found an
authoritative and comprehensive explanation of how
readdir / getdents needs to behave around renames.

The previous cursor-based solution has always been a "best
effort" implementation, and most of the other file systems
have interesting gaps and heuristics in this area. It's
difficult to piece all of these together to get the idealized
design requirements, and also a get a sense of where
compromises can be made.

Any advice/guidance is welcome.


>> I am actually half-way through implementing the approach described
>> here. It is not as big a re-write as you might think, and addresses
>> some fundamental misunderstandings in the offset_iterate_dir() code.
> 
> Ok, great then let's see it.

I'm finishing it up now. Unfortunately I have several other
(NFSD and not) bugs I'm working through.

I will note that tmpfs hangs during generic/449 for me 100%
of the time; the failure appears unrelated to renames. Do you
know if there is regular CI being done for tmpfs? I'm planning
to add it to my nightly test rig once I'm done here.


--
Chuck Lever
Hugh Dickins Nov. 21, 2024, 9:18 p.m. UTC | #7
On Thu, 21 Nov 2024, Chuck Lever III wrote:
> 
> I will note that tmpfs hangs during generic/449 for me 100%
> of the time; the failure appears unrelated to renames. Do you
> know if there is regular CI being done for tmpfs? I'm planning
> to add it to my nightly test rig once I'm done here.

For me generic/449 did not hang, just took a long time to discover
something uninteresting and eventually declare "not run".  Took
14 minutes six years ago, when I gave up on it and short-circuited
the "not run" with the patch below.

(I carry about twenty patches for my own tmpfs fstests testing; but
many of those are just for ancient 32-bit environment, or to suit the
"huge=always" option. I never have enough time/priority to review and
post them, but can send you a tarball if they might of use to you.)

generic/449 is one of those tests which expects metadata to occupy
space inside the "disk", in a way which it does not on tmpfs (and a
quick glance at its history suggests btrfs also had issues with it).

[PATCH] generic/449: not run on tmpfs earlier

Do not waste 14 minutes to discover that tmpfs succeeds in
setting acls despite running out of space for user attrs.

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 tests/generic/449 | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/tests/generic/449 b/tests/generic/449
index 9cf814ad..a52a992b 100755
--- a/tests/generic/449
+++ b/tests/generic/449
@@ -22,6 +22,11 @@ _require_test
 _require_acls
 _require_attrs trusted
 
+if [ "$FSTYP" = "tmpfs" ]; then
+	# Do not waste 14 minutes to discover this:
+	_notrun "$FSTYP succeeds in setting acls despite running out of space for user attrs"
+fi
+
 _scratch_mkfs_sized $((256 * 1024 * 1024)) >> $seqres.full 2>&1
 _scratch_mount || _fail "mount failed"
Chuck Lever III Nov. 21, 2024, 9:29 p.m. UTC | #8
On Thu, Nov 21, 2024 at 01:18:05PM -0800, Hugh Dickins wrote:
> On Thu, 21 Nov 2024, Chuck Lever III wrote:
> > 
> > I will note that tmpfs hangs during generic/449 for me 100%
> > of the time; the failure appears unrelated to renames. Do you
> > know if there is regular CI being done for tmpfs? I'm planning
> > to add it to my nightly test rig once I'm done here.
> 
> For me generic/449 did not hang, just took a long time to discover
> something uninteresting and eventually declare "not run".  Took
> 14 minutes six years ago, when I gave up on it and short-circuited
> the "not run" with the patch below.
> 
> (I carry about twenty patches for my own tmpfs fstests testing; but
> many of those are just for ancient 32-bit environment, or to suit the
> "huge=always" option. I never have enough time/priority to review and
> post them, but can send you a tarball if they might of use to you.)
> 
> generic/449 is one of those tests which expects metadata to occupy
> space inside the "disk", in a way which it does not on tmpfs (and a
> quick glance at its history suggests btrfs also had issues with it).
> 
> [PATCH] generic/449: not run on tmpfs earlier
> 
> Do not waste 14 minutes to discover that tmpfs succeeds in
> setting acls despite running out of space for user attrs.
> 
> Signed-off-by: Hugh Dickins <hughd@google.com>
> ---
>  tests/generic/449 | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/tests/generic/449 b/tests/generic/449
> index 9cf814ad..a52a992b 100755
> --- a/tests/generic/449
> +++ b/tests/generic/449
> @@ -22,6 +22,11 @@ _require_test
>  _require_acls
>  _require_attrs trusted
>  
> +if [ "$FSTYP" = "tmpfs" ]; then
> +	# Do not waste 14 minutes to discover this:
> +	_notrun "$FSTYP succeeds in setting acls despite running out of space for user attrs"
> +fi
> +
>  _scratch_mkfs_sized $((256 * 1024 * 1024)) >> $seqres.full 2>&1
>  _scratch_mount || _fail "mount failed"
>  
> -- 
> 2.35.3

My approach (until I could look into the failure more) has been
similar:

diff --git a/tests/generic/449 b/tests/generic/449
index 9cf814ad326c..8307a43ce87f 100755
--- a/tests/generic/449
+++ b/tests/generic/449
@@ -21,6 +21,7 @@ _require_scratch
 _require_test
 _require_acls
 _require_attrs trusted
+_supported_fs ^nfs ^overlay ^tmpfs
 
 _scratch_mkfs_sized $((256 * 1024 * 1024)) >> $seqres.full 2>&1
 _scratch_mount || _fail "mount failed"


I stole it from somewhere else, so it's not tmpfs-specific.
Amir Goldstein Nov. 22, 2024, 8:49 a.m. UTC | #9
On Thu, Nov 21, 2024 at 10:30 PM Chuck Lever <chuck.lever@oracle.com> wrote:
>
> On Thu, Nov 21, 2024 at 01:18:05PM -0800, Hugh Dickins wrote:
> > On Thu, 21 Nov 2024, Chuck Lever III wrote:
> > >
> > > I will note that tmpfs hangs during generic/449 for me 100%
> > > of the time; the failure appears unrelated to renames. Do you
> > > know if there is regular CI being done for tmpfs? I'm planning
> > > to add it to my nightly test rig once I'm done here.
> >
> > For me generic/449 did not hang, just took a long time to discover
> > something uninteresting and eventually declare "not run".  Took
> > 14 minutes six years ago, when I gave up on it and short-circuited
> > the "not run" with the patch below.
> >
> > (I carry about twenty patches for my own tmpfs fstests testing; but
> > many of those are just for ancient 32-bit environment, or to suit the
> > "huge=always" option. I never have enough time/priority to review and
> > post them, but can send you a tarball if they might of use to you.)
> >
> > generic/449 is one of those tests which expects metadata to occupy
> > space inside the "disk", in a way which it does not on tmpfs (and a
> > quick glance at its history suggests btrfs also had issues with it).
> >
> > [PATCH] generic/449: not run on tmpfs earlier
> >
> > Do not waste 14 minutes to discover that tmpfs succeeds in
> > setting acls despite running out of space for user attrs.
> >
> > Signed-off-by: Hugh Dickins <hughd@google.com>
> > ---
> >  tests/generic/449 | 5 +++++
> >  1 file changed, 5 insertions(+)
> >
> > diff --git a/tests/generic/449 b/tests/generic/449
> > index 9cf814ad..a52a992b 100755
> > --- a/tests/generic/449
> > +++ b/tests/generic/449
> > @@ -22,6 +22,11 @@ _require_test
> >  _require_acls
> >  _require_attrs trusted
> >
> > +if [ "$FSTYP" = "tmpfs" ]; then
> > +     # Do not waste 14 minutes to discover this:
> > +     _notrun "$FSTYP succeeds in setting acls despite running out of space for user attrs"
> > +fi
> > +
> >  _scratch_mkfs_sized $((256 * 1024 * 1024)) >> $seqres.full 2>&1
> >  _scratch_mount || _fail "mount failed"
> >
> > --
> > 2.35.3
>
> My approach (until I could look into the failure more) has been
> similar:
>
> diff --git a/tests/generic/449 b/tests/generic/449
> index 9cf814ad326c..8307a43ce87f 100755
> --- a/tests/generic/449
> +++ b/tests/generic/449
> @@ -21,6 +21,7 @@ _require_scratch
>  _require_test
>  _require_acls
>  _require_attrs trusted
> +_supported_fs ^nfs ^overlay ^tmpfs
>

nfs and overlay are _notrun because they do not support _scratch_mkfs_sized

>  _scratch_mkfs_sized $((256 * 1024 * 1024)) >> $seqres.full 2>&1
>  _scratch_mount || _fail "mount failed"
>
>
> I stole it from somewhere else, so it's not tmpfs-specific.

I think opt-out for a certain fs makes sense in some tests, but it is
prefered to describe the requirement that is behind the opt-out.

For example, you thought that nfs,overlay,tmpfs should all opt-out
from this test. Why? Which property do they share in common and
how can it be described in a generic way?

I am not talking about a property that can be checked.
Sometimes we need to make groups of filesystems that share a common
property that cannot be tested, to better express the requirements.

_fstyp_has_non_default_seek_data_hole() is the only example that
comes to mind but there could be others.

Thanks,
Amir.
Christian Brauner Nov. 22, 2024, 12:57 p.m. UTC | #10
On Thu, Nov 21, 2024 at 02:54:16PM +0000, Chuck Lever III wrote:
> 
> 
> > On Nov 21, 2024, at 3:34 AM, Christian Brauner <brauner@kernel.org> wrote:
> > 
> > On Wed, Nov 20, 2024 at 10:05:42AM -0500, Chuck Lever wrote:
> >> On Wed, Nov 20, 2024 at 09:59:54AM +0100, Christian Brauner wrote:
> >>> On Mon, Nov 18, 2024 at 03:58:09PM -0500, Chuck Lever wrote:
> >>>> 
> >>>> I've been trying to imagine a solution that does not depend on the
> >>>> range of an integer value and has solidly deterministic behavior in
> >>>> the face of multiple child entry creations during one timer tick.
> >>>> 
> >>>> We could partially re-use the legacy cursor/list mechanism.
> >>>> 
> >>>> * When a child entry is created, it is added at the end of the
> >>>>  parent's d_children list.
> >>>> * When a child entry is unlinked, it is removed from the parent's
> >>>>  d_children list.
> >>>> 
> >>>> This includes creation and removal of entries due to a rename.
> >>>> 
> >>>> 
> >>>> * When a directory is opened, mark the current end of the d_children
> >>>>  list with a cursor dentry. New entries would then be added to this
> >>>>  directory following this cursor dentry in the directory's
> >>>>  d_children list.
> >>>> * When a directory is closed, its cursor dentry is removed from the
> >>>>  d_children list and freed.
> >>>> 
> >>>> Each cursor dentry would need to refer to an opendir instance
> >>>> (using, say, a pointer to the "struct file" for that open) so that
> >>>> multiple cursors in the same directory can reside in its d_chilren
> >>>> list and won't interfere with each other. Re-use the cursor dentry's
> >>>> d_fsdata field for that.
> >>>> 
> >>>> 
> >>>> * offset_readdir gets its starting entry using the mtree/xarray to
> >>>>  map ctx->pos to a dentry.
> >>>> * offset_readdir continues iterating by following the .next pointer
> >>>>  in the current dentry's d_child field.
> >>>> * offset_readdir returns EOD when it hits the cursor dentry matching
> >>>>  this opendir instance.
> >>>> 
> >>>> 
> >>>> I think all of these operations could be O(1), but it might require
> >>>> some additional locking.
> >>> 
> >>> This would be a bigger refactor of the whole stable offset logic. So
> >>> even if we end up doing that I think we should use the jiffies solution
> >>> for now.
> >> 
> >> How should I mark patches so they can be posted for discussion and
> >> never applied? This series is marked RFC.
> > 
> > There's no reason to not have it tested.
> 
> 1/2 is reasonable to apply.
> 
> 2/2 is work-in-progress. So, fair enough, if you are applying
> for testing purposes.
> 
> 
> > Generally I don't apply RFCs
> > but this code has caused various issues over multiple kernel releases so
> > I like to test this early.
> 
> The biggest problem has been that I haven't found an
> authoritative and comprehensive explanation of how
> readdir / getdents needs to behave around renames.
> 
> The previous cursor-based solution has always been a "best
> effort" implementation, and most of the other file systems
> have interesting gaps and heuristics in this area. It's
> difficult to piece all of these together to get the idealized
> design requirements, and also a get a sense of where
> compromises can be made.
> 
> Any advice/guidance is welcome.

I didn't mean to make it sound like you did anything wrong or I was
blaming you. I was literally just trying to say we had weird behavior in
this area for legitimate reasons. Posix states this:

    If posix_getdents() is called concurrently with an operation that
    adds, deletes, or modifies a directory entry, the results from
    posix_getdents() shall reflect either all of the effects of the
    concurrent operation or none of them. If a sequence of calls to
    posix_getdents() is made that reads from offset zero to end-of-file
    and a file is removed from or added to the directory between the
    first and last of those calls, whether the sequence of calls returns
    an entry for that file is unspecified.

Which to me all reads like we're pretty much free in what to do as long
as we clearly document it.

> >> I am actually half-way through implementing the approach described
> >> here. It is not as big a re-write as you might think, and addresses
> >> some fundamental misunderstandings in the offset_iterate_dir() code.
> > 
> > Ok, great then let's see it.
> 
> I'm finishing it up now. Unfortunately I have several other
> (NFSD and not) bugs I'm working through.

No horry.
Chuck Lever III Nov. 22, 2024, 2:24 p.m. UTC | #11
> On Nov 22, 2024, at 3:49 AM, Amir Goldstein <amir73il@gmail.com> wrote:
> 
> On Thu, Nov 21, 2024 at 10:30 PM Chuck Lever <chuck.lever@oracle.com> wrote:
>> 
>> On Thu, Nov 21, 2024 at 01:18:05PM -0800, Hugh Dickins wrote:
>>> On Thu, 21 Nov 2024, Chuck Lever III wrote:
>>>> 
>>>> I will note that tmpfs hangs during generic/449 for me 100%
>>>> of the time; the failure appears unrelated to renames. Do you
>>>> know if there is regular CI being done for tmpfs? I'm planning
>>>> to add it to my nightly test rig once I'm done here.
>>> 
>>> For me generic/449 did not hang, just took a long time to discover
>>> something uninteresting and eventually declare "not run".  Took
>>> 14 minutes six years ago, when I gave up on it and short-circuited
>>> the "not run" with the patch below.
>>> 
>>> (I carry about twenty patches for my own tmpfs fstests testing; but
>>> many of those are just for ancient 32-bit environment, or to suit the
>>> "huge=always" option. I never have enough time/priority to review and
>>> post them, but can send you a tarball if they might of use to you.)
>>> 
>>> generic/449 is one of those tests which expects metadata to occupy
>>> space inside the "disk", in a way which it does not on tmpfs (and a
>>> quick glance at its history suggests btrfs also had issues with it).
>>> 
>>> [PATCH] generic/449: not run on tmpfs earlier
>>> 
>>> Do not waste 14 minutes to discover that tmpfs succeeds in
>>> setting acls despite running out of space for user attrs.
>>> 
>>> Signed-off-by: Hugh Dickins <hughd@google.com>
>>> ---
>>> tests/generic/449 | 5 +++++
>>> 1 file changed, 5 insertions(+)
>>> 
>>> diff --git a/tests/generic/449 b/tests/generic/449
>>> index 9cf814ad..a52a992b 100755
>>> --- a/tests/generic/449
>>> +++ b/tests/generic/449
>>> @@ -22,6 +22,11 @@ _require_test
>>> _require_acls
>>> _require_attrs trusted
>>> 
>>> +if [ "$FSTYP" = "tmpfs" ]; then
>>> +     # Do not waste 14 minutes to discover this:
>>> +     _notrun "$FSTYP succeeds in setting acls despite running out of space for user attrs"
>>> +fi
>>> +
>>> _scratch_mkfs_sized $((256 * 1024 * 1024)) >> $seqres.full 2>&1
>>> _scratch_mount || _fail "mount failed"
>>> 
>>> --
>>> 2.35.3
>> 
>> My approach (until I could look into the failure more) has been
>> similar:
>> 
>> diff --git a/tests/generic/449 b/tests/generic/449
>> index 9cf814ad326c..8307a43ce87f 100755
>> --- a/tests/generic/449
>> +++ b/tests/generic/449
>> @@ -21,6 +21,7 @@ _require_scratch
>> _require_test
>> _require_acls
>> _require_attrs trusted
>> +_supported_fs ^nfs ^overlay ^tmpfs
>> 
> 
> nfs and overlay are _notrun because they do not support _scratch_mkfs_sized
> 
>> _scratch_mkfs_sized $((256 * 1024 * 1024)) >> $seqres.full 2>&1
>> _scratch_mount || _fail "mount failed"
>> 
>> 
>> I stole it from somewhere else, so it's not tmpfs-specific.
> 
> I think opt-out for a certain fs makes sense in some tests, but it is
> prefered to describe the requirement that is behind the opt-out.
> 
> For example, you thought that nfs,overlay,tmpfs should all opt-out
> from this test. Why? Which property do they share in common and
> how can it be described in a generic way?
> 
> I am not talking about a property that can be checked.
> Sometimes we need to make groups of filesystems that share a common
> property that cannot be tested, to better express the requirements.
> 
> _fstyp_has_non_default_seek_data_hole() is the only example that
> comes to mind but there could be others.

Adding a rationale is sensible. I don't have one, but
here is the limit of my thinking:

"^nfs" is in this mix because I test NFSD with a tmpfs
export semi-regularly, and it suffers from the same
problem.

"^overlay" doesn't have any reason to be here except
that it was part of the line I stole from elsewhere.

But the top-level question is "does it make sense to
exclude tmpfs from generic/449, or is there something
that should be fixed to make it pass quickly?"


--
Chuck Lever
diff mbox series

Patch

diff --git a/fs/libfs.c b/fs/libfs.c
index bf67954b525b..862a603fd454 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -294,6 +294,7 @@  int simple_offset_add(struct offset_ctx *octx, struct dentry *dentry)
 		return ret;
 
 	offset_set(dentry, offset);
+	WRITE_ONCE(dentry->d_time, jiffies);
 	return 0;
 }
 
@@ -454,9 +455,7 @@  void simple_offset_destroy(struct offset_ctx *octx)
 
 static int offset_dir_open(struct inode *inode, struct file *file)
 {
-	struct offset_ctx *ctx = inode->i_op->get_offset_ctx(inode);
-
-	file->private_data = (void *)ctx->next_offset;
+	file->private_data = (void *)jiffies;
 	return 0;
 }
 
@@ -473,9 +472,6 @@  static int offset_dir_open(struct inode *inode, struct file *file)
  */
 static loff_t offset_dir_llseek(struct file *file, loff_t offset, int whence)
 {
-	struct inode *inode = file->f_inode;
-	struct offset_ctx *ctx = inode->i_op->get_offset_ctx(inode);
-
 	switch (whence) {
 	case SEEK_CUR:
 		offset += file->f_pos;
@@ -490,7 +486,8 @@  static loff_t offset_dir_llseek(struct file *file, loff_t offset, int whence)
 
 	/* In this case, ->private_data is protected by f_pos_lock */
 	if (!offset)
-		file->private_data = (void *)ctx->next_offset;
+		/* Make newer child entries visible */
+		file->private_data = (void *)jiffies;
 	return vfs_setpos(file, offset, LONG_MAX);
 }
 
@@ -521,7 +518,8 @@  static bool offset_dir_emit(struct dir_context *ctx, struct dentry *dentry)
 			  inode->i_ino, fs_umode_to_dtype(inode->i_mode));
 }
 
-static void offset_iterate_dir(struct inode *inode, struct dir_context *ctx, long last_index)
+static void offset_iterate_dir(struct inode *inode, struct dir_context *ctx,
+			       unsigned long fence)
 {
 	struct offset_ctx *octx = inode->i_op->get_offset_ctx(inode);
 	struct dentry *dentry;
@@ -531,14 +529,15 @@  static void offset_iterate_dir(struct inode *inode, struct dir_context *ctx, lon
 		if (!dentry)
 			return;
 
-		if (dentry2offset(dentry) >= last_index) {
-			dput(dentry);
-			return;
-		}
-
-		if (!offset_dir_emit(ctx, dentry)) {
-			dput(dentry);
-			return;
+		/*
+		 * Output only child entries created during or before
+		 * the current opendir epoch.
+		 */
+		if (time_before_eq(dentry->d_time, fence)) {
+			if (!offset_dir_emit(ctx, dentry)) {
+				dput(dentry);
+				return;
+			}
 		}
 
 		ctx->pos = dentry2offset(dentry) + 1;
@@ -569,15 +568,14 @@  static void offset_iterate_dir(struct inode *inode, struct dir_context *ctx, lon
  */
 static int offset_readdir(struct file *file, struct dir_context *ctx)
 {
+	unsigned long fence = (unsigned long)file->private_data;
 	struct dentry *dir = file->f_path.dentry;
-	long last_index = (long)file->private_data;
 
 	lockdep_assert_held(&d_inode(dir)->i_rwsem);
 
 	if (!dir_emit_dots(file, ctx))
 		return 0;
-
-	offset_iterate_dir(d_inode(dir), ctx, last_index);
+	offset_iterate_dir(d_inode(dir), ctx, fence);
 	return 0;
 }