diff mbox series

[v2,2/2] fuse: remove tmp folio for writebacks and internal rb tree

Message ID 20241014182228.1941246-3-joannelkoong@gmail.com (mailing list archive)
State New
Headers show
Series fuse: remove extra page copies in writeback | expand

Commit Message

Joanne Koong Oct. 14, 2024, 6:22 p.m. UTC
Currently, we allocate and copy data to a temporary folio when
handling writeback in order to mitigate the following deadlock scenario
that may arise if reclaim waits on writeback to complete:
* single-threaded FUSE server is in the middle of handling a request
  that needs a memory allocation
* memory allocation triggers direct reclaim
* direct reclaim waits on a folio under writeback
* the FUSE server can't write back the folio since it's stuck in
  direct reclaim

To work around this, we allocate a temporary folio and copy over the
original folio to the temporary folio so that writeback can be
immediately cleared on the original folio. This additionally requires us
to maintain an internal rb tree to keep track of writeback state on the
temporary folios.

This change sets AS_NO_WRITEBACK_RECLAIM on the inode mapping so that
FUSE folios are not reclaimed and waited on while in writeback, and
removes the temporary folio + extra copying and the internal rb tree.

fio benchmarks --
(using averages observed from 10 runs, throwing away outliers)

Setup:
sudo mount -t tmpfs -o size=30G tmpfs ~/tmp_mount
 ./libfuse/build/example/passthrough_ll -o writeback -o max_threads=4 -o source=~/tmp_mount ~/fuse_mount

fio --name=writeback --ioengine=sync --rw=write --bs={1k,4k,1M} --size=2G
--numjobs=2 --ramp_time=30 --group_reporting=1 --directory=/root/fuse_mount

        bs =  1k          4k            1M
Before  351 MiB/s     1818 MiB/s     1851 MiB/s
After   341 MiB/s     2246 MiB/s     2685 MiB/s
% diff        -3%          23%         45%

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
---
 fs/fuse/file.c | 322 +++++--------------------------------------------
 1 file changed, 28 insertions(+), 294 deletions(-)

Comments

Miklos Szeredi Oct. 15, 2024, 10:01 a.m. UTC | #1
On Mon, 14 Oct 2024 at 20:23, Joanne Koong <joannelkoong@gmail.com> wrote:

> This change sets AS_NO_WRITEBACK_RECLAIM on the inode mapping so that
> FUSE folios are not reclaimed and waited on while in writeback, and
> removes the temporary folio + extra copying and the internal rb tree.

What about sync(2)?   And page migration?

Hopefully there are no other cases, but I think a careful review of
places where generic code waits for writeback is needed before we can
say for sure.

Thanks,
Miklos
Joanne Koong Oct. 15, 2024, 5:06 p.m. UTC | #2
On Tue, Oct 15, 2024 at 3:01 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
>
> On Mon, 14 Oct 2024 at 20:23, Joanne Koong <joannelkoong@gmail.com> wrote:
>
> > This change sets AS_NO_WRITEBACK_RECLAIM on the inode mapping so that
> > FUSE folios are not reclaimed and waited on while in writeback, and
> > removes the temporary folio + extra copying and the internal rb tree.
>
> What about sync(2)?   And page migration?
>
> Hopefully there are no other cases, but I think a careful review of
> places where generic code waits for writeback is needed before we can
> say for sure.

Sounds good, that's a great point. I'll audit the other places in the
mm code where we might call folio_wait_writeback() and report back
what I find.


Thanks,
Joanne

>
> Thanks,
> Miklos
Shakeel Butt Oct. 15, 2024, 7:17 p.m. UTC | #3
On Tue, Oct 15, 2024 at 10:06:52AM GMT, Joanne Koong wrote:
> On Tue, Oct 15, 2024 at 3:01 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
> >
> > On Mon, 14 Oct 2024 at 20:23, Joanne Koong <joannelkoong@gmail.com> wrote:
> >
> > > This change sets AS_NO_WRITEBACK_RECLAIM on the inode mapping so that
> > > FUSE folios are not reclaimed and waited on while in writeback, and
> > > removes the temporary folio + extra copying and the internal rb tree.
> >
> > What about sync(2)?   And page migration?
> >
> > Hopefully there are no other cases, but I think a careful review of
> > places where generic code waits for writeback is needed before we can
> > say for sure.
> 
> Sounds good, that's a great point. I'll audit the other places in the
> mm code where we might call folio_wait_writeback() and report back
> what I find.
> 
> 

So, any operation that the fuse server can do which can cause wait on
writeback on the folios backed by the fuse is problematic. We know about
one scenario i.e. memory allocation causing reclaim which may do the
wait on unrelated folios which may be backed by the fuse server.

Now there are ways fuse server can shoot itself on the foot. Like sync()
syscall or accessing the folios backed by itself. The quesion is should
we really need to protect fuse from such cases?
Jingbo Xu Oct. 16, 2024, 9:44 a.m. UTC | #4
On 10/16/24 3:17 AM, Shakeel Butt wrote:
> On Tue, Oct 15, 2024 at 10:06:52AM GMT, Joanne Koong wrote:
>> On Tue, Oct 15, 2024 at 3:01 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
>>>
>>> On Mon, 14 Oct 2024 at 20:23, Joanne Koong <joannelkoong@gmail.com> wrote:
>>>
>>>> This change sets AS_NO_WRITEBACK_RECLAIM on the inode mapping so that
>>>> FUSE folios are not reclaimed and waited on while in writeback, and
>>>> removes the temporary folio + extra copying and the internal rb tree.
>>>
>>> What about sync(2)?   And page migration?
>>>
>>> Hopefully there are no other cases, but I think a careful review of
>>> places where generic code waits for writeback is needed before we can
>>> say for sure.
>>
>> Sounds good, that's a great point. I'll audit the other places in the
>> mm code where we might call folio_wait_writeback() and report back
>> what I find.
>>
>>
> 
> So, any operation that the fuse server can do which can cause wait on
> writeback on the folios backed by the fuse is problematic. We know about
> one scenario i.e. memory allocation causing reclaim which may do the
> wait on unrelated folios which may be backed by the fuse server.
> 
> Now there are ways fuse server can shoot itself on the foot. Like sync()
> syscall or accessing the folios backed by itself. The quesion is should
> we really need to protect fuse from such cases?

I think there are several scenarios shall be evaluated case by case:

1) a non-malicious fuse daemon wants to allocate some memory when
processing a fuse request, which in turn leads to memory reclaim and
thus waiting on the writeback of fuse dirty pages - a deadlock here.
This is reasonable and also the target scenario that this series wants
to fix.

2) a malicious fuse daemon refuses to process any request; or a buggy or
not well-written fuse daemon as you described, e.g. may call sync(2)
itself or access page cache backed by itself, then
  2.1) any unrelated user process attempting to initiate a sync(2)
itself, will hang there.  This scenario is also unexpected and shall be
fixed.
  2.2) any direct user of fuse filesystem (e.g. access files backed by
fuse filesystem) will hang in this case.  IMHO this is expected, and the
impact is affordable as it is controlled within certain range (the
direct user of the fs).
Miklos Szeredi Oct. 16, 2024, 9:51 a.m. UTC | #5
On Tue, 15 Oct 2024 at 21:17, Shakeel Butt <shakeel.butt@linux.dev> wrote:

> So, any operation that the fuse server can do which can cause wait on
> writeback on the folios backed by the fuse is problematic. We know about
> one scenario i.e. memory allocation causing reclaim which may do the
> wait on unrelated folios which may be backed by the fuse server.
>
> Now there are ways fuse server can shoot itself on the foot. Like sync()
> syscall or accessing the folios backed by itself. The quesion is should
> we really need to protect fuse from such cases?

That's not the issue with sync(2).  The issue is that a fuse server
can deny service to an unrelated and possibly higher privilege task by
blocking writeback.  We really don't want that to happen.

Thanks,
Miklos
Jingbo Xu Oct. 16, 2024, 9:56 a.m. UTC | #6
On 10/15/24 6:01 PM, Miklos Szeredi wrote:
> On Mon, 14 Oct 2024 at 20:23, Joanne Koong <joannelkoong@gmail.com> wrote:
> 
>> This change sets AS_NO_WRITEBACK_RECLAIM on the inode mapping so that
>> FUSE folios are not reclaimed and waited on while in writeback, and
>> removes the temporary folio + extra copying and the internal rb tree.
> 
> What about sync(2)? 

FYI The posix sync(2) says, "The writing, although scheduled, is not
necessarily complete upon return from sync()." [1]

Thus hopefully it won't break the posix semantics of sync(2) down if we
skip the waiting on the writeback of fuse pages.

[1] https://pubs.opengroup.org/onlinepubs/9699919799/functions/sync.html
Miklos Szeredi Oct. 16, 2024, 9:57 a.m. UTC | #7
On Wed, 16 Oct 2024 at 11:44, Jingbo Xu <jefflexu@linux.alibaba.com> wrote:

> 1) a non-malicious fuse daemon wants to allocate some memory when
> processing a fuse request, which in turn leads to memory reclaim and
> thus waiting on the writeback of fuse dirty pages - a deadlock here.
> This is reasonable and also the target scenario that this series wants
> to fix.
>
> 2) a malicious fuse daemon refuses to process any request; or a buggy or
> not well-written fuse daemon as you described, e.g. may call sync(2)
> itself or access page cache backed by itself, then
>   2.1) any unrelated user process attempting to initiate a sync(2)
> itself, will hang there.  This scenario is also unexpected and shall be
> fixed.

Exactly.

We only care about

 - properly written server not deadlocking

 - buggy or malicious server not denying service to unrelated tasks,
where unrelated means it would not otherwise be able to access the
fuse mount. Typically this separation is done with a user namespace or
-oallow_other.

Thanks,
Miklos


>   2.2) any direct user of fuse filesystem (e.g. access files backed by
> fuse filesystem) will hang in this case.  IMHO this is expected, and the
> impact is affordable as it is controlled within certain range (the
> direct user of the fs).
>
> --
> Thanks,
> Jingbo
Miklos Szeredi Oct. 16, 2024, 10 a.m. UTC | #8
On Wed, 16 Oct 2024 at 11:56, Jingbo Xu <jefflexu@linux.alibaba.com> wrote:

> FYI The posix sync(2) says, "The writing, although scheduled, is not
> necessarily complete upon return from sync()." [1]
>
> Thus hopefully it won't break the posix semantics of sync(2) down if we
> skip the waiting on the writeback of fuse pages.

Since this has been the case for fuse since day one and no one
complained, this is not a worry.

Thanks,
Miklos
Shakeel Butt Oct. 16, 2024, 5:52 p.m. UTC | #9
On Wed, Oct 16, 2024 at 11:51:39AM GMT, Miklos Szeredi wrote:
> On Tue, 15 Oct 2024 at 21:17, Shakeel Butt <shakeel.butt@linux.dev> wrote:
> 
> > So, any operation that the fuse server can do which can cause wait on
> > writeback on the folios backed by the fuse is problematic. We know about
> > one scenario i.e. memory allocation causing reclaim which may do the
> > wait on unrelated folios which may be backed by the fuse server.
> >
> > Now there are ways fuse server can shoot itself on the foot. Like sync()
> > syscall or accessing the folios backed by itself. The quesion is should
> > we really need to protect fuse from such cases?
> 
> That's not the issue with sync(2).  The issue is that a fuse server
> can deny service to an unrelated and possibly higher privilege task by
> blocking writeback.  We really don't want that to happen.

If I understand you correctly, you are saying fuse server doing wrong
things like accessing the files it is serving is not something we need
to care about. More specifically all the operations which directly
manipulates the folios it is serving (like migration) should be ignored.
Is this correct?
Miklos Szeredi Oct. 16, 2024, 6:37 p.m. UTC | #10
On Wed, 16 Oct 2024 at 19:52, Shakeel Butt <shakeel.butt@linux.dev> wrote:

> If I understand you correctly, you are saying fuse server doing wrong
> things like accessing the files it is serving is not something we need
> to care about.

I don't think detecting such recursion is feasible or even generally possible.

> More specifically all the operations which directly
> manipulates the folios it is serving (like migration) should be ignored.
> Is this correct?

Um, not sure I understand.  If migration can be triggered on fuse
folios that results in the task that triggered the migration to wait
on the fuse folio, then that's bad.  Ignoring fuse folios is the only
solution that I can see, and that's basically what the current temp
page copy does.

Sprinkling mm code with fuse specific conditionals seems the only
solution if we want to get rid of the temp page thing.  Hopefully
there aren't too many of those.

Thanks,
Miklos
Shakeel Butt Oct. 16, 2024, 9:27 p.m. UTC | #11
On Wed, Oct 16, 2024 at 08:37:12PM GMT, Miklos Szeredi wrote:
> On Wed, 16 Oct 2024 at 19:52, Shakeel Butt <shakeel.butt@linux.dev> wrote:
> 
> > If I understand you correctly, you are saying fuse server doing wrong
> > things like accessing the files it is serving is not something we need
> > to care about.
> 
> I don't think detecting such recursion is feasible or even generally possible.
> 
> > More specifically all the operations which directly
> > manipulates the folios it is serving (like migration) should be ignored.
> > Is this correct?
> 
> Um, not sure I understand.  If migration can be triggered on fuse
> folios that results in the task that triggered the migration to wait
> on the fuse folio, then that's bad.

Why is it bad? I can understand fuse server getting blocked on fuse
folios is bad but why it is bad for other applications/tasks? I am
wondering network filesystems have to handle similar situation then why
is it bad just for fuse?

> Ignoring fuse folios is the only
> solution that I can see, and that's basically what the current temp
> page copy does.
> 
> Sprinkling mm code with fuse specific conditionals seems the only
> solution if we want to get rid of the temp page thing.  Hopefully
> there aren't too many of those.

It might be a bit more than sprinkling. The reclaim code has to activate
the folio to avoid reclaiming the folio in near future. I am not sure
what we will need to do for move_pages() syscall.
Miklos Szeredi Oct. 17, 2024, 1:31 p.m. UTC | #12
On Wed, 16 Oct 2024 at 23:27, Shakeel Butt <shakeel.butt@linux.dev> wrote:

> Why is it bad? I can understand fuse server getting blocked on fuse
> folios is bad but why it is bad for other applications/tasks? I am
> wondering network filesystems have to handle similar situation then why
> is it bad just for fuse?

You need privileges (physical access) to unplug the network cable.
You don't need privileges (in most setups) to run a fuse server.

> It might be a bit more than sprinkling. The reclaim code has to activate
> the folio to avoid reclaiming the folio in near future. I am not sure
> what we will need to do for move_pages() syscall.

Maybe move_pages() is okay, because it is explicitly targeting a fuse
mmap.   Is this the only way to trigger MIGRATE_SYNC?

Thanks,
Miklos
Joanne Koong Oct. 18, 2024, 1:30 a.m. UTC | #13
On Tue, Oct 15, 2024 at 3:01 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
>
> On Mon, 14 Oct 2024 at 20:23, Joanne Koong <joannelkoong@gmail.com> wrote:
>
> > This change sets AS_NO_WRITEBACK_RECLAIM on the inode mapping so that
> > FUSE folios are not reclaimed and waited on while in writeback, and
> > removes the temporary folio + extra copying and the internal rb tree.
>
> What about sync(2)?   And page migration?
>
> Hopefully there are no other cases, but I think a careful review of
> places where generic code waits for writeback is needed before we can
> say for sure.

The places where I see this potential deadlock still being possible are:
* page migration when handling a page fault:
     In particular, this path: handle_mm_fault() ->
__handle_mm_fault() -> handle_pte_fault() -> do_numa_page() ->
migrate_misplaced_folio() -> migrate_pages() -> migrate_pages_sync()
-> migrate_pages_batch() -> migrate_folio_unmap() ->
folio_wait_writeback()
* syscalls that trigger waits on writeback, which will lead to
deadlock if a single-threaded fuse server calls this when servicing
requests:
    - sync(), sync_file_range(), fsync(), fdatasync()
    - swapoff()
    - move_pages()

I need to analyze the page fault path more to get a clearer picture of
what is happening, but so far this looks like a valid case for a
correctly written fuse server to run into.
For the syscalls however, is it valid/safe in general (disregarding
the writeback deadlock scenario for a minute) for fuse servers to be
invoking these syscalls in their handlers anyways?

The other places where I see a generic wait on writeback seem safe:
* splice, page_cache_pipe_buf_try_steal() (fs/splice.c):
   We hit this in fuse when we try to move a page from the pipe buffer
into the page cache (fuse_try_move_page()) for the SPLICE_F_MOVE case.
This wait seems fine, since the folio that's being waited on is the
folio in the pipe buffer which is not a fuse folio.
* memory failure (mm/memory_failure.c):
   Soft offlining a page and handling page memory failure - these can
be triggered asynchronously (memory_failure_work_func()), but this
should be fine for the fuse use case since the server isn't blocked on
servicing any writeback requests while memory failure handling is
waiting on writeback
* page truncation (mm/truncate.c):
   Same here. These cases seem fine since the server isn't blocked on
servicing writeback requests while truncation waits on writeback


Thanks,
Joanne

>
> Thanks,
> Miklos
Shakeel Butt Oct. 18, 2024, 5:31 a.m. UTC | #14
On Thu, Oct 17, 2024 at 03:31:48PM GMT, Miklos Szeredi wrote:
> On Wed, 16 Oct 2024 at 23:27, Shakeel Butt <shakeel.butt@linux.dev> wrote:
> 
> > Why is it bad? I can understand fuse server getting blocked on fuse
> > folios is bad but why it is bad for other applications/tasks? I am
> > wondering network filesystems have to handle similar situation then why
> > is it bad just for fuse?
> 
> You need privileges (physical access) to unplug the network cable.
> You don't need privileges (in most setups) to run a fuse server.

I feel like this is too much restrictive and I am still not sure why
blocking on fuse folios served by non-privileges fuse server is worse
than blocking on folios served from the network.

> 
> > It might be a bit more than sprinkling. The reclaim code has to activate
> > the folio to avoid reclaiming the folio in near future. I am not sure
> > what we will need to do for move_pages() syscall.
> 
> Maybe move_pages() is okay, because it is explicitly targeting a fuse
> mmap.   Is this the only way to trigger MIGRATE_SYNC?

It can happen through migrate_pages(), mbind(), compaction which can
caused by hugepage allcations.
Shakeel Butt Oct. 18, 2024, 5:57 a.m. UTC | #15
On Thu, Oct 17, 2024 at 06:30:08PM GMT, Joanne Koong wrote:
> On Tue, Oct 15, 2024 at 3:01 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
> >
> > On Mon, 14 Oct 2024 at 20:23, Joanne Koong <joannelkoong@gmail.com> wrote:
> >
> > > This change sets AS_NO_WRITEBACK_RECLAIM on the inode mapping so that
> > > FUSE folios are not reclaimed and waited on while in writeback, and
> > > removes the temporary folio + extra copying and the internal rb tree.
> >
> > What about sync(2)?   And page migration?
> >
> > Hopefully there are no other cases, but I think a careful review of
> > places where generic code waits for writeback is needed before we can
> > say for sure.
> 
> The places where I see this potential deadlock still being possible are:
> * page migration when handling a page fault:
>      In particular, this path: handle_mm_fault() ->
> __handle_mm_fault() -> handle_pte_fault() -> do_numa_page() ->
> migrate_misplaced_folio() -> migrate_pages() -> migrate_pages_sync()
> -> migrate_pages_batch() -> migrate_folio_unmap() ->
> folio_wait_writeback()

So, this is numa fault and if fuse server is not mapping the fuse folios
which it is serving, in its address space then this is not an issue.
However hugepage allocation on page fault can cause compaction which
might migrate unrelated fuse folios. So, fuse server doing compaction
is an issue and we need to resolve similar to reclaim codepath. (Though
I think for THP it is not doing MIGRATE_SYNC but doing for gigantic
hugetlb pages).

> * syscalls that trigger waits on writeback, which will lead to
> deadlock if a single-threaded fuse server calls this when servicing
> requests:
>     - sync(), sync_file_range(), fsync(), fdatasync()
>     - swapoff()
>     - move_pages()
> 
> I need to analyze the page fault path more to get a clearer picture of
> what is happening, but so far this looks like a valid case for a
> correctly written fuse server to run into.
> For the syscalls however, is it valid/safe in general (disregarding
> the writeback deadlock scenario for a minute) for fuse servers to be
> invoking these syscalls in their handlers anyways?
> 
> The other places where I see a generic wait on writeback seem safe:
> * splice, page_cache_pipe_buf_try_steal() (fs/splice.c):
>    We hit this in fuse when we try to move a page from the pipe buffer
> into the page cache (fuse_try_move_page()) for the SPLICE_F_MOVE case.
> This wait seems fine, since the folio that's being waited on is the
> folio in the pipe buffer which is not a fuse folio.
> * memory failure (mm/memory_failure.c):
>    Soft offlining a page and handling page memory failure - these can
> be triggered asynchronously (memory_failure_work_func()), but this
> should be fine for the fuse use case since the server isn't blocked on
> servicing any writeback requests while memory failure handling is
> waiting on writeback
> * page truncation (mm/truncate.c):
>    Same here. These cases seem fine since the server isn't blocked on
> servicing writeback requests while truncation waits on writeback
> 
> 
> Thanks,
> Joanne
> 
> >
> > Thanks,
> > Miklos
Jingbo Xu Oct. 18, 2024, 9:24 a.m. UTC | #16
On 10/15/24 2:22 AM, Joanne Koong wrote:
>  static void fuse_writepage_args_page_fill(struct fuse_writepage_args *wpa, struct folio *folio,
> -					  struct folio *tmp_folio, uint32_t page_index)
> +					  uint32_t page_index)
>  {
>  	struct inode *inode = folio->mapping->host;
>  	struct fuse_args_pages *ap = &wpa->ia.ap;
>  
> -	folio_copy(tmp_folio, folio);
> -
> -	ap->pages[page_index] = &tmp_folio->page;
> +	folio_get(folio);

Why folio_get() is needed?


>  
> @@ -2203,68 +2047,11 @@ static void fuse_writepages_send(struct fuse_fill_wb_data *data)
>  	struct fuse_writepage_args *wpa = data->wpa;
>  	struct inode *inode = data->inode;
>  	struct fuse_inode *fi = get_fuse_inode(inode);
> -	int num_pages = wpa->ia.ap.num_pages;
> -	int i;
>  
>  	spin_lock(&fi->lock);
>  	list_add_tail(&wpa->queue_entry, &fi->queued_writes);

Could we also drop fi->queued_writes list and writectr counter after we
dropping the temp folio?  I'm not sure.  When I grep all callsites of
fuse_set_nowrite(), it seems that apart from the direct IO
path(fuse_direct_io -> fuse_sync_writes), the truncation related code
also relies on this writectr mechanism.
Joanne Koong Oct. 18, 2024, 7:57 p.m. UTC | #17
On Thu, Oct 17, 2024 at 10:57 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> On Thu, Oct 17, 2024 at 06:30:08PM GMT, Joanne Koong wrote:
> > On Tue, Oct 15, 2024 at 3:01 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
> > >
> > > On Mon, 14 Oct 2024 at 20:23, Joanne Koong <joannelkoong@gmail.com> wrote:
> > >
> > > > This change sets AS_NO_WRITEBACK_RECLAIM on the inode mapping so that
> > > > FUSE folios are not reclaimed and waited on while in writeback, and
> > > > removes the temporary folio + extra copying and the internal rb tree.
> > >
> > > What about sync(2)?   And page migration?
> > >
> > > Hopefully there are no other cases, but I think a careful review of
> > > places where generic code waits for writeback is needed before we can
> > > say for sure.
> >
> > The places where I see this potential deadlock still being possible are:
> > * page migration when handling a page fault:
> >      In particular, this path: handle_mm_fault() ->
> > __handle_mm_fault() -> handle_pte_fault() -> do_numa_page() ->
> > migrate_misplaced_folio() -> migrate_pages() -> migrate_pages_sync()
> > -> migrate_pages_batch() -> migrate_folio_unmap() ->
> > folio_wait_writeback()
>
> So, this is numa fault and if fuse server is not mapping the fuse folios
> which it is serving, in its address space then this is not an issue.
> However hugepage allocation on page fault can cause compaction which
> might migrate unrelated fuse folios. So, fuse server doing compaction
> is an issue and we need to resolve similar to reclaim codepath. (Though
> I think for THP it is not doing MIGRATE_SYNC but doing for gigantic
> hugetlb pages).

Thanks for the explanation. Would you mind pointing me to the
compaction function where this triggers the migrate? Is this in
compact_zone() where it calls migrate_pages() on the cc->migratepages
list?

>
> > * syscalls that trigger waits on writeback, which will lead to
> > deadlock if a single-threaded fuse server calls this when servicing
> > requests:
> >     - sync(), sync_file_range(), fsync(), fdatasync()
> >     - swapoff()
> >     - move_pages()
> >
> > I need to analyze the page fault path more to get a clearer picture of
> > what is happening, but so far this looks like a valid case for a
> > correctly written fuse server to run into.
> > For the syscalls however, is it valid/safe in general (disregarding
> > the writeback deadlock scenario for a minute) for fuse servers to be
> > invoking these syscalls in their handlers anyways?
> >
> > The other places where I see a generic wait on writeback seem safe:
> > * splice, page_cache_pipe_buf_try_steal() (fs/splice.c):
> >    We hit this in fuse when we try to move a page from the pipe buffer
> > into the page cache (fuse_try_move_page()) for the SPLICE_F_MOVE case.
> > This wait seems fine, since the folio that's being waited on is the
> > folio in the pipe buffer which is not a fuse folio.
> > * memory failure (mm/memory_failure.c):
> >    Soft offlining a page and handling page memory failure - these can
> > be triggered asynchronously (memory_failure_work_func()), but this
> > should be fine for the fuse use case since the server isn't blocked on
> > servicing any writeback requests while memory failure handling is
> > waiting on writeback
> > * page truncation (mm/truncate.c):
> >    Same here. These cases seem fine since the server isn't blocked on
> > servicing writeback requests while truncation waits on writeback
> >
> >
> > Thanks,
> > Joanne
> >
> > >
> > > Thanks,
> > > Miklos
Joanne Koong Oct. 18, 2024, 8:29 p.m. UTC | #18
On Fri, Oct 18, 2024 at 2:24 AM Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
>
>
> On 10/15/24 2:22 AM, Joanne Koong wrote:
> >  static void fuse_writepage_args_page_fill(struct fuse_writepage_args *wpa, struct folio *folio,
> > -                                       struct folio *tmp_folio, uint32_t page_index)
> > +                                       uint32_t page_index)
> >  {
> >       struct inode *inode = folio->mapping->host;
> >       struct fuse_args_pages *ap = &wpa->ia.ap;
> >
> > -     folio_copy(tmp_folio, folio);
> > -
> > -     ap->pages[page_index] = &tmp_folio->page;
> > +     folio_get(folio);
>
> Why folio_get() is needed?
>

It's not strictly needed but I added this to ensure that the folio
will always be valid when we later on access it. My understanding is
that it's an implementation detail rather than a guarantee that the
folio can't be removed from the mapping until writeback finishes.

>
> >
> > @@ -2203,68 +2047,11 @@ static void fuse_writepages_send(struct fuse_fill_wb_data *data)
> >       struct fuse_writepage_args *wpa = data->wpa;
> >       struct inode *inode = data->inode;
> >       struct fuse_inode *fi = get_fuse_inode(inode);
> > -     int num_pages = wpa->ia.ap.num_pages;
> > -     int i;
> >
> >       spin_lock(&fi->lock);
> >       list_add_tail(&wpa->queue_entry, &fi->queued_writes);
>
> Could we also drop fi->queued_writes list and writectr counter after we
> dropping the temp folio?  I'm not sure.  When I grep all callsites of
> fuse_set_nowrite(), it seems that apart from the direct IO
> path(fuse_direct_io -> fuse_sync_writes), the truncation related code
> also relies on this writectr mechanism.

I'm not sure either, but I think we still need it. My understanding is
that the main purpose of this writectr is to detect when the inode has
pending writebacks. I think even without the temporary pages, we still
would need a way overall to track if the inode is writing back any
folio in its mapping.


Thanks,
Joanne

>
>
>
> --
> Thanks,
> Jingbo
Shakeel Butt Oct. 18, 2024, 8:46 p.m. UTC | #19
On Fri, Oct 18, 2024 at 12:57:08PM GMT, Joanne Koong wrote:
> On Thu, Oct 17, 2024 at 10:57 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> >
> > On Thu, Oct 17, 2024 at 06:30:08PM GMT, Joanne Koong wrote:
> > > On Tue, Oct 15, 2024 at 3:01 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
> > > >
> > > > On Mon, 14 Oct 2024 at 20:23, Joanne Koong <joannelkoong@gmail.com> wrote:
> > > >
> > > > > This change sets AS_NO_WRITEBACK_RECLAIM on the inode mapping so that
> > > > > FUSE folios are not reclaimed and waited on while in writeback, and
> > > > > removes the temporary folio + extra copying and the internal rb tree.
> > > >
> > > > What about sync(2)?   And page migration?
> > > >
> > > > Hopefully there are no other cases, but I think a careful review of
> > > > places where generic code waits for writeback is needed before we can
> > > > say for sure.
> > >
> > > The places where I see this potential deadlock still being possible are:
> > > * page migration when handling a page fault:
> > >      In particular, this path: handle_mm_fault() ->
> > > __handle_mm_fault() -> handle_pte_fault() -> do_numa_page() ->
> > > migrate_misplaced_folio() -> migrate_pages() -> migrate_pages_sync()
> > > -> migrate_pages_batch() -> migrate_folio_unmap() ->
> > > folio_wait_writeback()
> >
> > So, this is numa fault and if fuse server is not mapping the fuse folios
> > which it is serving, in its address space then this is not an issue.
> > However hugepage allocation on page fault can cause compaction which
> > might migrate unrelated fuse folios. So, fuse server doing compaction
> > is an issue and we need to resolve similar to reclaim codepath. (Though
> > I think for THP it is not doing MIGRATE_SYNC but doing for gigantic
> > hugetlb pages).
> 
> Thanks for the explanation. Would you mind pointing me to the
> compaction function where this triggers the migrate? Is this in
> compact_zone() where it calls migrate_pages() on the cc->migratepages
> list?
> 

Something like the following:

__alloc_pages_direct_compact() ->
try_to_compact_pages() ->
compact_zone_order() -> /* MIGRATE_ASYNC or MIGRATE_SYNC_LIGHT */
compact_zone() ->
migrate_pages() ->
migrate_pages_sync() ->
migrate_pages_batch() ->
migrate_folio_unmap() ->
folio_wait_writeback()


The following is one code path from hugetlb:

alloc_contig_range_noprof() -> /* MIGRATE_SYNC */
__alloc_contig_migrate_range() ->
migrate_pages() ->
migrate_pages_sync() ->
migrate_pages_batch() ->
migrate_folio_unmap() ->
folio_wait_writeback()
Miklos Szeredi Oct. 21, 2024, 9:32 a.m. UTC | #20
On Fri, 18 Oct 2024 at 03:30, Joanne Koong <joannelkoong@gmail.com> wrote:

> I need to analyze the page fault path more to get a clearer picture of
> what is happening, but so far this looks like a valid case for a
> correctly written fuse server to run into.

Yes.

> For the syscalls however, is it valid/safe in general (disregarding
> the writeback deadlock scenario for a minute) for fuse servers to be
> invoking these syscalls in their handlers anyways?

Generally no.  Any kind of recursion in fuse is a landmine.

E.g. CVE-2019-20794 was created to track an issue with a fuse server
going unkillable on namespace shutdown.  It didn't occur to the
reporter that this is just a special case of a plain old recursive
deadlock, because it happens to be triggered by kill.  But recursion
is clearly there: there's a file descriptor referring to the same fuse
mount that is being served.  When this fd is closed at process exit
the recursion is triggered and the thing deadlocks.  The fix: move the
recursive part of the code to a different process.  But people seem to
believe that recursion is okay and the kernel should deal with that
:-/

> The other places where I see a generic wait on writeback seem safe:
> * splice, page_cache_pipe_buf_try_steal() (fs/splice.c):
>    We hit this in fuse when we try to move a page from the pipe buffer
> into the page cache (fuse_try_move_page()) for the SPLICE_F_MOVE case.
> This wait seems fine, since the folio that's being waited on is the
> folio in the pipe buffer which is not a fuse folio.
> * memory failure (mm/memory_failure.c):
>    Soft offlining a page and handling page memory failure - these can
> be triggered asynchronously (memory_failure_work_func()), but this
> should be fine for the fuse use case since the server isn't blocked on
> servicing any writeback requests while memory failure handling is
> waiting on writeback
> * page truncation (mm/truncate.c):
>    Same here. These cases seem fine since the server isn't blocked on
> servicing writeback requests while truncation waits on writeback

Right.

Thanks,
Miklos
Miklos Szeredi Oct. 21, 2024, 10:15 a.m. UTC | #21
On Fri, 18 Oct 2024 at 07:31, Shakeel Butt <shakeel.butt@linux.dev> wrote:

> I feel like this is too much restrictive and I am still not sure why
> blocking on fuse folios served by non-privileges fuse server is worse
> than blocking on folios served from the network.

Might be.  But historically fuse had this behavior and I'd be very
reluctant to change that unconditionally.

With a systemwide maximal timeout for fuse requests it might make
sense to allow sync(2), etc. to wait for fuse writeback.

Without a timeout allowing fuse servers to block sync(2) indefinitely
seems rather risky.

Thanks,
Miklos
Shakeel Butt Oct. 21, 2024, 5:01 p.m. UTC | #22
On Mon, Oct 21, 2024 at 12:15:36PM GMT, Miklos Szeredi wrote:
> On Fri, 18 Oct 2024 at 07:31, Shakeel Butt <shakeel.butt@linux.dev> wrote:
> 
> > I feel like this is too much restrictive and I am still not sure why
> > blocking on fuse folios served by non-privileges fuse server is worse
> > than blocking on folios served from the network.
> 
> Might be.  But historically fuse had this behavior and I'd be very
> reluctant to change that unconditionally.
> 
> With a systemwide maximal timeout for fuse requests it might make
> sense to allow sync(2), etc. to wait for fuse writeback.
> 
> Without a timeout allowing fuse servers to block sync(2) indefinitely
> seems rather risky.
> 

Thanks Miklos for the response. Just to be clear on where we disagree, let
me point out what I think is right and please tell me where you
disagree:

1. Fuse server should never access fuse folios (and files, directories,
   mounts, etc) directly it is providing.

2. Fuse server should not get blocked indirectly on the fuse folios (and
   related objects). This series is removing one such scenario caused
   due to reclaim.

3. Non fuse server processes can be blocked on fuse folios (and related
   objects) directly and indirectly.

Am I understanding correctly that we disagree on (3)?

thanks,
Shakeel
Joanne Koong Oct. 21, 2024, 9:05 p.m. UTC | #23
On Mon, Oct 21, 2024 at 3:15 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
>
> On Fri, 18 Oct 2024 at 07:31, Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> > I feel like this is too much restrictive and I am still not sure why
> > blocking on fuse folios served by non-privileges fuse server is worse
> > than blocking on folios served from the network.
>
> Might be.  But historically fuse had this behavior and I'd be very
> reluctant to change that unconditionally.
>
> With a systemwide maximal timeout for fuse requests it might make
> sense to allow sync(2), etc. to wait for fuse writeback.
>
> Without a timeout allowing fuse servers to block sync(2) indefinitely
> seems rather risky.

Could we skip waiting on writeback in sync(2) if it's a fuse folio?
That seems in line with the sync(2) documentation Jingbo referenced
earlier where it states "The writing, although scheduled, is not
necessarily complete upon return from sync()."
https://pubs.opengroup.org/onlinepubs/9699919799/functions/sync.html


Thanks,
Joanne
>
> Thanks,
> Miklos
Miklos Szeredi Oct. 22, 2024, 3:03 p.m. UTC | #24
On Mon, 21 Oct 2024 at 19:01, Shakeel Butt <shakeel.butt@linux.dev> wrote:

> Thanks Miklos for the response. Just to be clear on where we disagree, let
> me point out what I think is right and please tell me where you
> disagree:
>
> 1. Fuse server should never access fuse folios (and files, directories,
>    mounts, etc) directly it is providing.

Correct.

> 2. Fuse server should not get blocked indirectly on the fuse folios (and
>    related objects). This series is removing one such scenario caused
>    due to reclaim.

Correct.

>
> 3. Non fuse server processes can be blocked on fuse folios (and related
>    objects) directly and indirectly.

Agree on the direct access part, but disagree about allowing to block
unrelated tasks.

Accessing a fuse backed object is basically agreeing to give the fuse
server extra privileges.  Documentation/filesystems/fuse.rst explains
this below "How do non-privileged mounts work?".

Thanks,
Miklos
Joanne Koong Oct. 24, 2024, 4:54 p.m. UTC | #25
On Mon, Oct 21, 2024 at 2:05 PM Joanne Koong <joannelkoong@gmail.com> wrote:
>
> On Mon, Oct 21, 2024 at 3:15 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
> >
> > On Fri, 18 Oct 2024 at 07:31, Shakeel Butt <shakeel.butt@linux.dev> wrote:
> >
> > > I feel like this is too much restrictive and I am still not sure why
> > > blocking on fuse folios served by non-privileges fuse server is worse
> > > than blocking on folios served from the network.
> >
> > Might be.  But historically fuse had this behavior and I'd be very
> > reluctant to change that unconditionally.
> >
> > With a systemwide maximal timeout for fuse requests it might make
> > sense to allow sync(2), etc. to wait for fuse writeback.
> >
> > Without a timeout allowing fuse servers to block sync(2) indefinitely
> > seems rather risky.
>
> Could we skip waiting on writeback in sync(2) if it's a fuse folio?
> That seems in line with the sync(2) documentation Jingbo referenced
> earlier where it states "The writing, although scheduled, is not
> necessarily complete upon return from sync()."
> https://pubs.opengroup.org/onlinepubs/9699919799/functions/sync.html
>

So I think the answer to this is "no" for Linux. What the Linux man
page for sync(2) says:

"According to the standard specification (e.g., POSIX.1-2001), sync()
schedules the writes, but may return before the actual writing is
done.  However Linux waits for I/O completions, and thus sync() or
syncfs() provide the same guarantees as fsync() called on every file
in the system or filesystem respectively." [1]

Regardless of the compaction / page migration issue then, this
blocking sync(2) is a dealbreaker.

I see two workarounds around this:

1) Optimistically skip the tmp page but add a timeout where if the
server doesn't reply to the writeback in X seconds, then allocate the
tmp folio and clear writeback immediately on the original folio).
This would address any page migration deadlocks as well. And probably
we don't need the reclaim patch either then, since that could also be
handled by the timeout.
This would make 99% of writebacks fast but in the case of a malicious
fuse server, could make sync() and page migration wait an extra X
seconds.

2) Only skip the tmp folio for privileged fuse servers (we'd still
need to address the page migration path)

Would love to hear thoughts on this. Should we go ahead with option 1?

[1] https://man7.org/linux/man-pages/man2/sync.2.html

>
> Thanks,
> Joanne
> >
> > Thanks,
> > Miklos
Jingbo Xu Oct. 25, 2024, 1:38 a.m. UTC | #26
On 10/25/24 12:54 AM, Joanne Koong wrote:
> On Mon, Oct 21, 2024 at 2:05 PM Joanne Koong <joannelkoong@gmail.com> wrote:
>>
>> On Mon, Oct 21, 2024 at 3:15 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
>>>
>>> On Fri, 18 Oct 2024 at 07:31, Shakeel Butt <shakeel.butt@linux.dev> wrote:
>>>
>>>> I feel like this is too much restrictive and I am still not sure why
>>>> blocking on fuse folios served by non-privileges fuse server is worse
>>>> than blocking on folios served from the network.
>>>
>>> Might be.  But historically fuse had this behavior and I'd be very
>>> reluctant to change that unconditionally.
>>>
>>> With a systemwide maximal timeout for fuse requests it might make
>>> sense to allow sync(2), etc. to wait for fuse writeback.
>>>
>>> Without a timeout allowing fuse servers to block sync(2) indefinitely
>>> seems rather risky.
>>
>> Could we skip waiting on writeback in sync(2) if it's a fuse folio?
>> That seems in line with the sync(2) documentation Jingbo referenced
>> earlier where it states "The writing, although scheduled, is not
>> necessarily complete upon return from sync()."
>> https://pubs.opengroup.org/onlinepubs/9699919799/functions/sync.html
>>
> 
> So I think the answer to this is "no" for Linux. What the Linux man
> page for sync(2) says:
> 
> "According to the standard specification (e.g., POSIX.1-2001), sync()
> schedules the writes, but may return before the actual writing is
> done.  However Linux waits for I/O completions, and thus sync() or
> syncfs() provide the same guarantees as fsync() called on every file
> in the system or filesystem respectively." [1]

Actually as for FUSE, IIUC the writeback is not guaranteed to be
completed when sync(2) returns since the temp page mechanism.  When
sync(2) returns, PG_writeback is indeed cleared for all original pages
(in the address_space), while the real writeback work (initiated from
temp page) may be still in progress.

I think this is also what Miklos means in:
https://lore.kernel.org/all/CAJfpegsJKD4YT5R5qfXXE=hyqKvhpTRbD4m1wsYNbGB6k4rC2A@mail.gmail.com/

Though we need special handling for AS_NO_WRITEBACK_RECLAIM marked pages
in sync(2) codepath similar to what we have done for the direct reclaim
in patch 1.


> 
> Regardless of the compaction / page migration issue then, this
> blocking sync(2) is a dealbreaker.

I really should have figureg out the compaction / page migration
mechanism and the potential impact to FUSE when we dropping the temp
page.  Just too busy to take some time on this though.....
Miklos Szeredi Oct. 25, 2024, 3:32 p.m. UTC | #27
On Fri, 25 Oct 2024 at 03:38, Jingbo Xu <jefflexu@linux.alibaba.com> wrote:

> Actually as for FUSE, IIUC the writeback is not guaranteed to be
> completed when sync(2) returns since the temp page mechanism.  When
> sync(2) returns, PG_writeback is indeed cleared for all original pages
> (in the address_space), while the real writeback work (initiated from
> temp page) may be still in progress.

Correct, this is the current behavior of fuse.

I'm not against changing this for the privileged server case.  I.e. a
server running with certain privileges (e.g. CAP_SYS_ADMIN in the
global namespace) then it should be able to opt in to waiting sync(2)
behavior.

> I think this is also what Miklos means in:
> https://lore.kernel.org/all/CAJfpegsJKD4YT5R5qfXXE=hyqKvhpTRbD4m1wsYNbGB6k4rC2A@mail.gmail.com/
>
> Though we need special handling for AS_NO_WRITEBACK_RECLAIM marked pages
> in sync(2) codepath similar to what we have done for the direct reclaim
> in patch 1.

I'd love to get rid of the tmp page thing completely if the same
guarantees can be implemented in the mm.

Thanks,
Miklos
Joanne Koong Oct. 25, 2024, 5:36 p.m. UTC | #28
On Thu, Oct 24, 2024 at 6:38 PM Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
>
>
>
> On 10/25/24 12:54 AM, Joanne Koong wrote:
> > On Mon, Oct 21, 2024 at 2:05 PM Joanne Koong <joannelkoong@gmail.com> wrote:
> >>
> >> On Mon, Oct 21, 2024 at 3:15 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
> >>>
> >>> On Fri, 18 Oct 2024 at 07:31, Shakeel Butt <shakeel.butt@linux.dev> wrote:
> >>>
> >>>> I feel like this is too much restrictive and I am still not sure why
> >>>> blocking on fuse folios served by non-privileges fuse server is worse
> >>>> than blocking on folios served from the network.
> >>>
> >>> Might be.  But historically fuse had this behavior and I'd be very
> >>> reluctant to change that unconditionally.
> >>>
> >>> With a systemwide maximal timeout for fuse requests it might make
> >>> sense to allow sync(2), etc. to wait for fuse writeback.
> >>>
> >>> Without a timeout allowing fuse servers to block sync(2) indefinitely
> >>> seems rather risky.
> >>
> >> Could we skip waiting on writeback in sync(2) if it's a fuse folio?
> >> That seems in line with the sync(2) documentation Jingbo referenced
> >> earlier where it states "The writing, although scheduled, is not
> >> necessarily complete upon return from sync()."
> >> https://pubs.opengroup.org/onlinepubs/9699919799/functions/sync.html
> >>
> >
> > So I think the answer to this is "no" for Linux. What the Linux man
> > page for sync(2) says:
> >
> > "According to the standard specification (e.g., POSIX.1-2001), sync()
> > schedules the writes, but may return before the actual writing is
> > done.  However Linux waits for I/O completions, and thus sync() or
> > syncfs() provide the same guarantees as fsync() called on every file
> > in the system or filesystem respectively." [1]
>
> Actually as for FUSE, IIUC the writeback is not guaranteed to be
> completed when sync(2) returns since the temp page mechanism.  When
> sync(2) returns, PG_writeback is indeed cleared for all original pages
> (in the address_space), while the real writeback work (initiated from
> temp page) may be still in progress.
>

That's a great point. It seems like we can just skip waiting on
writeback to finish for fuse folios in sync(2) altogether then. I'll
look into what's the best way to do this.

> I think this is also what Miklos means in:
> https://lore.kernel.org/all/CAJfpegsJKD4YT5R5qfXXE=hyqKvhpTRbD4m1wsYNbGB6k4rC2A@mail.gmail.com/
>
> Though we need special handling for AS_NO_WRITEBACK_RECLAIM marked pages
> in sync(2) codepath similar to what we have done for the direct reclaim
> in patch 1.
>
>
> >
> > Regardless of the compaction / page migration issue then, this
> > blocking sync(2) is a dealbreaker.
>
> I really should have figureg out the compaction / page migration
> mechanism and the potential impact to FUSE when we dropping the temp
> page.  Just too busy to take some time on this though.....

Same here, I need to look some more into the compaction / page
migration paths. I'm planning to do this early next week and will
report back with what I find.


Thanks,
Joanne
>
>
> --
> Thanks,
> Jingbo
Miklos Szeredi Oct. 25, 2024, 6:02 p.m. UTC | #29
On Fri, 25 Oct 2024 at 19:36, Joanne Koong <joannelkoong@gmail.com> wrote:

> That's a great point. It seems like we can just skip waiting on
> writeback to finish for fuse folios in sync(2) altogether then. I'll
> look into what's the best way to do this.

I just tested this, and it turns out this doesn't quite work the way
I'd expected.  I can trigger sync(2) being blocked by a suspended fuse
server:

task:kworker/u16:3   state:D stack:0     pid:172   tgid:172   ppid:2
   flags:0x00004000
Workqueue: writeback wb_workfn (flush-0:30)
Call Trace:
 __schedule+0x40b/0xad0
 schedule+0x36/0x120
 inode_sleep_on_writeback+0x9d/0xb0
 wb_writeback+0x104/0x3d0
 wb_workfn+0x325/0x490
 process_one_work+0x1d8/0x520
 worker_thread+0x1af/0x390
 kthread+0xcc/0x100
 ret_from_fork+0x2d/0x50
 ret_from_fork_asm+0x1a/0x30

task:dd              state:S stack:0     pid:1364  tgid:1364
ppid:1336   flags:0x00000002
Call Trace:
 __schedule+0x40b/0xad0
 schedule+0x36/0x120
 request_wait_answer+0x16b/0x200
 __fuse_simple_request+0xd6/0x290
 fuse_flush_times+0x119/0x140
 fuse_write_inode+0x6d/0xc0
 __writeback_single_inode+0x36d/0x480
 writeback_single_inode+0xa8/0x170
 write_inode_now+0x75/0xa0
 fuse_flush+0x85/0x1c0
 filp_flush+0x2c/0x70
 __x64_sys_close+0x2e/0x80
 do_syscall_64+0x64/0x140
 entry_SYSCALL_64_after_hwframe+0x76/0x7e

task:sync            state:D stack:0     pid:1365  tgid:1365
ppid:1336   flags:0x00004002
Call Trace:
 __schedule+0x40b/0xad0
 schedule+0x36/0x120
 wb_wait_for_completion+0x56/0x80
 sync_inodes_sb+0xc5/0x450
 iterate_supers+0x69/0xd0
 ksys_sync+0x40/0xa0
 __do_sys_sync+0xa/0x20
 do_syscall_64+0x64/0x140
 entry_SYSCALL_64_after_hwframe+0x76/0x7e

Maybe I'm too paranoid about this, and in practice we can just let
sync(2) block in any case.  But I want to understand this better
first.

Thanks,
Miklos
Joanne Koong Oct. 25, 2024, 6:19 p.m. UTC | #30
On Fri, Oct 25, 2024 at 11:02 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
>
> On Fri, 25 Oct 2024 at 19:36, Joanne Koong <joannelkoong@gmail.com> wrote:
>
> > That's a great point. It seems like we can just skip waiting on
> > writeback to finish for fuse folios in sync(2) altogether then. I'll
> > look into what's the best way to do this.
>
> I just tested this, and it turns out this doesn't quite work the way
> I'd expected.  I can trigger sync(2) being blocked by a suspended fuse
> server:
>
> task:kworker/u16:3   state:D stack:0     pid:172   tgid:172   ppid:2
>    flags:0x00004000
> Workqueue: writeback wb_workfn (flush-0:30)
> Call Trace:
>  __schedule+0x40b/0xad0
>  schedule+0x36/0x120
>  inode_sleep_on_writeback+0x9d/0xb0
>  wb_writeback+0x104/0x3d0
>  wb_workfn+0x325/0x490
>  process_one_work+0x1d8/0x520
>  worker_thread+0x1af/0x390
>  kthread+0xcc/0x100
>  ret_from_fork+0x2d/0x50
>  ret_from_fork_asm+0x1a/0x30
>
> task:dd              state:S stack:0     pid:1364  tgid:1364
> ppid:1336   flags:0x00000002
> Call Trace:
>  __schedule+0x40b/0xad0
>  schedule+0x36/0x120
>  request_wait_answer+0x16b/0x200
>  __fuse_simple_request+0xd6/0x290
>  fuse_flush_times+0x119/0x140
>  fuse_write_inode+0x6d/0xc0
>  __writeback_single_inode+0x36d/0x480
>  writeback_single_inode+0xa8/0x170
>  write_inode_now+0x75/0xa0
>  fuse_flush+0x85/0x1c0
>  filp_flush+0x2c/0x70
>  __x64_sys_close+0x2e/0x80
>  do_syscall_64+0x64/0x140
>  entry_SYSCALL_64_after_hwframe+0x76/0x7e
>
> task:sync            state:D stack:0     pid:1365  tgid:1365
> ppid:1336   flags:0x00004002
> Call Trace:
>  __schedule+0x40b/0xad0
>  schedule+0x36/0x120
>  wb_wait_for_completion+0x56/0x80
>  sync_inodes_sb+0xc5/0x450
>  iterate_supers+0x69/0xd0
>  ksys_sync+0x40/0xa0
>  __do_sys_sync+0xa/0x20
>  do_syscall_64+0x64/0x140
>  entry_SYSCALL_64_after_hwframe+0x76/0x7e
>

Thanks for the trace. If i'm understanding it correctly, this only
blocks temporarily until the writeback wb_workfn is rescheduled?

> Maybe I'm too paranoid about this, and in practice we can just let
> sync(2) block in any case.  But I want to understand this better

in the case of a malicious fuse server, it could block sync(2) forever
so I think this means we have to skip the wait for fuse folios
altogether.


Thanks,
Joanne
> first.
>
> Thanks,
> Miklos
Joanne Koong Oct. 25, 2024, 6:47 p.m. UTC | #31
On Fri, Oct 25, 2024 at 10:36 AM Joanne Koong <joannelkoong@gmail.com> wrote:
>
> On Thu, Oct 24, 2024 at 6:38 PM Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
> >
> >
> >
> > On 10/25/24 12:54 AM, Joanne Koong wrote:
> > > On Mon, Oct 21, 2024 at 2:05 PM Joanne Koong <joannelkoong@gmail.com> wrote:
> > >>
> > >> On Mon, Oct 21, 2024 at 3:15 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
> > >>>
> > >>> On Fri, 18 Oct 2024 at 07:31, Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > >>>
> > >>>> I feel like this is too much restrictive and I am still not sure why
> > >>>> blocking on fuse folios served by non-privileges fuse server is worse
> > >>>> than blocking on folios served from the network.
> > >>>
> > >>> Might be.  But historically fuse had this behavior and I'd be very
> > >>> reluctant to change that unconditionally.
> > >>>
> > >>> With a systemwide maximal timeout for fuse requests it might make
> > >>> sense to allow sync(2), etc. to wait for fuse writeback.
> > >>>
> > >>> Without a timeout allowing fuse servers to block sync(2) indefinitely
> > >>> seems rather risky.
> > >>
> > >> Could we skip waiting on writeback in sync(2) if it's a fuse folio?
> > >> That seems in line with the sync(2) documentation Jingbo referenced
> > >> earlier where it states "The writing, although scheduled, is not
> > >> necessarily complete upon return from sync()."
> > >> https://pubs.opengroup.org/onlinepubs/9699919799/functions/sync.html
> > >>
> > >
> > > So I think the answer to this is "no" for Linux. What the Linux man
> > > page for sync(2) says:
> > >
> > > "According to the standard specification (e.g., POSIX.1-2001), sync()
> > > schedules the writes, but may return before the actual writing is
> > > done.  However Linux waits for I/O completions, and thus sync() or
> > > syncfs() provide the same guarantees as fsync() called on every file
> > > in the system or filesystem respectively." [1]
> >
> > Actually as for FUSE, IIUC the writeback is not guaranteed to be
> > completed when sync(2) returns since the temp page mechanism.  When
> > sync(2) returns, PG_writeback is indeed cleared for all original pages
> > (in the address_space), while the real writeback work (initiated from
> > temp page) may be still in progress.
> >
>
> That's a great point. It seems like we can just skip waiting on
> writeback to finish for fuse folios in sync(2) altogether then. I'll
> look into what's the best way to do this.

I think the most straightforward way to do this for sync(2) is to add
the mapping check inside sync_bdevs(). With something like:

diff --git a/block/bdev.c b/block/bdev.c
index 738e3c8457e7..bcb2b6d3db94 100644
--- a/block/bdev.c
+++ b/block/bdev.c
@@ -1247,7 +1247,7 @@ void sync_bdevs(bool wait)
                mutex_lock(&bdev->bd_disk->open_mutex);
                if (!atomic_read(&bdev->bd_openers)) {
                        ; /* skip */
-               } else if (wait) {
+               } else if (wait &&
!mapping_no_writeback_wait(inode->i_mapping)) {
                        /*
                         * We keep the error status of individual mapping so
                         * that applications can catch the writeback error using


and changing AS_NO_WRITEBACK_RECLAIM to AS_NO_WRITEBACK_WAIT.


Thanks,
Joanne
>
Joanne Koong Oct. 25, 2024, 10:40 p.m. UTC | #32
On Fri, Oct 25, 2024 at 10:36 AM Joanne Koong <joannelkoong@gmail.com> wrote:
>
> On Thu, Oct 24, 2024 at 6:38 PM Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
> >
> >
> >
> > On 10/25/24 12:54 AM, Joanne Koong wrote:
> > > On Mon, Oct 21, 2024 at 2:05 PM Joanne Koong <joannelkoong@gmail.com> wrote:
> > >>
> > >> On Mon, Oct 21, 2024 at 3:15 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
> > >>>
> > >>> On Fri, 18 Oct 2024 at 07:31, Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > >>>
> > >>>> I feel like this is too much restrictive and I am still not sure why
> > >>>> blocking on fuse folios served by non-privileges fuse server is worse
> > >>>> than blocking on folios served from the network.
> > >>>
> > >>> Might be.  But historically fuse had this behavior and I'd be very
> > >>> reluctant to change that unconditionally.
> > >>>
> > >>> With a systemwide maximal timeout for fuse requests it might make
> > >>> sense to allow sync(2), etc. to wait for fuse writeback.
> > >>>
> > >>> Without a timeout allowing fuse servers to block sync(2) indefinitely
> > >>> seems rather risky.
> > >>
> > >> Could we skip waiting on writeback in sync(2) if it's a fuse folio?
> > >> That seems in line with the sync(2) documentation Jingbo referenced
> > >> earlier where it states "The writing, although scheduled, is not
> > >> necessarily complete upon return from sync()."
> > >> https://pubs.opengroup.org/onlinepubs/9699919799/functions/sync.html
> > >>
> > >
> > > So I think the answer to this is "no" for Linux. What the Linux man
> > > page for sync(2) says:
> > >
> > > "According to the standard specification (e.g., POSIX.1-2001), sync()
> > > schedules the writes, but may return before the actual writing is
> > > done.  However Linux waits for I/O completions, and thus sync() or
> > > syncfs() provide the same guarantees as fsync() called on every file
> > > in the system or filesystem respectively." [1]
> >
> > Actually as for FUSE, IIUC the writeback is not guaranteed to be
> > completed when sync(2) returns since the temp page mechanism.  When
> > sync(2) returns, PG_writeback is indeed cleared for all original pages
> > (in the address_space), while the real writeback work (initiated from
> > temp page) may be still in progress.
> >
>
> That's a great point. It seems like we can just skip waiting on
> writeback to finish for fuse folios in sync(2) altogether then. I'll
> look into what's the best way to do this.
>
> > I think this is also what Miklos means in:
> > https://lore.kernel.org/all/CAJfpegsJKD4YT5R5qfXXE=hyqKvhpTRbD4m1wsYNbGB6k4rC2A@mail.gmail.com/
> >
> > Though we need special handling for AS_NO_WRITEBACK_RECLAIM marked pages
> > in sync(2) codepath similar to what we have done for the direct reclaim
> > in patch 1.
> >
> >
> > >
> > > Regardless of the compaction / page migration issue then, this
> > > blocking sync(2) is a dealbreaker.
> >
> > I really should have figureg out the compaction / page migration
> > mechanism and the potential impact to FUSE when we dropping the temp
> > page.  Just too busy to take some time on this though.....
>
> Same here, I need to look some more into the compaction / page
> migration paths. I'm planning to do this early next week and will
> report back with what I find.
>

These are my notes so far:

* We hit the folio_wait_writeback() path when callers call
migrate_pages() with mode MIGRATE_SYNC
   ... -> migrate_pages() -> migrate_pages_sync() ->
migrate_pages_batch() -> migrate_folio_unmap() ->
folio_wait_writeback()

* These are the places where we call migrate_pages():
1) demote_folio_list()
Can ignore this. It calls migrate_pages() in MIGRATE_ASYNC mode

2) __damon_pa_migrate_folio_list()
Can ignore this. It calls migrate_pages() in MIGRATE_ASYNC mode

3) migrate_misplaced_folio()
Can ignore this. It calls migrate_pages() in MIGRATE_ASYNC mode

4) do_move_pages_to_node()
Can ignore this. This calls migrate_pages() in MIGRATE_SYNC mode but
this path is only invoked by the move_pages() syscall. It's fine to
wait on writeback for the move_pages() syscall since the user would
have to deliberately invoke this on the fuse server for this to apply
to the server's fuse folios

5)  migrate_to_node()
Can ignore this for the same reason as in 4. This path is only invoked
by the migrate_pages() syscall.

6) do_mbind()
Can ignore this for the same reason as 4 and 5. This path is only
invoked by the mbind() syscall.

7) soft_offline_in_use_page()
Can skip soft offlining fuse folios (eg folios with the
AS_NO_WRITEBACK_WAIT mapping flag set).
The path for this is soft_offline_page() -> soft_offline_in_use_page()
-> migrate_pages(). soft_offline_page() only invokes this for in-use
pages in a well-defined state (see ret value of get_hwpoison_page()).
My understanding of soft offlining pages is that it's a mitigation
strategy for handling pages that are experiencing errors but are not
yet completely unusable, and its main purpose is to prevent future
issues. It seems fine to skip this for fuse folios.

8) do_migrate_range()
9) compact_zone()
10) migrate_longterm_unpinnable_folios()
11) __alloc_contig_migrate_range()

8 to 11 needs more investigation / thinking about. I don't see a good
way around these tbh. I think we have to operate under the assumption
that the fuse server running is malicious or benevolently but
incorrectly written and could possibly never complete writeback. So we
definitely can't wait on these but it also doesn't seem like we can
skip waiting on these, especially for the case where the server uses
spliced pages, nor does it seem like we can just fail these with
-EBUSY or something.

Will continue looking more into this early next week.

Thanks,
Joanne
>
> Thanks,
> Joanne
> >
> >
> > --
> > Thanks,
> > Jingbo
Jingbo Xu Oct. 28, 2024, 2:02 a.m. UTC | #33
On 10/26/24 2:19 AM, Joanne Koong wrote:
> On Fri, Oct 25, 2024 at 11:02 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
>>
>> On Fri, 25 Oct 2024 at 19:36, Joanne Koong <joannelkoong@gmail.com> wrote:
>>
>>> That's a great point. It seems like we can just skip waiting on
>>> writeback to finish for fuse folios in sync(2) altogether then. I'll
>>> look into what's the best way to do this.
>>
>> I just tested this, and it turns out this doesn't quite work the way
>> I'd expected.  I can trigger sync(2) being blocked by a suspended fuse
>> server:
>>
>> task:kworker/u16:3   state:D stack:0     pid:172   tgid:172   ppid:2
>>    flags:0x00004000
>> Workqueue: writeback wb_workfn (flush-0:30)
>> Call Trace:
>>  __schedule+0x40b/0xad0
>>  schedule+0x36/0x120
>>  inode_sleep_on_writeback+0x9d/0xb0
>>  wb_writeback+0x104/0x3d0
>>  wb_workfn+0x325/0x490
>>  process_one_work+0x1d8/0x520
>>  worker_thread+0x1af/0x390
>>  kthread+0xcc/0x100
>>  ret_from_fork+0x2d/0x50
>>  ret_from_fork_asm+0x1a/0x30
>>
>> task:dd              state:S stack:0     pid:1364  tgid:1364
>> ppid:1336   flags:0x00000002
>> Call Trace:
>>  __schedule+0x40b/0xad0
>>  schedule+0x36/0x120
>>  request_wait_answer+0x16b/0x200
>>  __fuse_simple_request+0xd6/0x290
>>  fuse_flush_times+0x119/0x140
>>  fuse_write_inode+0x6d/0xc0
>>  __writeback_single_inode+0x36d/0x480
>>  writeback_single_inode+0xa8/0x170
>>  write_inode_now+0x75/0xa0
>>  fuse_flush+0x85/0x1c0
>>  filp_flush+0x2c/0x70
>>  __x64_sys_close+0x2e/0x80
>>  do_syscall_64+0x64/0x140
>>  entry_SYSCALL_64_after_hwframe+0x76/0x7e
>>
>> task:sync            state:D stack:0     pid:1365  tgid:1365
>> ppid:1336   flags:0x00004002
>> Call Trace:
>>  __schedule+0x40b/0xad0
>>  schedule+0x36/0x120
>>  wb_wait_for_completion+0x56/0x80
>>  sync_inodes_sb+0xc5/0x450
>>  iterate_supers+0x69/0xd0
>>  ksys_sync+0x40/0xa0
>>  __do_sys_sync+0xa/0x20
>>  do_syscall_64+0x64/0x140
>>  entry_SYSCALL_64_after_hwframe+0x76/0x7e
>>
> 
> Thanks for the trace. If i'm understanding it correctly, this only
> blocks temporarily until the writeback wb_workfn is rescheduled?
> 
>> Maybe I'm too paranoid about this, and in practice we can just let
>> sync(2) block in any case.  But I want to understand this better
> 
> in the case of a malicious fuse server, it could block sync(2) forever
> so I think this means we have to skip the wait for fuse folios
> altogether.

Right. In summary we need to skip waiting on the completion for the
writeback in case of a malicious fuse server could block sync(2)
forever, and meanwhile the change won't break the backward compatibility
as the current fuse implementation also doesn't wait for the data
actually being written back and persistent on the backstore in sync(2).
Jingbo Xu Oct. 28, 2024, 2:28 a.m. UTC | #34
On 10/26/24 2:47 AM, Joanne Koong wrote:
> On Fri, Oct 25, 2024 at 10:36 AM Joanne Koong <joannelkoong@gmail.com> wrote:
>>
>> On Thu, Oct 24, 2024 at 6:38 PM Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
>>>
>>>
>>>
>>> On 10/25/24 12:54 AM, Joanne Koong wrote:
>>>> On Mon, Oct 21, 2024 at 2:05 PM Joanne Koong <joannelkoong@gmail.com> wrote:
>>>>>
>>>>> On Mon, Oct 21, 2024 at 3:15 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
>>>>>>
>>>>>> On Fri, 18 Oct 2024 at 07:31, Shakeel Butt <shakeel.butt@linux.dev> wrote:
>>>>>>
>>>>>>> I feel like this is too much restrictive and I am still not sure why
>>>>>>> blocking on fuse folios served by non-privileges fuse server is worse
>>>>>>> than blocking on folios served from the network.
>>>>>>
>>>>>> Might be.  But historically fuse had this behavior and I'd be very
>>>>>> reluctant to change that unconditionally.
>>>>>>
>>>>>> With a systemwide maximal timeout for fuse requests it might make
>>>>>> sense to allow sync(2), etc. to wait for fuse writeback.
>>>>>>
>>>>>> Without a timeout allowing fuse servers to block sync(2) indefinitely
>>>>>> seems rather risky.
>>>>>
>>>>> Could we skip waiting on writeback in sync(2) if it's a fuse folio?
>>>>> That seems in line with the sync(2) documentation Jingbo referenced
>>>>> earlier where it states "The writing, although scheduled, is not
>>>>> necessarily complete upon return from sync()."
>>>>> https://pubs.opengroup.org/onlinepubs/9699919799/functions/sync.html
>>>>>
>>>>
>>>> So I think the answer to this is "no" for Linux. What the Linux man
>>>> page for sync(2) says:
>>>>
>>>> "According to the standard specification (e.g., POSIX.1-2001), sync()
>>>> schedules the writes, but may return before the actual writing is
>>>> done.  However Linux waits for I/O completions, and thus sync() or
>>>> syncfs() provide the same guarantees as fsync() called on every file
>>>> in the system or filesystem respectively." [1]
>>>
>>> Actually as for FUSE, IIUC the writeback is not guaranteed to be
>>> completed when sync(2) returns since the temp page mechanism.  When
>>> sync(2) returns, PG_writeback is indeed cleared for all original pages
>>> (in the address_space), while the real writeback work (initiated from
>>> temp page) may be still in progress.
>>>
>>
>> That's a great point. It seems like we can just skip waiting on
>> writeback to finish for fuse folios in sync(2) altogether then. I'll
>> look into what's the best way to do this.
> 
> I think the most straightforward way to do this for sync(2) is to add
> the mapping check inside sync_bdevs(). With something like:
> 
> diff --git a/block/bdev.c b/block/bdev.c
> index 738e3c8457e7..bcb2b6d3db94 100644
> --- a/block/bdev.c
> +++ b/block/bdev.c
> @@ -1247,7 +1247,7 @@ void sync_bdevs(bool wait)
>                 mutex_lock(&bdev->bd_disk->open_mutex);
>                 if (!atomic_read(&bdev->bd_openers)) {
>                         ; /* skip */
> -               } else if (wait) {
> +               } else if (wait &&
> !mapping_no_writeback_wait(inode->i_mapping)) {
>                         /*
>                          * We keep the error status of individual mapping so
>                          * that applications can catch the writeback error using
> 
> 

I'm afraid we are waiting in wait_sb_inodes (ksys_sync -> sync_inodes_sb
-> wait_sb_inodes) rather than sync_bdevs.  sync_bdevs() is used to
writeback and sync the metadata residing on the block device directly
such as the superblock.  It is sync_inodes_one_sb() that actually
writeback inodes.
Joanne Koong Oct. 28, 2024, 9:57 p.m. UTC | #35
On Sun, Oct 27, 2024 at 7:28 PM Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
>
>
>
> On 10/26/24 2:47 AM, Joanne Koong wrote:
> > On Fri, Oct 25, 2024 at 10:36 AM Joanne Koong <joannelkoong@gmail.com> wrote:
> >>
> >> On Thu, Oct 24, 2024 at 6:38 PM Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
> >>>
> >>>
> >>>
> >>> On 10/25/24 12:54 AM, Joanne Koong wrote:
> >>>> On Mon, Oct 21, 2024 at 2:05 PM Joanne Koong <joannelkoong@gmail.com> wrote:
> >>>>>
> >>>>> On Mon, Oct 21, 2024 at 3:15 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
> >>>>>>
> >>>>>> On Fri, 18 Oct 2024 at 07:31, Shakeel Butt <shakeel.butt@linux.dev> wrote:
> >>>>>>
> >>>>>>> I feel like this is too much restrictive and I am still not sure why
> >>>>>>> blocking on fuse folios served by non-privileges fuse server is worse
> >>>>>>> than blocking on folios served from the network.
> >>>>>>
> >>>>>> Might be.  But historically fuse had this behavior and I'd be very
> >>>>>> reluctant to change that unconditionally.
> >>>>>>
> >>>>>> With a systemwide maximal timeout for fuse requests it might make
> >>>>>> sense to allow sync(2), etc. to wait for fuse writeback.
> >>>>>>
> >>>>>> Without a timeout allowing fuse servers to block sync(2) indefinitely
> >>>>>> seems rather risky.
> >>>>>
> >>>>> Could we skip waiting on writeback in sync(2) if it's a fuse folio?
> >>>>> That seems in line with the sync(2) documentation Jingbo referenced
> >>>>> earlier where it states "The writing, although scheduled, is not
> >>>>> necessarily complete upon return from sync()."
> >>>>> https://pubs.opengroup.org/onlinepubs/9699919799/functions/sync.html
> >>>>>
> >>>>
> >>>> So I think the answer to this is "no" for Linux. What the Linux man
> >>>> page for sync(2) says:
> >>>>
> >>>> "According to the standard specification (e.g., POSIX.1-2001), sync()
> >>>> schedules the writes, but may return before the actual writing is
> >>>> done.  However Linux waits for I/O completions, and thus sync() or
> >>>> syncfs() provide the same guarantees as fsync() called on every file
> >>>> in the system or filesystem respectively." [1]
> >>>
> >>> Actually as for FUSE, IIUC the writeback is not guaranteed to be
> >>> completed when sync(2) returns since the temp page mechanism.  When
> >>> sync(2) returns, PG_writeback is indeed cleared for all original pages
> >>> (in the address_space), while the real writeback work (initiated from
> >>> temp page) may be still in progress.
> >>>
> >>
> >> That's a great point. It seems like we can just skip waiting on
> >> writeback to finish for fuse folios in sync(2) altogether then. I'll
> >> look into what's the best way to do this.
> >
> > I think the most straightforward way to do this for sync(2) is to add
> > the mapping check inside sync_bdevs(). With something like:
> >
> > diff --git a/block/bdev.c b/block/bdev.c
> > index 738e3c8457e7..bcb2b6d3db94 100644
> > --- a/block/bdev.c
> > +++ b/block/bdev.c
> > @@ -1247,7 +1247,7 @@ void sync_bdevs(bool wait)
> >                 mutex_lock(&bdev->bd_disk->open_mutex);
> >                 if (!atomic_read(&bdev->bd_openers)) {
> >                         ; /* skip */
> > -               } else if (wait) {
> > +               } else if (wait &&
> > !mapping_no_writeback_wait(inode->i_mapping)) {
> >                         /*
> >                          * We keep the error status of individual mapping so
> >                          * that applications can catch the writeback error using
> >
> >
>
> I'm afraid we are waiting in wait_sb_inodes (ksys_sync -> sync_inodes_sb
> -> wait_sb_inodes) rather than sync_bdevs.  sync_bdevs() is used to
> writeback and sync the metadata residing on the block device directly
> such as the superblock.  It is sync_inodes_one_sb() that actually
> writeback inodes.
>

Great point, thanks for the info!

>
> --
> Thanks,
> Jingbo
Joanne Koong Oct. 28, 2024, 9:58 p.m. UTC | #36
On Fri, Oct 25, 2024 at 3:40 PM Joanne Koong <joannelkoong@gmail.com> wrote:
>
> On Fri, Oct 25, 2024 at 10:36 AM Joanne Koong <joannelkoong@gmail.com> wrote:
> >
> > On Thu, Oct 24, 2024 at 6:38 PM Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
> > >
> > >
> > >
> > > On 10/25/24 12:54 AM, Joanne Koong wrote:
> > > > On Mon, Oct 21, 2024 at 2:05 PM Joanne Koong <joannelkoong@gmail.com> wrote:
> > > >>
> > > >> On Mon, Oct 21, 2024 at 3:15 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
> > > >>>
> > > >>> On Fri, 18 Oct 2024 at 07:31, Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > > >>>
> > > >>>> I feel like this is too much restrictive and I am still not sure why
> > > >>>> blocking on fuse folios served by non-privileges fuse server is worse
> > > >>>> than blocking on folios served from the network.
> > > >>>
> > > >>> Might be.  But historically fuse had this behavior and I'd be very
> > > >>> reluctant to change that unconditionally.
> > > >>>
> > > >>> With a systemwide maximal timeout for fuse requests it might make
> > > >>> sense to allow sync(2), etc. to wait for fuse writeback.
> > > >>>
> > > >>> Without a timeout allowing fuse servers to block sync(2) indefinitely
> > > >>> seems rather risky.
> > > >>
> > > >> Could we skip waiting on writeback in sync(2) if it's a fuse folio?
> > > >> That seems in line with the sync(2) documentation Jingbo referenced
> > > >> earlier where it states "The writing, although scheduled, is not
> > > >> necessarily complete upon return from sync()."
> > > >> https://pubs.opengroup.org/onlinepubs/9699919799/functions/sync.html
> > > >>
> > > >
> > > > So I think the answer to this is "no" for Linux. What the Linux man
> > > > page for sync(2) says:
> > > >
> > > > "According to the standard specification (e.g., POSIX.1-2001), sync()
> > > > schedules the writes, but may return before the actual writing is
> > > > done.  However Linux waits for I/O completions, and thus sync() or
> > > > syncfs() provide the same guarantees as fsync() called on every file
> > > > in the system or filesystem respectively." [1]
> > >
> > > Actually as for FUSE, IIUC the writeback is not guaranteed to be
> > > completed when sync(2) returns since the temp page mechanism.  When
> > > sync(2) returns, PG_writeback is indeed cleared for all original pages
> > > (in the address_space), while the real writeback work (initiated from
> > > temp page) may be still in progress.
> > >
> >
> > That's a great point. It seems like we can just skip waiting on
> > writeback to finish for fuse folios in sync(2) altogether then. I'll
> > look into what's the best way to do this.
> >
> > > I think this is also what Miklos means in:
> > > https://lore.kernel.org/all/CAJfpegsJKD4YT5R5qfXXE=hyqKvhpTRbD4m1wsYNbGB6k4rC2A@mail.gmail.com/
> > >
> > > Though we need special handling for AS_NO_WRITEBACK_RECLAIM marked pages
> > > in sync(2) codepath similar to what we have done for the direct reclaim
> > > in patch 1.
> > >
> > >
> > > >
> > > > Regardless of the compaction / page migration issue then, this
> > > > blocking sync(2) is a dealbreaker.
> > >
> > > I really should have figureg out the compaction / page migration
> > > mechanism and the potential impact to FUSE when we dropping the temp
> > > page.  Just too busy to take some time on this though.....
> >
> > Same here, I need to look some more into the compaction / page
> > migration paths. I'm planning to do this early next week and will
> > report back with what I find.
> >
>
> These are my notes so far:
>
> * We hit the folio_wait_writeback() path when callers call
> migrate_pages() with mode MIGRATE_SYNC
>    ... -> migrate_pages() -> migrate_pages_sync() ->
> migrate_pages_batch() -> migrate_folio_unmap() ->
> folio_wait_writeback()
>
> * These are the places where we call migrate_pages():
> 1) demote_folio_list()
> Can ignore this. It calls migrate_pages() in MIGRATE_ASYNC mode
>
> 2) __damon_pa_migrate_folio_list()
> Can ignore this. It calls migrate_pages() in MIGRATE_ASYNC mode
>
> 3) migrate_misplaced_folio()
> Can ignore this. It calls migrate_pages() in MIGRATE_ASYNC mode
>
> 4) do_move_pages_to_node()
> Can ignore this. This calls migrate_pages() in MIGRATE_SYNC mode but
> this path is only invoked by the move_pages() syscall. It's fine to
> wait on writeback for the move_pages() syscall since the user would
> have to deliberately invoke this on the fuse server for this to apply
> to the server's fuse folios
>
> 5)  migrate_to_node()
> Can ignore this for the same reason as in 4. This path is only invoked
> by the migrate_pages() syscall.
>
> 6) do_mbind()
> Can ignore this for the same reason as 4 and 5. This path is only
> invoked by the mbind() syscall.
>
> 7) soft_offline_in_use_page()
> Can skip soft offlining fuse folios (eg folios with the
> AS_NO_WRITEBACK_WAIT mapping flag set).
> The path for this is soft_offline_page() -> soft_offline_in_use_page()
> -> migrate_pages(). soft_offline_page() only invokes this for in-use
> pages in a well-defined state (see ret value of get_hwpoison_page()).
> My understanding of soft offlining pages is that it's a mitigation
> strategy for handling pages that are experiencing errors but are not
> yet completely unusable, and its main purpose is to prevent future
> issues. It seems fine to skip this for fuse folios.
>
> 8) do_migrate_range()
> 9) compact_zone()
> 10) migrate_longterm_unpinnable_folios()
> 11) __alloc_contig_migrate_range()
>
> 8 to 11 needs more investigation / thinking about. I don't see a good
> way around these tbh. I think we have to operate under the assumption
> that the fuse server running is malicious or benevolently but
> incorrectly written and could possibly never complete writeback. So we
> definitely can't wait on these but it also doesn't seem like we can
> skip waiting on these, especially for the case where the server uses
> spliced pages, nor does it seem like we can just fail these with
> -EBUSY or something.
>

I'm still not seeing a good way around this.

What about this then? We add a new fuse sysctl called something like
"/proc/sys/fs/fuse/writeback_optimization_timeout" where if the sys
admin sets this, then it opts into optimizing writeback to be as fast
as possible (eg skipping the page copies) and if the server doesn't
fulfill the writeback by the set timeout value, then the connection is
aborted.

Alternatively, we could also repurpose
/proc/sys/fs/fuse/max_request_timeout from the request timeout
patchset [1] but I like the additional flexibility and explicitness
having the "writeback_optimization_timeout" sysctl gives.

Any thoughts on this?

[1] https://lore.kernel.org/linux-fsdevel/20241011191320.91592-4-joannelkoong@gmail.com/

Thanks,
Joanne

> Will continue looking more into this early next week.
>
> Thanks,
> Joanne
> >
> > Thanks,
> > Joanne
> > >
> > >
> > > --
> > > Thanks,
> > > Jingbo
Bernd Schubert Oct. 29, 2024, 10:04 p.m. UTC | #37
On 10/17/24 15:31, Miklos Szeredi wrote:
> On Wed, 16 Oct 2024 at 23:27, Shakeel Butt <shakeel.butt@linux.dev> wrote:
> 
>> Why is it bad? I can understand fuse server getting blocked on fuse
>> folios is bad but why it is bad for other applications/tasks? I am
>> wondering network filesystems have to handle similar situation then why
>> is it bad just for fuse?
> 
> You need privileges (physical access) to unplug the network cable.
> You don't need privileges (in most setups) to run a fuse server.

Not sure if that is the perfect example, you can also run a local user
space nfs server. I think the real difference is 'fusermount' which
typically has the s-bit set and allows mounts from unprivileged users.
I.e. I really think we should differentiate between 
privileged/non-privileged fuse server tasks.

Thanks,
Bernd
Bernd Schubert Oct. 30, 2024, 9:32 a.m. UTC | #38
On 10/28/24 22:58, Joanne Koong wrote:
> On Fri, Oct 25, 2024 at 3:40 PM Joanne Koong <joannelkoong@gmail.com> wrote:
>>
>>> Same here, I need to look some more into the compaction / page
>>> migration paths. I'm planning to do this early next week and will
>>> report back with what I find.
>>>
>>
>> These are my notes so far:
>>
>> * We hit the folio_wait_writeback() path when callers call
>> migrate_pages() with mode MIGRATE_SYNC
>>    ... -> migrate_pages() -> migrate_pages_sync() ->
>> migrate_pages_batch() -> migrate_folio_unmap() ->
>> folio_wait_writeback()
>>
>> * These are the places where we call migrate_pages():
>> 1) demote_folio_list()
>> Can ignore this. It calls migrate_pages() in MIGRATE_ASYNC mode
>>
>> 2) __damon_pa_migrate_folio_list()
>> Can ignore this. It calls migrate_pages() in MIGRATE_ASYNC mode
>>
>> 3) migrate_misplaced_folio()
>> Can ignore this. It calls migrate_pages() in MIGRATE_ASYNC mode
>>
>> 4) do_move_pages_to_node()
>> Can ignore this. This calls migrate_pages() in MIGRATE_SYNC mode but
>> this path is only invoked by the move_pages() syscall. It's fine to
>> wait on writeback for the move_pages() syscall since the user would
>> have to deliberately invoke this on the fuse server for this to apply
>> to the server's fuse folios
>>
>> 5)  migrate_to_node()
>> Can ignore this for the same reason as in 4. This path is only invoked
>> by the migrate_pages() syscall.
>>
>> 6) do_mbind()
>> Can ignore this for the same reason as 4 and 5. This path is only
>> invoked by the mbind() syscall.
>>
>> 7) soft_offline_in_use_page()
>> Can skip soft offlining fuse folios (eg folios with the
>> AS_NO_WRITEBACK_WAIT mapping flag set).
>> The path for this is soft_offline_page() -> soft_offline_in_use_page()
>> -> migrate_pages(). soft_offline_page() only invokes this for in-use
>> pages in a well-defined state (see ret value of get_hwpoison_page()).
>> My understanding of soft offlining pages is that it's a mitigation
>> strategy for handling pages that are experiencing errors but are not
>> yet completely unusable, and its main purpose is to prevent future
>> issues. It seems fine to skip this for fuse folios.
>>
>> 8) do_migrate_range()
>> 9) compact_zone()
>> 10) migrate_longterm_unpinnable_folios()
>> 11) __alloc_contig_migrate_range()
>>
>> 8 to 11 needs more investigation / thinking about. I don't see a good
>> way around these tbh. I think we have to operate under the assumption
>> that the fuse server running is malicious or benevolently but
>> incorrectly written and could possibly never complete writeback. So we
>> definitely can't wait on these but it also doesn't seem like we can
>> skip waiting on these, especially for the case where the server uses
>> spliced pages, nor does it seem like we can just fail these with
>> -EBUSY or something.

I see some code paths with -EAGAIN in migration. Could you explain why
we can't just fail migration for fuse write-back pages?

>>
> 
> I'm still not seeing a good way around this.
> 
> What about this then? We add a new fuse sysctl called something like
> "/proc/sys/fs/fuse/writeback_optimization_timeout" where if the sys
> admin sets this, then it opts into optimizing writeback to be as fast
> as possible (eg skipping the page copies) and if the server doesn't
> fulfill the writeback by the set timeout value, then the connection is
> aborted.
> 
> Alternatively, we could also repurpose
> /proc/sys/fs/fuse/max_request_timeout from the request timeout
> patchset [1] but I like the additional flexibility and explicitness
> having the "writeback_optimization_timeout" sysctl gives.
> 
> Any thoughts on this?


I'm a bit worried that we might lock up the system until time out is
reached - not ideal. Especially as timeouts are in minutes now. But
even a slightly stuttering video system not be great. I think we
should give users/admin the choice then, if they prefer slow page
copies or fast, but possibly shortly unresponsive system.


Thank,
Bernd
Joanne Koong Oct. 30, 2024, 4:04 p.m. UTC | #39
On Wed, Oct 30, 2024 at 2:32 AM Bernd Schubert
<bernd.schubert@fastmail.fm> wrote:
>
> On 10/28/24 22:58, Joanne Koong wrote:
> > On Fri, Oct 25, 2024 at 3:40 PM Joanne Koong <joannelkoong@gmail.com> wrote:
> >>
> >>> Same here, I need to look some more into the compaction / page
> >>> migration paths. I'm planning to do this early next week and will
> >>> report back with what I find.
> >>>
> >>
> >> These are my notes so far:
> >>
> >> * We hit the folio_wait_writeback() path when callers call
> >> migrate_pages() with mode MIGRATE_SYNC
> >>    ... -> migrate_pages() -> migrate_pages_sync() ->
> >> migrate_pages_batch() -> migrate_folio_unmap() ->
> >> folio_wait_writeback()
> >>
> >> * These are the places where we call migrate_pages():
> >> 1) demote_folio_list()
> >> Can ignore this. It calls migrate_pages() in MIGRATE_ASYNC mode
> >>
> >> 2) __damon_pa_migrate_folio_list()
> >> Can ignore this. It calls migrate_pages() in MIGRATE_ASYNC mode
> >>
> >> 3) migrate_misplaced_folio()
> >> Can ignore this. It calls migrate_pages() in MIGRATE_ASYNC mode
> >>
> >> 4) do_move_pages_to_node()
> >> Can ignore this. This calls migrate_pages() in MIGRATE_SYNC mode but
> >> this path is only invoked by the move_pages() syscall. It's fine to
> >> wait on writeback for the move_pages() syscall since the user would
> >> have to deliberately invoke this on the fuse server for this to apply
> >> to the server's fuse folios
> >>
> >> 5)  migrate_to_node()
> >> Can ignore this for the same reason as in 4. This path is only invoked
> >> by the migrate_pages() syscall.
> >>
> >> 6) do_mbind()
> >> Can ignore this for the same reason as 4 and 5. This path is only
> >> invoked by the mbind() syscall.
> >>
> >> 7) soft_offline_in_use_page()
> >> Can skip soft offlining fuse folios (eg folios with the
> >> AS_NO_WRITEBACK_WAIT mapping flag set).
> >> The path for this is soft_offline_page() -> soft_offline_in_use_page()
> >> -> migrate_pages(). soft_offline_page() only invokes this for in-use
> >> pages in a well-defined state (see ret value of get_hwpoison_page()).
> >> My understanding of soft offlining pages is that it's a mitigation
> >> strategy for handling pages that are experiencing errors but are not
> >> yet completely unusable, and its main purpose is to prevent future
> >> issues. It seems fine to skip this for fuse folios.
> >>
> >> 8) do_migrate_range()
> >> 9) compact_zone()
> >> 10) migrate_longterm_unpinnable_folios()
> >> 11) __alloc_contig_migrate_range()
> >>
> >> 8 to 11 needs more investigation / thinking about. I don't see a good
> >> way around these tbh. I think we have to operate under the assumption
> >> that the fuse server running is malicious or benevolently but
> >> incorrectly written and could possibly never complete writeback. So we
> >> definitely can't wait on these but it also doesn't seem like we can
> >> skip waiting on these, especially for the case where the server uses
> >> spliced pages, nor does it seem like we can just fail these with
> >> -EBUSY or something.
>
> I see some code paths with -EAGAIN in migration. Could you explain why
> we can't just fail migration for fuse write-back pages?
>

My understanding (and please correct me here Shakeel if I'm wrong) is
that this could block system optimizations, especially since if an
unprivileged malicious fuse server never replies to the writeback
request, then this completely stalls progress. In the best case
scenario, -EAGAIN could be used because the server might just be slow
in serving the writeback, but I think we need to also account for
servers that never complete the writeback. For
__alloc_contig_migrate_range() for example, my understanding is that
this is used to migrate pages so that there are more physically
contiguous ranges of memory freed up. If fuse writeback blocks that,
then that hurts system health overall.

> >>
> >
> > I'm still not seeing a good way around this.
> >
> > What about this then? We add a new fuse sysctl called something like
> > "/proc/sys/fs/fuse/writeback_optimization_timeout" where if the sys
> > admin sets this, then it opts into optimizing writeback to be as fast
> > as possible (eg skipping the page copies) and if the server doesn't
> > fulfill the writeback by the set timeout value, then the connection is
> > aborted.
> >
> > Alternatively, we could also repurpose
> > /proc/sys/fs/fuse/max_request_timeout from the request timeout
> > patchset [1] but I like the additional flexibility and explicitness
> > having the "writeback_optimization_timeout" sysctl gives.
> >
> > Any thoughts on this?
>
>
> I'm a bit worried that we might lock up the system until time out is
> reached - not ideal. Especially as timeouts are in minutes now. But
> even a slightly stuttering video system not be great. I think we
> should give users/admin the choice then, if they prefer slow page
> copies or fast, but possibly shortly unresponsive system.
>
I was thinking the /proc/sys/fs/fuse/writeback_optimization_timeout
would be in seconds, where the sys admin would probably set something
more reasonable like 5 seconds or so.
If this syctl value is set, then servers who want writebacks to be
fast can opt into it at mount time (and by doing so agree that they
will service writeback requests by the timeout or their connection
will be aborted).


Thanks,
Joanne
>
> Thank,
> Bernd
Bernd Schubert Oct. 30, 2024, 4:21 p.m. UTC | #40
On 10/30/24 17:04, Joanne Koong wrote:
> On Wed, Oct 30, 2024 at 2:32 AM Bernd Schubert
> <bernd.schubert@fastmail.fm> wrote:
>>
>> On 10/28/24 22:58, Joanne Koong wrote:
>>> On Fri, Oct 25, 2024 at 3:40 PM Joanne Koong <joannelkoong@gmail.com> wrote:
>>>>
>>>>> Same here, I need to look some more into the compaction / page
>>>>> migration paths. I'm planning to do this early next week and will
>>>>> report back with what I find.
>>>>>
>>>>
>>>> These are my notes so far:
>>>>
>>>> * We hit the folio_wait_writeback() path when callers call
>>>> migrate_pages() with mode MIGRATE_SYNC
>>>>    ... -> migrate_pages() -> migrate_pages_sync() ->
>>>> migrate_pages_batch() -> migrate_folio_unmap() ->
>>>> folio_wait_writeback()
>>>>
>>>> * These are the places where we call migrate_pages():
>>>> 1) demote_folio_list()
>>>> Can ignore this. It calls migrate_pages() in MIGRATE_ASYNC mode
>>>>
>>>> 2) __damon_pa_migrate_folio_list()
>>>> Can ignore this. It calls migrate_pages() in MIGRATE_ASYNC mode
>>>>
>>>> 3) migrate_misplaced_folio()
>>>> Can ignore this. It calls migrate_pages() in MIGRATE_ASYNC mode
>>>>
>>>> 4) do_move_pages_to_node()
>>>> Can ignore this. This calls migrate_pages() in MIGRATE_SYNC mode but
>>>> this path is only invoked by the move_pages() syscall. It's fine to
>>>> wait on writeback for the move_pages() syscall since the user would
>>>> have to deliberately invoke this on the fuse server for this to apply
>>>> to the server's fuse folios
>>>>
>>>> 5)  migrate_to_node()
>>>> Can ignore this for the same reason as in 4. This path is only invoked
>>>> by the migrate_pages() syscall.
>>>>
>>>> 6) do_mbind()
>>>> Can ignore this for the same reason as 4 and 5. This path is only
>>>> invoked by the mbind() syscall.
>>>>
>>>> 7) soft_offline_in_use_page()
>>>> Can skip soft offlining fuse folios (eg folios with the
>>>> AS_NO_WRITEBACK_WAIT mapping flag set).
>>>> The path for this is soft_offline_page() -> soft_offline_in_use_page()
>>>> -> migrate_pages(). soft_offline_page() only invokes this for in-use
>>>> pages in a well-defined state (see ret value of get_hwpoison_page()).
>>>> My understanding of soft offlining pages is that it's a mitigation
>>>> strategy for handling pages that are experiencing errors but are not
>>>> yet completely unusable, and its main purpose is to prevent future
>>>> issues. It seems fine to skip this for fuse folios.
>>>>
>>>> 8) do_migrate_range()
>>>> 9) compact_zone()
>>>> 10) migrate_longterm_unpinnable_folios()
>>>> 11) __alloc_contig_migrate_range()
>>>>
>>>> 8 to 11 needs more investigation / thinking about. I don't see a good
>>>> way around these tbh. I think we have to operate under the assumption
>>>> that the fuse server running is malicious or benevolently but
>>>> incorrectly written and could possibly never complete writeback. So we
>>>> definitely can't wait on these but it also doesn't seem like we can
>>>> skip waiting on these, especially for the case where the server uses
>>>> spliced pages, nor does it seem like we can just fail these with
>>>> -EBUSY or something.
>>
>> I see some code paths with -EAGAIN in migration. Could you explain why
>> we can't just fail migration for fuse write-back pages?
>>

Hi Joanne,

thanks a lot for your quick reply (especially as my reviews come in very 
late).

> 
> My understanding (and please correct me here Shakeel if I'm wrong) is
> that this could block system optimizations, especially since if an
> unprivileged malicious fuse server never replies to the writeback
> request, then this completely stalls progress. In the best case
> scenario, -EAGAIN could be used because the server might just be slow
> in serving the writeback, but I think we need to also account for
> servers that never complete the writeback. For
> __alloc_contig_migrate_range() for example, my understanding is that
> this is used to migrate pages so that there are more physically
> contiguous ranges of memory freed up. If fuse writeback blocks that,
> then that hurts system health overall.

Hmm, I wonder what is worse - tmp page copies or missing compaction.
Especially if we expect a low range of in-writeback pages/folios. 
One could argue that an evil user might spawn many fuse server
processes to work around the default low fuse write-back limits, but
does that make any difference with tmp pages? And these cannot be
compacted either?

And with timeouts that would be so far totally uncritical, I
think.


You also mentioned 

> especially for the case where the server uses spliced pages

could you provide more details for that? 



> 
>>>>
>>>
>>> I'm still not seeing a good way around this.
>>>
>>> What about this then? We add a new fuse sysctl called something like
>>> "/proc/sys/fs/fuse/writeback_optimization_timeout" where if the sys
>>> admin sets this, then it opts into optimizing writeback to be as fast
>>> as possible (eg skipping the page copies) and if the server doesn't
>>> fulfill the writeback by the set timeout value, then the connection is
>>> aborted.
>>>
>>> Alternatively, we could also repurpose
>>> /proc/sys/fs/fuse/max_request_timeout from the request timeout
>>> patchset [1] but I like the additional flexibility and explicitness
>>> having the "writeback_optimization_timeout" sysctl gives.
>>>
>>> Any thoughts on this?
>>
>>
>> I'm a bit worried that we might lock up the system until time out is
>> reached - not ideal. Especially as timeouts are in minutes now. But
>> even a slightly stuttering video system not be great. I think we
>> should give users/admin the choice then, if they prefer slow page
>> copies or fast, but possibly shortly unresponsive system.
>>
> I was thinking the /proc/sys/fs/fuse/writeback_optimization_timeout
> would be in seconds, where the sys admin would probably set something
> more reasonable like 5 seconds or so.
> If this syctl value is set, then servers who want writebacks to be
> fast can opt into it at mount time (and by doing so agree that they
> will service writeback requests by the timeout or their connection
> will be aborted).


I think your current patch set has it in minutes? (Should be easy
enough to change that.) Though I'm more worried about the impact
of _frequent_ timeout scanning through the different fuse lists 
on performance, than about missing compaction for folios that are
currently in write-back.


Thanks,
Bernd
Joanne Koong Oct. 30, 2024, 5:02 p.m. UTC | #41
On Wed, Oct 30, 2024 at 9:21 AM Bernd Schubert
<bernd.schubert@fastmail.fm> wrote:
>
> On 10/30/24 17:04, Joanne Koong wrote:
> > On Wed, Oct 30, 2024 at 2:32 AM Bernd Schubert
> > <bernd.schubert@fastmail.fm> wrote:
> >>
> >> On 10/28/24 22:58, Joanne Koong wrote:
> >>> On Fri, Oct 25, 2024 at 3:40 PM Joanne Koong <joannelkoong@gmail.com> wrote:
> >>>>
> >>>>> Same here, I need to look some more into the compaction / page
> >>>>> migration paths. I'm planning to do this early next week and will
> >>>>> report back with what I find.
> >>>>>
> >>>>
> >>>> These are my notes so far:
> >>>>
> >>>> * We hit the folio_wait_writeback() path when callers call
> >>>> migrate_pages() with mode MIGRATE_SYNC
> >>>>    ... -> migrate_pages() -> migrate_pages_sync() ->
> >>>> migrate_pages_batch() -> migrate_folio_unmap() ->
> >>>> folio_wait_writeback()
> >>>>
> >>>> * These are the places where we call migrate_pages():
> >>>> 1) demote_folio_list()
> >>>> Can ignore this. It calls migrate_pages() in MIGRATE_ASYNC mode
> >>>>
> >>>> 2) __damon_pa_migrate_folio_list()
> >>>> Can ignore this. It calls migrate_pages() in MIGRATE_ASYNC mode
> >>>>
> >>>> 3) migrate_misplaced_folio()
> >>>> Can ignore this. It calls migrate_pages() in MIGRATE_ASYNC mode
> >>>>
> >>>> 4) do_move_pages_to_node()
> >>>> Can ignore this. This calls migrate_pages() in MIGRATE_SYNC mode but
> >>>> this path is only invoked by the move_pages() syscall. It's fine to
> >>>> wait on writeback for the move_pages() syscall since the user would
> >>>> have to deliberately invoke this on the fuse server for this to apply
> >>>> to the server's fuse folios
> >>>>
> >>>> 5)  migrate_to_node()
> >>>> Can ignore this for the same reason as in 4. This path is only invoked
> >>>> by the migrate_pages() syscall.
> >>>>
> >>>> 6) do_mbind()
> >>>> Can ignore this for the same reason as 4 and 5. This path is only
> >>>> invoked by the mbind() syscall.
> >>>>
> >>>> 7) soft_offline_in_use_page()
> >>>> Can skip soft offlining fuse folios (eg folios with the
> >>>> AS_NO_WRITEBACK_WAIT mapping flag set).
> >>>> The path for this is soft_offline_page() -> soft_offline_in_use_page()
> >>>> -> migrate_pages(). soft_offline_page() only invokes this for in-use
> >>>> pages in a well-defined state (see ret value of get_hwpoison_page()).
> >>>> My understanding of soft offlining pages is that it's a mitigation
> >>>> strategy for handling pages that are experiencing errors but are not
> >>>> yet completely unusable, and its main purpose is to prevent future
> >>>> issues. It seems fine to skip this for fuse folios.
> >>>>
> >>>> 8) do_migrate_range()
> >>>> 9) compact_zone()
> >>>> 10) migrate_longterm_unpinnable_folios()
> >>>> 11) __alloc_contig_migrate_range()
> >>>>
> >>>> 8 to 11 needs more investigation / thinking about. I don't see a good
> >>>> way around these tbh. I think we have to operate under the assumption
> >>>> that the fuse server running is malicious or benevolently but
> >>>> incorrectly written and could possibly never complete writeback. So we
> >>>> definitely can't wait on these but it also doesn't seem like we can
> >>>> skip waiting on these, especially for the case where the server uses
> >>>> spliced pages, nor does it seem like we can just fail these with
> >>>> -EBUSY or something.
> >>
> >> I see some code paths with -EAGAIN in migration. Could you explain why
> >> we can't just fail migration for fuse write-back pages?
> >>
>
> Hi Joanne,
>
> thanks a lot for your quick reply (especially as my reviews come in very
> late).
>

Thanks for your comments/reviews, Bernd! I always appreciate them.

> >
> > My understanding (and please correct me here Shakeel if I'm wrong) is
> > that this could block system optimizations, especially since if an
> > unprivileged malicious fuse server never replies to the writeback
> > request, then this completely stalls progress. In the best case
> > scenario, -EAGAIN could be used because the server might just be slow
> > in serving the writeback, but I think we need to also account for
> > servers that never complete the writeback. For
> > __alloc_contig_migrate_range() for example, my understanding is that
> > this is used to migrate pages so that there are more physically
> > contiguous ranges of memory freed up. If fuse writeback blocks that,
> > then that hurts system health overall.
>
> Hmm, I wonder what is worse - tmp page copies or missing compaction.
> Especially if we expect a low range of in-writeback pages/folios.
> One could argue that an evil user might spawn many fuse server
> processes to work around the default low fuse write-back limits, but
> does that make any difference with tmp pages? And these cannot be
> compacted either?

My understanding (and Shakeel please jump in here if this isn't right)
is that tmp pages can be migrated/compacted. I think it's only pages
marked as under writeback that are considered to be non-movable.

>
> And with timeouts that would be so far totally uncritical, I
> think.
>
>
> You also mentioned
>
> > especially for the case where the server uses spliced pages
>
> could you provide more details for that?
>

For the page migration / compaction paths, I don't think we can do the
workaround we could do for sync where we skip waiting on writeback for
fuse folios and continue on with the operation, because the migration
/ compaction paths operate on the pages. For the splice case, we
assign the page to the pipebuffer (fuse_ref_page()), so if the
migration/compaction happens on the page before the server has read
this page from the pipebuffer, it'll be incorrect data or maybe crash
the kernel.

>
>
> >
> >>>>
> >>>
> >>> I'm still not seeing a good way around this.
> >>>
> >>> What about this then? We add a new fuse sysctl called something like
> >>> "/proc/sys/fs/fuse/writeback_optimization_timeout" where if the sys
> >>> admin sets this, then it opts into optimizing writeback to be as fast
> >>> as possible (eg skipping the page copies) and if the server doesn't
> >>> fulfill the writeback by the set timeout value, then the connection is
> >>> aborted.
> >>>
> >>> Alternatively, we could also repurpose
> >>> /proc/sys/fs/fuse/max_request_timeout from the request timeout
> >>> patchset [1] but I like the additional flexibility and explicitness
> >>> having the "writeback_optimization_timeout" sysctl gives.
> >>>
> >>> Any thoughts on this?
> >>
> >>
> >> I'm a bit worried that we might lock up the system until time out is
> >> reached - not ideal. Especially as timeouts are in minutes now. But
> >> even a slightly stuttering video system not be great. I think we
> >> should give users/admin the choice then, if they prefer slow page
> >> copies or fast, but possibly shortly unresponsive system.
> >>
> > I was thinking the /proc/sys/fs/fuse/writeback_optimization_timeout
> > would be in seconds, where the sys admin would probably set something
> > more reasonable like 5 seconds or so.
> > If this syctl value is set, then servers who want writebacks to be
> > fast can opt into it at mount time (and by doing so agree that they
> > will service writeback requests by the timeout or their connection
> > will be aborted).
>
>
> I think your current patch set has it in minutes? (Should be easy
> enough to change that.) Though I'm more worried about the impact
> of _frequent_ timeout scanning through the different fuse lists
> on performance, than about missing compaction for folios that are
> currently in write-back.
>

Ah, for this the " /proc/sys/fs/fuse/writeback_optimization_timeout"
would be a separate thing from the
"/proc/sys/fs/fuse/max_request_timeout". The
"/proc/sys/fs/fuse/writeback_optimization_timeout" would only apply
for writeback requests. I was thinking implementation-wise, for
writebacks we could just have a timer associated with each request
(instead of having to grab locks with the fuse lists), since they
won't be super common.


Thanks,
Joanne
>
> Thanks,
> Bernd
Bernd Schubert Oct. 30, 2024, 5:27 p.m. UTC | #42
On 10/30/24 18:02, Joanne Koong wrote:
> On Wed, Oct 30, 2024 at 9:21 AM Bernd Schubert
> <bernd.schubert@fastmail.fm> wrote:
>>
>> On 10/30/24 17:04, Joanne Koong wrote:
>>> On Wed, Oct 30, 2024 at 2:32 AM Bernd Schubert
>>> <bernd.schubert@fastmail.fm> wrote:
>>>>
>>>> On 10/28/24 22:58, Joanne Koong wrote:
>>>>> On Fri, Oct 25, 2024 at 3:40 PM Joanne Koong <joannelkoong@gmail.com> wrote:
>>>>>>
>>>>>>> Same here, I need to look some more into the compaction / page
>>>>>>> migration paths. I'm planning to do this early next week and will
>>>>>>> report back with what I find.
>>>>>>>
>>>>>>
>>>>>> These are my notes so far:
>>>>>>
>>>>>> * We hit the folio_wait_writeback() path when callers call
>>>>>> migrate_pages() with mode MIGRATE_SYNC
>>>>>>    ... -> migrate_pages() -> migrate_pages_sync() ->
>>>>>> migrate_pages_batch() -> migrate_folio_unmap() ->
>>>>>> folio_wait_writeback()
>>>>>>
>>>>>> * These are the places where we call migrate_pages():
>>>>>> 1) demote_folio_list()
>>>>>> Can ignore this. It calls migrate_pages() in MIGRATE_ASYNC mode
>>>>>>
>>>>>> 2) __damon_pa_migrate_folio_list()
>>>>>> Can ignore this. It calls migrate_pages() in MIGRATE_ASYNC mode
>>>>>>
>>>>>> 3) migrate_misplaced_folio()
>>>>>> Can ignore this. It calls migrate_pages() in MIGRATE_ASYNC mode
>>>>>>
>>>>>> 4) do_move_pages_to_node()
>>>>>> Can ignore this. This calls migrate_pages() in MIGRATE_SYNC mode but
>>>>>> this path is only invoked by the move_pages() syscall. It's fine to
>>>>>> wait on writeback for the move_pages() syscall since the user would
>>>>>> have to deliberately invoke this on the fuse server for this to apply
>>>>>> to the server's fuse folios
>>>>>>
>>>>>> 5)  migrate_to_node()
>>>>>> Can ignore this for the same reason as in 4. This path is only invoked
>>>>>> by the migrate_pages() syscall.
>>>>>>
>>>>>> 6) do_mbind()
>>>>>> Can ignore this for the same reason as 4 and 5. This path is only
>>>>>> invoked by the mbind() syscall.
>>>>>>
>>>>>> 7) soft_offline_in_use_page()
>>>>>> Can skip soft offlining fuse folios (eg folios with the
>>>>>> AS_NO_WRITEBACK_WAIT mapping flag set).
>>>>>> The path for this is soft_offline_page() -> soft_offline_in_use_page()
>>>>>> -> migrate_pages(). soft_offline_page() only invokes this for in-use
>>>>>> pages in a well-defined state (see ret value of get_hwpoison_page()).
>>>>>> My understanding of soft offlining pages is that it's a mitigation
>>>>>> strategy for handling pages that are experiencing errors but are not
>>>>>> yet completely unusable, and its main purpose is to prevent future
>>>>>> issues. It seems fine to skip this for fuse folios.
>>>>>>
>>>>>> 8) do_migrate_range()
>>>>>> 9) compact_zone()
>>>>>> 10) migrate_longterm_unpinnable_folios()
>>>>>> 11) __alloc_contig_migrate_range()
>>>>>>
>>>>>> 8 to 11 needs more investigation / thinking about. I don't see a good
>>>>>> way around these tbh. I think we have to operate under the assumption
>>>>>> that the fuse server running is malicious or benevolently but
>>>>>> incorrectly written and could possibly never complete writeback. So we
>>>>>> definitely can't wait on these but it also doesn't seem like we can
>>>>>> skip waiting on these, especially for the case where the server uses
>>>>>> spliced pages, nor does it seem like we can just fail these with
>>>>>> -EBUSY or something.
>>>>
>>>> I see some code paths with -EAGAIN in migration. Could you explain why
>>>> we can't just fail migration for fuse write-back pages?
>>>>
>>
>> Hi Joanne,
>>
>> thanks a lot for your quick reply (especially as my reviews come in very
>> late).
>>
> 
> Thanks for your comments/reviews, Bernd! I always appreciate them.
> 
>>>
>>> My understanding (and please correct me here Shakeel if I'm wrong) is
>>> that this could block system optimizations, especially since if an
>>> unprivileged malicious fuse server never replies to the writeback
>>> request, then this completely stalls progress. In the best case
>>> scenario, -EAGAIN could be used because the server might just be slow
>>> in serving the writeback, but I think we need to also account for
>>> servers that never complete the writeback. For
>>> __alloc_contig_migrate_range() for example, my understanding is that
>>> this is used to migrate pages so that there are more physically
>>> contiguous ranges of memory freed up. If fuse writeback blocks that,
>>> then that hurts system health overall.
>>
>> Hmm, I wonder what is worse - tmp page copies or missing compaction.
>> Especially if we expect a low range of in-writeback pages/folios.
>> One could argue that an evil user might spawn many fuse server
>> processes to work around the default low fuse write-back limits, but
>> does that make any difference with tmp pages? And these cannot be
>> compacted either?
> 
> My understanding (and Shakeel please jump in here if this isn't right)
> is that tmp pages can be migrated/compacted. I think it's only pages
> marked as under writeback that are considered to be non-movable.
> 
>>
>> And with timeouts that would be so far totally uncritical, I
>> think.
>>
>>
>> You also mentioned
>>
>>> especially for the case where the server uses spliced pages
>>
>> could you provide more details for that?
>>
7> 
> For the page migration / compaction paths, I don't think we can do the
> workaround we could do for sync where we skip waiting on writeback for
> fuse folios and continue on with the operation, because the migration
> / compaction paths operate on the pages. For the splice case, we
> assign the page to the pipebuffer (fuse_ref_page()), so if the
> migration/compaction happens on the page before the server has read
> this page from the pipebuffer, it'll be incorrect data or maybe crash
> the kernel.
> 
>>
>>
>>>
>>>>>>
>>>>>
>>>>> I'm still not seeing a good way around this.
>>>>>
>>>>> What about this then? We add a new fuse sysctl called something like
>>>>> "/proc/sys/fs/fuse/writeback_optimization_timeout" where if the sys
>>>>> admin sets this, then it opts into optimizing writeback to be as fast
>>>>> as possible (eg skipping the page copies) and if the server doesn't
>>>>> fulfill the writeback by the set timeout value, then the connection is
>>>>> aborted.
>>>>>
>>>>> Alternatively, we could also repurpose
>>>>> /proc/sys/fs/fuse/max_request_timeout from the request timeout
>>>>> patchset [1] but I like the additional flexibility and explicitness
>>>>> having the "writeback_optimization_timeout" sysctl gives.
>>>>>
>>>>> Any thoughts on this?
>>>>
>>>>
>>>> I'm a bit worried that we might lock up the system until time out is
>>>> reached - not ideal. Especially as timeouts are in minutes now. But
>>>> even a slightly stuttering video system not be great. I think we
>>>> should give users/admin the choice then, if they prefer slow page
>>>> copies or fast, but possibly shortly unresponsive system.
>>>>
>>> I was thinking the /proc/sys/fs/fuse/writeback_optimization_timeout
>>> would be in seconds, where the sys admin would probably set something
>>> more reasonable like 5 seconds or so.
>>> If this syctl value is set, then servers who want writebacks to be
>>> fast can opt into it at mount time (and by doing so agree that they
>>> will service writeback requests by the timeout or their connection
>>> will be aborted).
>>
>>
>> I think your current patch set has it in minutes? (Should be easy
>> enough to change that.) Though I'm more worried about the impact
>> of _frequent_ timeout scanning through the different fuse lists
>> on performance, than about missing compaction for folios that are
>> currently in write-back.

Hmm, if tmp pages can be compacted, isn't that a problem for splice?
I.e. I don't understand what the difference between tmp page and
write-back page for migration.


>>
> 
> Ah, for this the " /proc/sys/fs/fuse/writeback_optimization_timeout"
> would be a separate thing from the
> "/proc/sys/fs/fuse/max_request_timeout". The
> "/proc/sys/fs/fuse/writeback_optimization_timeout" would only apply
> for writeback requests. I was thinking implementation-wise, for
> writebacks we could just have a timer associated with each request
> (instead of having to grab locks with the fuse lists), since they
> won't be super common.

Ah, thank you! I had missed that this is another variable. Issue
with too short timeouts would probably be network hick-up that
would immediately kill fuse-server. I.e. if it just the missing
page compaction/migration, maybe larger time outs would be
acceptable.


Thanks,
Bernd
Joanne Koong Oct. 30, 2024, 5:35 p.m. UTC | #43
On Wed, Oct 30, 2024 at 10:27 AM Bernd Schubert
<bernd.schubert@fastmail.fm> wrote:
>
>
>
> On 10/30/24 18:02, Joanne Koong wrote:
> > On Wed, Oct 30, 2024 at 9:21 AM Bernd Schubert
> > <bernd.schubert@fastmail.fm> wrote:
> >>
> >> On 10/30/24 17:04, Joanne Koong wrote:
> >>> On Wed, Oct 30, 2024 at 2:32 AM Bernd Schubert
> >>> <bernd.schubert@fastmail.fm> wrote:
> >>>>
> >>>> On 10/28/24 22:58, Joanne Koong wrote:
> >>>>> On Fri, Oct 25, 2024 at 3:40 PM Joanne Koong <joannelkoong@gmail.com> wrote:
> >>>>>>
> >>>>>>> Same here, I need to look some more into the compaction / page
> >>>>>>> migration paths. I'm planning to do this early next week and will
> >>>>>>> report back with what I find.
> >>>>>>>
> >>>>>>
> >>>>>> These are my notes so far:
> >>>>>>
> >>>>>> * We hit the folio_wait_writeback() path when callers call
> >>>>>> migrate_pages() with mode MIGRATE_SYNC
> >>>>>>    ... -> migrate_pages() -> migrate_pages_sync() ->
> >>>>>> migrate_pages_batch() -> migrate_folio_unmap() ->
> >>>>>> folio_wait_writeback()
> >>>>>>
> >>>>>> * These are the places where we call migrate_pages():
> >>>>>> 1) demote_folio_list()
> >>>>>> Can ignore this. It calls migrate_pages() in MIGRATE_ASYNC mode
> >>>>>>
> >>>>>> 2) __damon_pa_migrate_folio_list()
> >>>>>> Can ignore this. It calls migrate_pages() in MIGRATE_ASYNC mode
> >>>>>>
> >>>>>> 3) migrate_misplaced_folio()
> >>>>>> Can ignore this. It calls migrate_pages() in MIGRATE_ASYNC mode
> >>>>>>
> >>>>>> 4) do_move_pages_to_node()
> >>>>>> Can ignore this. This calls migrate_pages() in MIGRATE_SYNC mode but
> >>>>>> this path is only invoked by the move_pages() syscall. It's fine to
> >>>>>> wait on writeback for the move_pages() syscall since the user would
> >>>>>> have to deliberately invoke this on the fuse server for this to apply
> >>>>>> to the server's fuse folios
> >>>>>>
> >>>>>> 5)  migrate_to_node()
> >>>>>> Can ignore this for the same reason as in 4. This path is only invoked
> >>>>>> by the migrate_pages() syscall.
> >>>>>>
> >>>>>> 6) do_mbind()
> >>>>>> Can ignore this for the same reason as 4 and 5. This path is only
> >>>>>> invoked by the mbind() syscall.
> >>>>>>
> >>>>>> 7) soft_offline_in_use_page()
> >>>>>> Can skip soft offlining fuse folios (eg folios with the
> >>>>>> AS_NO_WRITEBACK_WAIT mapping flag set).
> >>>>>> The path for this is soft_offline_page() -> soft_offline_in_use_page()
> >>>>>> -> migrate_pages(). soft_offline_page() only invokes this for in-use
> >>>>>> pages in a well-defined state (see ret value of get_hwpoison_page()).
> >>>>>> My understanding of soft offlining pages is that it's a mitigation
> >>>>>> strategy for handling pages that are experiencing errors but are not
> >>>>>> yet completely unusable, and its main purpose is to prevent future
> >>>>>> issues. It seems fine to skip this for fuse folios.
> >>>>>>
> >>>>>> 8) do_migrate_range()
> >>>>>> 9) compact_zone()
> >>>>>> 10) migrate_longterm_unpinnable_folios()
> >>>>>> 11) __alloc_contig_migrate_range()
> >>>>>>
> >>>>>> 8 to 11 needs more investigation / thinking about. I don't see a good
> >>>>>> way around these tbh. I think we have to operate under the assumption
> >>>>>> that the fuse server running is malicious or benevolently but
> >>>>>> incorrectly written and could possibly never complete writeback. So we
> >>>>>> definitely can't wait on these but it also doesn't seem like we can
> >>>>>> skip waiting on these, especially for the case where the server uses
> >>>>>> spliced pages, nor does it seem like we can just fail these with
> >>>>>> -EBUSY or something.
> >>>>
> >>>> I see some code paths with -EAGAIN in migration. Could you explain why
> >>>> we can't just fail migration for fuse write-back pages?
> >>>>
> >>
> >> Hi Joanne,
> >>
> >> thanks a lot for your quick reply (especially as my reviews come in very
> >> late).
> >>
> >
> > Thanks for your comments/reviews, Bernd! I always appreciate them.
> >
> >>>
> >>> My understanding (and please correct me here Shakeel if I'm wrong) is
> >>> that this could block system optimizations, especially since if an
> >>> unprivileged malicious fuse server never replies to the writeback
> >>> request, then this completely stalls progress. In the best case
> >>> scenario, -EAGAIN could be used because the server might just be slow
> >>> in serving the writeback, but I think we need to also account for
> >>> servers that never complete the writeback. For
> >>> __alloc_contig_migrate_range() for example, my understanding is that
> >>> this is used to migrate pages so that there are more physically
> >>> contiguous ranges of memory freed up. If fuse writeback blocks that,
> >>> then that hurts system health overall.
> >>
> >> Hmm, I wonder what is worse - tmp page copies or missing compaction.
> >> Especially if we expect a low range of in-writeback pages/folios.
> >> One could argue that an evil user might spawn many fuse server
> >> processes to work around the default low fuse write-back limits, but
> >> does that make any difference with tmp pages? And these cannot be
> >> compacted either?
> >
> > My understanding (and Shakeel please jump in here if this isn't right)
> > is that tmp pages can be migrated/compacted. I think it's only pages
> > marked as under writeback that are considered to be non-movable.
> >
> >>
> >> And with timeouts that would be so far totally uncritical, I
> >> think.
> >>
> >>
> >> You also mentioned
> >>
> >>> especially for the case where the server uses spliced pages
> >>
> >> could you provide more details for that?
> >>
> 7>
> > For the page migration / compaction paths, I don't think we can do the
> > workaround we could do for sync where we skip waiting on writeback for
> > fuse folios and continue on with the operation, because the migration
> > / compaction paths operate on the pages. For the splice case, we
> > assign the page to the pipebuffer (fuse_ref_page()), so if the
> > migration/compaction happens on the page before the server has read
> > this page from the pipebuffer, it'll be incorrect data or maybe crash
> > the kernel.
> >
> >>
> >>
> >>>
> >>>>>>
> >>>>>
> >>>>> I'm still not seeing a good way around this.
> >>>>>
> >>>>> What about this then? We add a new fuse sysctl called something like
> >>>>> "/proc/sys/fs/fuse/writeback_optimization_timeout" where if the sys
> >>>>> admin sets this, then it opts into optimizing writeback to be as fast
> >>>>> as possible (eg skipping the page copies) and if the server doesn't
> >>>>> fulfill the writeback by the set timeout value, then the connection is
> >>>>> aborted.
> >>>>>
> >>>>> Alternatively, we could also repurpose
> >>>>> /proc/sys/fs/fuse/max_request_timeout from the request timeout
> >>>>> patchset [1] but I like the additional flexibility and explicitness
> >>>>> having the "writeback_optimization_timeout" sysctl gives.
> >>>>>
> >>>>> Any thoughts on this?
> >>>>
> >>>>
> >>>> I'm a bit worried that we might lock up the system until time out is
> >>>> reached - not ideal. Especially as timeouts are in minutes now. But
> >>>> even a slightly stuttering video system not be great. I think we
> >>>> should give users/admin the choice then, if they prefer slow page
> >>>> copies or fast, but possibly shortly unresponsive system.
> >>>>
> >>> I was thinking the /proc/sys/fs/fuse/writeback_optimization_timeout
> >>> would be in seconds, where the sys admin would probably set something
> >>> more reasonable like 5 seconds or so.
> >>> If this syctl value is set, then servers who want writebacks to be
> >>> fast can opt into it at mount time (and by doing so agree that they
> >>> will service writeback requests by the timeout or their connection
> >>> will be aborted).
> >>
> >>
> >> I think your current patch set has it in minutes? (Should be easy
> >> enough to change that.) Though I'm more worried about the impact
> >> of _frequent_ timeout scanning through the different fuse lists
> >> on performance, than about missing compaction for folios that are
> >> currently in write-back.
>
> Hmm, if tmp pages can be compacted, isn't that a problem for splice?
> I.e. I don't understand what the difference between tmp page and
> write-back page for migration.
>

That's a great question! I have no idea how compaction works for pages
being used in splice. Shakeel, do you know the answer to this?


Thanks,
Joanne

>
> >>
> >
> > Ah, for this the " /proc/sys/fs/fuse/writeback_optimization_timeout"
> > would be a separate thing from the
> > "/proc/sys/fs/fuse/max_request_timeout". The
> > "/proc/sys/fs/fuse/writeback_optimization_timeout" would only apply
> > for writeback requests. I was thinking implementation-wise, for
> > writebacks we could just have a timer associated with each request
> > (instead of having to grab locks with the fuse lists), since they
> > won't be super common.
>
> Ah, thank you! I had missed that this is another variable. Issue
> with too short timeouts would probably be network hick-up that
> would immediately kill fuse-server. I.e. if it just the missing
> page compaction/migration, maybe larger time outs would be
> acceptable.
>
>
> Thanks,
> Bernd
Shakeel Butt Oct. 30, 2024, 9:56 p.m. UTC | #44
On Wed, Oct 30, 2024 at 10:35:47AM GMT, Joanne Koong wrote:
> On Wed, Oct 30, 2024 at 10:27 AM Bernd Schubert
> <bernd.schubert@fastmail.fm> wrote:
> >
> >
> > Hmm, if tmp pages can be compacted, isn't that a problem for splice?
> > I.e. I don't understand what the difference between tmp page and
> > write-back page for migration.
> >
> 
> That's a great question! I have no idea how compaction works for pages
> being used in splice. Shakeel, do you know the answer to this?
> 

Sorry for the late response. I still have to go through other unanswered
questions but let me answer this one quickly. From the way the tmp pages
are allocated, it does not seem like they are movable and thus are not
target for migration/compaction.

The page with the writeback bit set is actually just a user memory page
cache which is moveable but due to, at the moment, under writeback it
temporarily becomes unmovable to not cause corruption.

Shakeel
Bernd Schubert Oct. 30, 2024, 10:17 p.m. UTC | #45
On 10/30/24 22:56, Shakeel Butt wrote:
> On Wed, Oct 30, 2024 at 10:35:47AM GMT, Joanne Koong wrote:
>> On Wed, Oct 30, 2024 at 10:27 AM Bernd Schubert
>> <bernd.schubert@fastmail.fm> wrote:
>>>
>>>
>>> Hmm, if tmp pages can be compacted, isn't that a problem for splice?
>>> I.e. I don't understand what the difference between tmp page and
>>> write-back page for migration.
>>>
>>
>> That's a great question! I have no idea how compaction works for pages
>> being used in splice. Shakeel, do you know the answer to this?
>>
> 
> Sorry for the late response. I still have to go through other unanswered
> questions but let me answer this one quickly. From the way the tmp pages
> are allocated, it does not seem like they are movable and thus are not
> target for migration/compaction.
> 
> The page with the writeback bit set is actually just a user memory page
> cache which is moveable but due to, at the moment, under writeback it
> temporarily becomes unmovable to not cause corruption.

Thanks a lot for your quick reply Shakeel! (Actually very fast!).

With that, it confirms what I wrote earlier - removing tmp and ignoring
fuse writeback pages in migration should not make any difference 
regarding overall system performance. Unless I miss something,
more on the contrary as additional memory pressure expensive page
copying is being removed.


Thanks,
Bernd
Joanne Koong Oct. 30, 2024, 10:51 p.m. UTC | #46
On Wed, Oct 30, 2024 at 3:17 PM Bernd Schubert
<bernd.schubert@fastmail.fm> wrote:
>
>
>
> On 10/30/24 22:56, Shakeel Butt wrote:
> > On Wed, Oct 30, 2024 at 10:35:47AM GMT, Joanne Koong wrote:
> >> On Wed, Oct 30, 2024 at 10:27 AM Bernd Schubert
> >> <bernd.schubert@fastmail.fm> wrote:
> >>>
> >>>
> >>> Hmm, if tmp pages can be compacted, isn't that a problem for splice?
> >>> I.e. I don't understand what the difference between tmp page and
> >>> write-back page for migration.
> >>>
> >>
> >> That's a great question! I have no idea how compaction works for pages
> >> being used in splice. Shakeel, do you know the answer to this?
> >>
> >
> > Sorry for the late response. I still have to go through other unanswered
> > questions but let me answer this one quickly. From the way the tmp pages
> > are allocated, it does not seem like they are movable and thus are not
> > target for migration/compaction.
> >
> > The page with the writeback bit set is actually just a user memory page
> > cache which is moveable but due to, at the moment, under writeback it
> > temporarily becomes unmovable to not cause corruption.
>
> Thanks a lot for your quick reply Shakeel! (Actually very fast!).
>
> With that, it confirms what I wrote earlier - removing tmp and ignoring
> fuse writeback pages in migration should not make any difference
> regarding overall system performance. Unless I miss something,
> more on the contrary as additional memory pressure expensive page
> copying is being removed.
>

Thanks for the information Shakeel, and thanks Bernd for bringing up
this point of discussion.

Before I celebrate too prematurely, a few additional questions:

Are tmp pages (eg from folio_alloc(GFP_NOFS | __GFP_HIGHMEM, 0)) and
page cache pages allocated from the same memory pool? Or are tmp pages
allocated from a special memory pool that isn't meant to be
compacted/optimized?

If they are allocated from the same memory pool, then it seems like
there's no difference between tmp pages blocking a memory range from
being compacted vs. a page cache page blocking a memory range from
being compacted (by not clearing writeback). But if they are not
allocated from the same pool, then it seems like the page cache page
blocking migration could adversely affect general system performance
in a way that the tmp page doesn't?


Thanks,
Joanne
>
> Thanks,
> Bernd
Shakeel Butt Oct. 31, 2024, 12:30 a.m. UTC | #47
On Wed, Oct 30, 2024 at 03:51:08PM GMT, Joanne Koong wrote:
> On Wed, Oct 30, 2024 at 3:17 PM Bernd Schubert
> <bernd.schubert@fastmail.fm> wrote:
> >
> >
> >
> > On 10/30/24 22:56, Shakeel Butt wrote:
> > > On Wed, Oct 30, 2024 at 10:35:47AM GMT, Joanne Koong wrote:
> > >> On Wed, Oct 30, 2024 at 10:27 AM Bernd Schubert
> > >> <bernd.schubert@fastmail.fm> wrote:
> > >>>
> > >>>
> > >>> Hmm, if tmp pages can be compacted, isn't that a problem for splice?
> > >>> I.e. I don't understand what the difference between tmp page and
> > >>> write-back page for migration.
> > >>>
> > >>
> > >> That's a great question! I have no idea how compaction works for pages
> > >> being used in splice. Shakeel, do you know the answer to this?
> > >>
> > >
> > > Sorry for the late response. I still have to go through other unanswered
> > > questions but let me answer this one quickly. From the way the tmp pages
> > > are allocated, it does not seem like they are movable and thus are not
> > > target for migration/compaction.
> > >
> > > The page with the writeback bit set is actually just a user memory page
> > > cache which is moveable but due to, at the moment, under writeback it
> > > temporarily becomes unmovable to not cause corruption.
> >
> > Thanks a lot for your quick reply Shakeel! (Actually very fast!).
> >
> > With that, it confirms what I wrote earlier - removing tmp and ignoring
> > fuse writeback pages in migration should not make any difference
> > regarding overall system performance. Unless I miss something,
> > more on the contrary as additional memory pressure expensive page
> > copying is being removed.
> >
> 
> Thanks for the information Shakeel, and thanks Bernd for bringing up
> this point of discussion.
> 
> Before I celebrate too prematurely, a few additional questions:

You are asking hard questions, so CCed couple more folks to correct me
if I am wrong.

> 
> Are tmp pages (eg from folio_alloc(GFP_NOFS | __GFP_HIGHMEM, 0)) and
> page cache pages allocated from the same memory pool? Or are tmp pages
> allocated from a special memory pool that isn't meant to be
> compacted/optimized?

Memory pool is a bit confusing term here. Most probably you are asking
about the migrate type of the page block from which tmp page is
allocated from. In a normal system, tmp page would be allocated from page
block with MIGRATE_UNMOVABLE migrate type while the page cache page, it
depends on what gfp flag was used for its allocation. What does fuse fs
use? GFP_HIGHUSER_MOVABLE or something else? Under low memory situation
allocations can get mixed up with different migrate types.

> 
> If they are allocated from the same memory pool, then it seems like
> there's no difference between tmp pages blocking a memory range from
> being compacted vs. a page cache page blocking a memory range from
> being compacted (by not clearing writeback). But if they are not
> allocated from the same pool, then it seems like the page cache page
> blocking migration could adversely affect general system performance
> in a way that the tmp page doesn't?

I think irrespective of where the page is coming from, the page under
writeback is non-movable and can fragment the memory. The question that
is that worse than a tmp page fragmenting the memory, I am not sure.
Joanne Koong Oct. 31, 2024, 7:06 p.m. UTC | #48
On Wed, Oct 30, 2024 at 5:30 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> On Wed, Oct 30, 2024 at 03:51:08PM GMT, Joanne Koong wrote:
> > On Wed, Oct 30, 2024 at 3:17 PM Bernd Schubert
> > <bernd.schubert@fastmail.fm> wrote:
> > >
> > >
> > >
> > > On 10/30/24 22:56, Shakeel Butt wrote:
> > > > On Wed, Oct 30, 2024 at 10:35:47AM GMT, Joanne Koong wrote:
> > > >> On Wed, Oct 30, 2024 at 10:27 AM Bernd Schubert
> > > >> <bernd.schubert@fastmail.fm> wrote:
> > > >>>
> > > >>>
> > > >>> Hmm, if tmp pages can be compacted, isn't that a problem for splice?
> > > >>> I.e. I don't understand what the difference between tmp page and
> > > >>> write-back page for migration.
> > > >>>
> > > >>
> > > >> That's a great question! I have no idea how compaction works for pages
> > > >> being used in splice. Shakeel, do you know the answer to this?
> > > >>
> > > >
> > > > Sorry for the late response. I still have to go through other unanswered
> > > > questions but let me answer this one quickly. From the way the tmp pages
> > > > are allocated, it does not seem like they are movable and thus are not
> > > > target for migration/compaction.
> > > >
> > > > The page with the writeback bit set is actually just a user memory page
> > > > cache which is moveable but due to, at the moment, under writeback it
> > > > temporarily becomes unmovable to not cause corruption.
> > >
> > > Thanks a lot for your quick reply Shakeel! (Actually very fast!).
> > >
> > > With that, it confirms what I wrote earlier - removing tmp and ignoring
> > > fuse writeback pages in migration should not make any difference
> > > regarding overall system performance. Unless I miss something,
> > > more on the contrary as additional memory pressure expensive page
> > > copying is being removed.
> > >
> >
> > Thanks for the information Shakeel, and thanks Bernd for bringing up
> > this point of discussion.
> >
> > Before I celebrate too prematurely, a few additional questions:
>
> You are asking hard questions, so CCed couple more folks to correct me
> if I am wrong.
>
> >
> > Are tmp pages (eg from folio_alloc(GFP_NOFS | __GFP_HIGHMEM, 0)) and
> > page cache pages allocated from the same memory pool? Or are tmp pages
> > allocated from a special memory pool that isn't meant to be
> > compacted/optimized?
>
> Memory pool is a bit confusing term here. Most probably you are asking
> about the migrate type of the page block from which tmp page is
> allocated from. In a normal system, tmp page would be allocated from page
> block with MIGRATE_UNMOVABLE migrate type while the page cache page, it
> depends on what gfp flag was used for its allocation. What does fuse fs
> use? GFP_HIGHUSER_MOVABLE or something else? Under low memory situation
> allocations can get mixed up with different migrate types.
>

I believe it's GFP_HIGHUSER_MOVABLE for the page cache pages since
fuse doesn't set any additional gfp masks on the inode mapping.

Could we just allocate the fuse writeback pages with GFP_HIGHUSER
instead of GFP_HIGHUSER_MOVABLE? That would be in fuse_write_begin()
where we pass in the gfp mask to __filemap_get_folio(). I think this
would give us the same behavior memory-wise as what the tmp pages
currently do, and would solve all our headaches regarding writeback
potentially blocking page migration/compaction.



Thanks,
Joanne

> >
> > If they are allocated from the same memory pool, then it seems like
> > there's no difference between tmp pages blocking a memory range from
> > being compacted vs. a page cache page blocking a memory range from
> > being compacted (by not clearing writeback). But if they are not
> > allocated from the same pool, then it seems like the page cache page
> > blocking migration could adversely affect general system performance
> > in a way that the tmp page doesn't?
>
> I think irrespective of where the page is coming from, the page under
> writeback is non-movable and can fragment the memory. The question that
> is that worse than a tmp page fragmenting the memory, I am not sure.
>
Shakeel Butt Oct. 31, 2024, 8:06 p.m. UTC | #49
On Thu, Oct 31, 2024 at 12:06:49PM GMT, Joanne Koong wrote:
> On Wed, Oct 30, 2024 at 5:30 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
[...]
> >
> > Memory pool is a bit confusing term here. Most probably you are asking
> > about the migrate type of the page block from which tmp page is
> > allocated from. In a normal system, tmp page would be allocated from page
> > block with MIGRATE_UNMOVABLE migrate type while the page cache page, it
> > depends on what gfp flag was used for its allocation. What does fuse fs
> > use? GFP_HIGHUSER_MOVABLE or something else? Under low memory situation
> > allocations can get mixed up with different migrate types.
> >
> 
> I believe it's GFP_HIGHUSER_MOVABLE for the page cache pages since
> fuse doesn't set any additional gfp masks on the inode mapping.
> 
> Could we just allocate the fuse writeback pages with GFP_HIGHUSER
> instead of GFP_HIGHUSER_MOVABLE? That would be in fuse_write_begin()
> where we pass in the gfp mask to __filemap_get_folio(). I think this
> would give us the same behavior memory-wise as what the tmp pages
> currently do,

I don't think it would be the same behavior. From what I understand the
liftime of the tmp page is from the start of the writeback till the ack
from the fuse server that writeback is done. While the lifetime of the
page of the page cache can be arbitrarily large. We should just make it
unmovable for its lifetime. I think it is fine to make the page
unmovable during the writeback. We should not try to optimize for the
bad or buggy behavior of fuse server.

Regarding the avoidance of wait on writeback for fuse folios, I think we
can handle the migration similar to how you are handling reclaim and in
addition we can add a WARN() in folio_wait_writeback() if the kernel ever
sees a fuse folio in that function.
Joanne Koong Oct. 31, 2024, 9:52 p.m. UTC | #50
On Thu, Oct 31, 2024 at 1:06 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> On Thu, Oct 31, 2024 at 12:06:49PM GMT, Joanne Koong wrote:
> > On Wed, Oct 30, 2024 at 5:30 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> [...]
> > >
> > > Memory pool is a bit confusing term here. Most probably you are asking
> > > about the migrate type of the page block from which tmp page is
> > > allocated from. In a normal system, tmp page would be allocated from page
> > > block with MIGRATE_UNMOVABLE migrate type while the page cache page, it
> > > depends on what gfp flag was used for its allocation. What does fuse fs
> > > use? GFP_HIGHUSER_MOVABLE or something else? Under low memory situation
> > > allocations can get mixed up with different migrate types.
> > >
> >
> > I believe it's GFP_HIGHUSER_MOVABLE for the page cache pages since
> > fuse doesn't set any additional gfp masks on the inode mapping.
> >
> > Could we just allocate the fuse writeback pages with GFP_HIGHUSER
> > instead of GFP_HIGHUSER_MOVABLE? That would be in fuse_write_begin()
> > where we pass in the gfp mask to __filemap_get_folio(). I think this
> > would give us the same behavior memory-wise as what the tmp pages
> > currently do,
>
> I don't think it would be the same behavior. From what I understand the
> liftime of the tmp page is from the start of the writeback till the ack
> from the fuse server that writeback is done. While the lifetime of the
> page of the page cache can be arbitrarily large. We should just make it
> unmovable for its lifetime. I think it is fine to make the page
> unmovable during the writeback. We should not try to optimize for the
> bad or buggy behavior of fuse server.
>
> Regarding the avoidance of wait on writeback for fuse folios, I think we
> can handle the migration similar to how you are handling reclaim and in
> addition we can add a WARN() in folio_wait_writeback() if the kernel ever
> sees a fuse folio in that function.

Awesome, this is what I'm planning to do in v3 to address migration then:

1) in migrate_folio_unmap(), only call "folio_wait_writeback(src);" if
src->mapping does not have the AS_NO_WRITEBACK_WAIT bit set on it (eg
fuse folios will have that AS_NO_WRITEBACK_WAIT bit set)

2) in the fuse filesystem's implementation of the
mapping->a_ops->migrate_folio callback, return -EAGAIN if the folio is
under writeback.

Does this sound good?


Thanks,
Joanne
Shakeel Butt Oct. 31, 2024, 10:38 p.m. UTC | #51
On Thu, Oct 31, 2024 at 02:52:57PM GMT, Joanne Koong wrote:
> On Thu, Oct 31, 2024 at 1:06 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> >
> > On Thu, Oct 31, 2024 at 12:06:49PM GMT, Joanne Koong wrote:
> > > On Wed, Oct 30, 2024 at 5:30 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > [...]
> > > >
> > > > Memory pool is a bit confusing term here. Most probably you are asking
> > > > about the migrate type of the page block from which tmp page is
> > > > allocated from. In a normal system, tmp page would be allocated from page
> > > > block with MIGRATE_UNMOVABLE migrate type while the page cache page, it
> > > > depends on what gfp flag was used for its allocation. What does fuse fs
> > > > use? GFP_HIGHUSER_MOVABLE or something else? Under low memory situation
> > > > allocations can get mixed up with different migrate types.
> > > >
> > >
> > > I believe it's GFP_HIGHUSER_MOVABLE for the page cache pages since
> > > fuse doesn't set any additional gfp masks on the inode mapping.
> > >
> > > Could we just allocate the fuse writeback pages with GFP_HIGHUSER
> > > instead of GFP_HIGHUSER_MOVABLE? That would be in fuse_write_begin()
> > > where we pass in the gfp mask to __filemap_get_folio(). I think this
> > > would give us the same behavior memory-wise as what the tmp pages
> > > currently do,
> >
> > I don't think it would be the same behavior. From what I understand the
> > liftime of the tmp page is from the start of the writeback till the ack
> > from the fuse server that writeback is done. While the lifetime of the
> > page of the page cache can be arbitrarily large. We should just make it
> > unmovable for its lifetime. I think it is fine to make the page
> > unmovable during the writeback. We should not try to optimize for the
> > bad or buggy behavior of fuse server.
> >
> > Regarding the avoidance of wait on writeback for fuse folios, I think we
> > can handle the migration similar to how you are handling reclaim and in
> > addition we can add a WARN() in folio_wait_writeback() if the kernel ever
> > sees a fuse folio in that function.
> 
> Awesome, this is what I'm planning to do in v3 to address migration then:
> 
> 1) in migrate_folio_unmap(), only call "folio_wait_writeback(src);" if
> src->mapping does not have the AS_NO_WRITEBACK_WAIT bit set on it (eg
> fuse folios will have that AS_NO_WRITEBACK_WAIT bit set)
> 
> 2) in the fuse filesystem's implementation of the
> mapping->a_ops->migrate_folio callback, return -EAGAIN if the folio is
> under writeback.

3) Add WARN_ONCE() in folio_wait_writeback() if folio->mapping has
AS_NO_WRITEBACK_WAIT set and return without waiting.

> 
> Does this sound good?

Yes.
Jingbo Xu Nov. 1, 2024, 11:44 a.m. UTC | #52
Hi Joanne,

Thanks for keeping pushing this forward.

On 11/1/24 5:52 AM, Joanne Koong wrote:
> On Thu, Oct 31, 2024 at 1:06 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>>
>> On Thu, Oct 31, 2024 at 12:06:49PM GMT, Joanne Koong wrote:
>>> On Wed, Oct 30, 2024 at 5:30 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>> [...]
>>>>
>>>> Memory pool is a bit confusing term here. Most probably you are asking
>>>> about the migrate type of the page block from which tmp page is
>>>> allocated from. In a normal system, tmp page would be allocated from page
>>>> block with MIGRATE_UNMOVABLE migrate type while the page cache page, it
>>>> depends on what gfp flag was used for its allocation. What does fuse fs
>>>> use? GFP_HIGHUSER_MOVABLE or something else? Under low memory situation
>>>> allocations can get mixed up with different migrate types.
>>>>
>>>
>>> I believe it's GFP_HIGHUSER_MOVABLE for the page cache pages since
>>> fuse doesn't set any additional gfp masks on the inode mapping.
>>>
>>> Could we just allocate the fuse writeback pages with GFP_HIGHUSER
>>> instead of GFP_HIGHUSER_MOVABLE? That would be in fuse_write_begin()
>>> where we pass in the gfp mask to __filemap_get_folio(). I think this
>>> would give us the same behavior memory-wise as what the tmp pages
>>> currently do,
>>
>> I don't think it would be the same behavior. From what I understand the
>> liftime of the tmp page is from the start of the writeback till the ack
>> from the fuse server that writeback is done. While the lifetime of the
>> page of the page cache can be arbitrarily large. We should just make it
>> unmovable for its lifetime. I think it is fine to make the page
>> unmovable during the writeback. We should not try to optimize for the
>> bad or buggy behavior of fuse server.
>>
>> Regarding the avoidance of wait on writeback for fuse folios, I think we
>> can handle the migration similar to how you are handling reclaim and in
>> addition we can add a WARN() in folio_wait_writeback() if the kernel ever
>> sees a fuse folio in that function.
> 
> Awesome, this is what I'm planning to do in v3 to address migration then:
> 
> 1) in migrate_folio_unmap(), only call "folio_wait_writeback(src);" if
> src->mapping does not have the AS_NO_WRITEBACK_WAIT bit set on it (eg
> fuse folios will have that AS_NO_WRITEBACK_WAIT bit set)

I think it's generally okay to skip FUSE pages under writeback when the
sync migrate_pages() is called in low memory context, which only tries
to migrate as many pages as possible (i.e. best effort).

While more caution may be needed when the sync migrate_pages() is called
with an implicit hint that the migration can not fail.  For example,

```
offline_pages
	while {
		scan_movable_pages
		do_migrate_range
	}
```

If the malicious server never completes the writeback IO, no progress
will be made in the above while loop, and I'm afraid it will be a dead
loop then.


> 
> 2) in the fuse filesystem's implementation of the
> mapping->a_ops->migrate_folio callback, return -EAGAIN if the folio is
> under writeback.

Is there any possibility that a_ops->migrate_folio() may be called with
the folio under writeback?

- for most pages without AS_NO_WRITEBACK_WAIT, a_ops->migrate_folio()
will be called only when Page_writeback is cleared;
- for AS_NO_WRITEBACK_WAIT pages, they are skipped if they are under
writeback
Joanne Koong Nov. 1, 2024, 8:54 p.m. UTC | #53
On Fri, Nov 1, 2024 at 4:44 AM Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
>
> Hi Joanne,
>
> Thanks for keeping pushing this forward.
>
> On 11/1/24 5:52 AM, Joanne Koong wrote:
> > On Thu, Oct 31, 2024 at 1:06 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> >>
> >> On Thu, Oct 31, 2024 at 12:06:49PM GMT, Joanne Koong wrote:
> >>> On Wed, Oct 30, 2024 at 5:30 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> >> [...]
> >>>>
> >>>> Memory pool is a bit confusing term here. Most probably you are asking
> >>>> about the migrate type of the page block from which tmp page is
> >>>> allocated from. In a normal system, tmp page would be allocated from page
> >>>> block with MIGRATE_UNMOVABLE migrate type while the page cache page, it
> >>>> depends on what gfp flag was used for its allocation. What does fuse fs
> >>>> use? GFP_HIGHUSER_MOVABLE or something else? Under low memory situation
> >>>> allocations can get mixed up with different migrate types.
> >>>>
> >>>
> >>> I believe it's GFP_HIGHUSER_MOVABLE for the page cache pages since
> >>> fuse doesn't set any additional gfp masks on the inode mapping.
> >>>
> >>> Could we just allocate the fuse writeback pages with GFP_HIGHUSER
> >>> instead of GFP_HIGHUSER_MOVABLE? That would be in fuse_write_begin()
> >>> where we pass in the gfp mask to __filemap_get_folio(). I think this
> >>> would give us the same behavior memory-wise as what the tmp pages
> >>> currently do,
> >>
> >> I don't think it would be the same behavior. From what I understand the
> >> liftime of the tmp page is from the start of the writeback till the ack
> >> from the fuse server that writeback is done. While the lifetime of the
> >> page of the page cache can be arbitrarily large. We should just make it
> >> unmovable for its lifetime. I think it is fine to make the page
> >> unmovable during the writeback. We should not try to optimize for the
> >> bad or buggy behavior of fuse server.
> >>
> >> Regarding the avoidance of wait on writeback for fuse folios, I think we
> >> can handle the migration similar to how you are handling reclaim and in
> >> addition we can add a WARN() in folio_wait_writeback() if the kernel ever
> >> sees a fuse folio in that function.
> >
> > Awesome, this is what I'm planning to do in v3 to address migration then:
> >
> > 1) in migrate_folio_unmap(), only call "folio_wait_writeback(src);" if
> > src->mapping does not have the AS_NO_WRITEBACK_WAIT bit set on it (eg
> > fuse folios will have that AS_NO_WRITEBACK_WAIT bit set)
>
> I think it's generally okay to skip FUSE pages under writeback when the
> sync migrate_pages() is called in low memory context, which only tries
> to migrate as many pages as possible (i.e. best effort).
>
> While more caution may be needed when the sync migrate_pages() is called
> with an implicit hint that the migration can not fail.  For example,
>
> ```
> offline_pages
>         while {
>                 scan_movable_pages
>                 do_migrate_range
>         }
> ```
>
> If the malicious server never completes the writeback IO, no progress
> will be made in the above while loop, and I'm afraid it will be a dead
> loop then.
>

Thanks for taking a look and sharing your thoughts.
I agree. I think for this offline_pages() path, we need to handle this
"TODO: fatal migration failures should bail out". For v3 I'm thinking
of handling this by having some number of retries where we try
do_migrate_range() but if it still doesn't succeed, to skip those
pages and move onto the next.

>
> >
> > 2) in the fuse filesystem's implementation of the
> > mapping->a_ops->migrate_folio callback, return -EAGAIN if the folio is
> > under writeback.
>
> Is there any possibility that a_ops->migrate_folio() may be called with
> the folio under writeback?
>
> - for most pages without AS_NO_WRITEBACK_WAIT, a_ops->migrate_folio()
> will be called only when Page_writeback is cleared;
> - for AS_NO_WRITEBACK_WAIT pages, they are skipped if they are under
> writeback
>

For AS_NO_WRITEBACK_WAIT_PAGES, if we skip waiting on them if they are
under writeback, I think the a_ops->migrate_folio() will still get
called (by migrate_pages_batch() -> migrate_folio_move() ->
move_to_new_folio()).

Looking at migrate_folio_unmap() some more though,  I don't think we
can just skip the wait call like we can for the sync(2) case. I think
we need to error out here instead since after the wait call,
migrate_folio_unmap() will replace the folio's page table mappings
(try_to_migrate()). If we error out here, then there's no hitting
a_ops->migrate_folio() when the folio is under writeback.


Thanks,
Joanne
> --
> Thanks,
> Jingbo
Jingbo Xu Nov. 4, 2024, 8:09 a.m. UTC | #54
On 11/2/24 4:54 AM, Joanne Koong wrote:
> On Fri, Nov 1, 2024 at 4:44 AM Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
>>
>> Hi Joanne,
>>
>> Thanks for keeping pushing this forward.
>>
>> On 11/1/24 5:52 AM, Joanne Koong wrote:
>>> On Thu, Oct 31, 2024 at 1:06 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>>>>
>>>> On Thu, Oct 31, 2024 at 12:06:49PM GMT, Joanne Koong wrote:
>>>>> On Wed, Oct 30, 2024 at 5:30 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>>>> [...]
>>>>>>
>>>>>> Memory pool is a bit confusing term here. Most probably you are asking
>>>>>> about the migrate type of the page block from which tmp page is
>>>>>> allocated from. In a normal system, tmp page would be allocated from page
>>>>>> block with MIGRATE_UNMOVABLE migrate type while the page cache page, it
>>>>>> depends on what gfp flag was used for its allocation. What does fuse fs
>>>>>> use? GFP_HIGHUSER_MOVABLE or something else? Under low memory situation
>>>>>> allocations can get mixed up with different migrate types.
>>>>>>
>>>>>
>>>>> I believe it's GFP_HIGHUSER_MOVABLE for the page cache pages since
>>>>> fuse doesn't set any additional gfp masks on the inode mapping.
>>>>>
>>>>> Could we just allocate the fuse writeback pages with GFP_HIGHUSER
>>>>> instead of GFP_HIGHUSER_MOVABLE? That would be in fuse_write_begin()
>>>>> where we pass in the gfp mask to __filemap_get_folio(). I think this
>>>>> would give us the same behavior memory-wise as what the tmp pages
>>>>> currently do,
>>>>
>>>> I don't think it would be the same behavior. From what I understand the
>>>> liftime of the tmp page is from the start of the writeback till the ack
>>>> from the fuse server that writeback is done. While the lifetime of the
>>>> page of the page cache can be arbitrarily large. We should just make it
>>>> unmovable for its lifetime. I think it is fine to make the page
>>>> unmovable during the writeback. We should not try to optimize for the
>>>> bad or buggy behavior of fuse server.
>>>>
>>>> Regarding the avoidance of wait on writeback for fuse folios, I think we
>>>> can handle the migration similar to how you are handling reclaim and in
>>>> addition we can add a WARN() in folio_wait_writeback() if the kernel ever
>>>> sees a fuse folio in that function.
>>>
>>> Awesome, this is what I'm planning to do in v3 to address migration then:
>>>
>>> 1) in migrate_folio_unmap(), only call "folio_wait_writeback(src);" if
>>> src->mapping does not have the AS_NO_WRITEBACK_WAIT bit set on it (eg
>>> fuse folios will have that AS_NO_WRITEBACK_WAIT bit set)
>>
>> I think it's generally okay to skip FUSE pages under writeback when the
>> sync migrate_pages() is called in low memory context, which only tries
>> to migrate as many pages as possible (i.e. best effort).
>>
>> While more caution may be needed when the sync migrate_pages() is called
>> with an implicit hint that the migration can not fail.  For example,
>>
>> ```
>> offline_pages
>>         while {
>>                 scan_movable_pages
>>                 do_migrate_range
>>         }
>> ```
>>
>> If the malicious server never completes the writeback IO, no progress
>> will be made in the above while loop, and I'm afraid it will be a dead
>> loop then.
>>
> 
> Thanks for taking a look and sharing your thoughts.
> I agree. I think for this offline_pages() path, we need to handle this
> "TODO: fatal migration failures should bail out". For v3 I'm thinking
> of handling this by having some number of retries where we try
> do_migrate_range() but if it still doesn't succeed, to skip those
> pages and move onto the next.
> 
>>
>>>
>>> 2) in the fuse filesystem's implementation of the
>>> mapping->a_ops->migrate_folio callback, return -EAGAIN if the folio is
>>> under writeback.
>>
>> Is there any possibility that a_ops->migrate_folio() may be called with
>> the folio under writeback?
>>
>> - for most pages without AS_NO_WRITEBACK_WAIT, a_ops->migrate_folio()
>> will be called only when Page_writeback is cleared;
>> - for AS_NO_WRITEBACK_WAIT pages, they are skipped if they are under
>> writeback
>>
> 
> For AS_NO_WRITEBACK_WAIT_PAGES, if we skip waiting on them if they are
> under writeback, I think the a_ops->migrate_folio() will still get
> called (by migrate_pages_batch() -> migrate_folio_move() ->
> move_to_new_folio()).
> 
> Looking at migrate_folio_unmap() some more though,  I don't think we
> can just skip the wait call like we can for the sync(2) case. I think
> we need to error out here instead since after the wait call,
> migrate_folio_unmap() will replace the folio's page table mappings
> (try_to_migrate()). If we error out here, then there's no hitting
> a_ops->migrate_folio() when the folio is under writeback.
> 

Right, we need to bail out to skip this page (under writeback), just
like how MIGRATE_SYNC does.
Joanne Koong Nov. 6, 2024, 11:37 p.m. UTC | #55
On Thu, Oct 31, 2024 at 3:38 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> On Thu, Oct 31, 2024 at 02:52:57PM GMT, Joanne Koong wrote:
> > On Thu, Oct 31, 2024 at 1:06 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > >
> > > On Thu, Oct 31, 2024 at 12:06:49PM GMT, Joanne Koong wrote:
> > > > On Wed, Oct 30, 2024 at 5:30 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > > [...]
> > > > >
> > > > > Memory pool is a bit confusing term here. Most probably you are asking
> > > > > about the migrate type of the page block from which tmp page is
> > > > > allocated from. In a normal system, tmp page would be allocated from page
> > > > > block with MIGRATE_UNMOVABLE migrate type while the page cache page, it
> > > > > depends on what gfp flag was used for its allocation. What does fuse fs
> > > > > use? GFP_HIGHUSER_MOVABLE or something else? Under low memory situation
> > > > > allocations can get mixed up with different migrate types.
> > > > >
> > > >
> > > > I believe it's GFP_HIGHUSER_MOVABLE for the page cache pages since
> > > > fuse doesn't set any additional gfp masks on the inode mapping.
> > > >
> > > > Could we just allocate the fuse writeback pages with GFP_HIGHUSER
> > > > instead of GFP_HIGHUSER_MOVABLE? That would be in fuse_write_begin()
> > > > where we pass in the gfp mask to __filemap_get_folio(). I think this
> > > > would give us the same behavior memory-wise as what the tmp pages
> > > > currently do,
> > >
> > > I don't think it would be the same behavior. From what I understand the
> > > liftime of the tmp page is from the start of the writeback till the ack
> > > from the fuse server that writeback is done. While the lifetime of the
> > > page of the page cache can be arbitrarily large. We should just make it
> > > unmovable for its lifetime. I think it is fine to make the page
> > > unmovable during the writeback. We should not try to optimize for the
> > > bad or buggy behavior of fuse server.
> > >
> > > Regarding the avoidance of wait on writeback for fuse folios, I think we
> > > can handle the migration similar to how you are handling reclaim and in
> > > addition we can add a WARN() in folio_wait_writeback() if the kernel ever
> > > sees a fuse folio in that function.
> >
> > Awesome, this is what I'm planning to do in v3 to address migration then:
> >
> > 1) in migrate_folio_unmap(), only call "folio_wait_writeback(src);" if
> > src->mapping does not have the AS_NO_WRITEBACK_WAIT bit set on it (eg
> > fuse folios will have that AS_NO_WRITEBACK_WAIT bit set)
> >
> > 2) in the fuse filesystem's implementation of the
> > mapping->a_ops->migrate_folio callback, return -EAGAIN if the folio is
> > under writeback.
>
> 3) Add WARN_ONCE() in folio_wait_writeback() if folio->mapping has
> AS_NO_WRITEBACK_WAIT set and return without waiting.

For v3, I'm going to change AS_NO_WRITEBACK_RECLAIM to
AS_WRITEBACK_MAY_BLOCK and skip 3) because 3) may be too restrictive.
For example, for the sync_file_range() syscall, we do want to wait on
writeback - it's ok in this case to call folio_wait_writeback() on a
fuse folio since the caller would have intentionally passed in a fuse
fd to sync_file_range().


Thanks,
Joanne

>
> >
> > Does this sound good?
>
> Yes.
Shakeel Butt Nov. 6, 2024, 11:56 p.m. UTC | #56
On Wed, Nov 06, 2024 at 03:37:11PM -0800, Joanne Koong wrote:
> On Thu, Oct 31, 2024 at 3:38 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> >
> > On Thu, Oct 31, 2024 at 02:52:57PM GMT, Joanne Koong wrote:
> > > On Thu, Oct 31, 2024 at 1:06 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > > >
> > > > On Thu, Oct 31, 2024 at 12:06:49PM GMT, Joanne Koong wrote:
> > > > > On Wed, Oct 30, 2024 at 5:30 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > > > [...]
> > > > > >
> > > > > > Memory pool is a bit confusing term here. Most probably you are asking
> > > > > > about the migrate type of the page block from which tmp page is
> > > > > > allocated from. In a normal system, tmp page would be allocated from page
> > > > > > block with MIGRATE_UNMOVABLE migrate type while the page cache page, it
> > > > > > depends on what gfp flag was used for its allocation. What does fuse fs
> > > > > > use? GFP_HIGHUSER_MOVABLE or something else? Under low memory situation
> > > > > > allocations can get mixed up with different migrate types.
> > > > > >
> > > > >
> > > > > I believe it's GFP_HIGHUSER_MOVABLE for the page cache pages since
> > > > > fuse doesn't set any additional gfp masks on the inode mapping.
> > > > >
> > > > > Could we just allocate the fuse writeback pages with GFP_HIGHUSER
> > > > > instead of GFP_HIGHUSER_MOVABLE? That would be in fuse_write_begin()
> > > > > where we pass in the gfp mask to __filemap_get_folio(). I think this
> > > > > would give us the same behavior memory-wise as what the tmp pages
> > > > > currently do,
> > > >
> > > > I don't think it would be the same behavior. From what I understand the
> > > > liftime of the tmp page is from the start of the writeback till the ack
> > > > from the fuse server that writeback is done. While the lifetime of the
> > > > page of the page cache can be arbitrarily large. We should just make it
> > > > unmovable for its lifetime. I think it is fine to make the page
> > > > unmovable during the writeback. We should not try to optimize for the
> > > > bad or buggy behavior of fuse server.
> > > >
> > > > Regarding the avoidance of wait on writeback for fuse folios, I think we
> > > > can handle the migration similar to how you are handling reclaim and in
> > > > addition we can add a WARN() in folio_wait_writeback() if the kernel ever
> > > > sees a fuse folio in that function.
> > >
> > > Awesome, this is what I'm planning to do in v3 to address migration then:
> > >
> > > 1) in migrate_folio_unmap(), only call "folio_wait_writeback(src);" if
> > > src->mapping does not have the AS_NO_WRITEBACK_WAIT bit set on it (eg
> > > fuse folios will have that AS_NO_WRITEBACK_WAIT bit set)
> > >
> > > 2) in the fuse filesystem's implementation of the
> > > mapping->a_ops->migrate_folio callback, return -EAGAIN if the folio is
> > > under writeback.
> >
> > 3) Add WARN_ONCE() in folio_wait_writeback() if folio->mapping has
> > AS_NO_WRITEBACK_WAIT set and return without waiting.
> 
> For v3, I'm going to change AS_NO_WRITEBACK_RECLAIM to
> AS_WRITEBACK_MAY_BLOCK and skip 3) because 3) may be too restrictive.
> For example, for the sync_file_range() syscall, we do want to wait on
> writeback - it's ok in this case to call folio_wait_writeback() on a
> fuse folio since the caller would have intentionally passed in a fuse
> fd to sync_file_range().
> 

Sounds good.
diff mbox series

Patch

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 4304e44f32e6..c3c5ddb4f66b 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -415,74 +415,11 @@  u64 fuse_lock_owner_id(struct fuse_conn *fc, fl_owner_t id)
 
 struct fuse_writepage_args {
 	struct fuse_io_args ia;
-	struct rb_node writepages_entry;
 	struct list_head queue_entry;
-	struct fuse_writepage_args *next;
 	struct inode *inode;
 	struct fuse_sync_bucket *bucket;
 };
 
-static struct fuse_writepage_args *fuse_find_writeback(struct fuse_inode *fi,
-					    pgoff_t idx_from, pgoff_t idx_to)
-{
-	struct rb_node *n;
-
-	n = fi->writepages.rb_node;
-
-	while (n) {
-		struct fuse_writepage_args *wpa;
-		pgoff_t curr_index;
-
-		wpa = rb_entry(n, struct fuse_writepage_args, writepages_entry);
-		WARN_ON(get_fuse_inode(wpa->inode) != fi);
-		curr_index = wpa->ia.write.in.offset >> PAGE_SHIFT;
-		if (idx_from >= curr_index + wpa->ia.ap.num_pages)
-			n = n->rb_right;
-		else if (idx_to < curr_index)
-			n = n->rb_left;
-		else
-			return wpa;
-	}
-	return NULL;
-}
-
-/*
- * Check if any page in a range is under writeback
- */
-static bool fuse_range_is_writeback(struct inode *inode, pgoff_t idx_from,
-				   pgoff_t idx_to)
-{
-	struct fuse_inode *fi = get_fuse_inode(inode);
-	bool found;
-
-	if (RB_EMPTY_ROOT(&fi->writepages))
-		return false;
-
-	spin_lock(&fi->lock);
-	found = fuse_find_writeback(fi, idx_from, idx_to);
-	spin_unlock(&fi->lock);
-
-	return found;
-}
-
-static inline bool fuse_page_is_writeback(struct inode *inode, pgoff_t index)
-{
-	return fuse_range_is_writeback(inode, index, index);
-}
-
-/*
- * Wait for page writeback to be completed.
- *
- * Since fuse doesn't rely on the VM writeback tracking, this has to
- * use some other means.
- */
-static void fuse_wait_on_page_writeback(struct inode *inode, pgoff_t index)
-{
-	struct fuse_inode *fi = get_fuse_inode(inode);
-
-	wait_event(fi->page_waitq, !fuse_page_is_writeback(inode, index));
-}
-
 /*
  * Wait for all pending writepages on the inode to finish.
  *
@@ -876,7 +813,7 @@  static int fuse_do_readpage(struct file *file, struct page *page)
 	 * page-cache page, so make sure we read a properly synced
 	 * page.
 	 */
-	fuse_wait_on_page_writeback(inode, page->index);
+	folio_wait_writeback(page_folio(page));
 
 	attr_ver = fuse_get_attr_version(fm->fc);
 
@@ -1024,8 +961,7 @@  static void fuse_readahead(struct readahead_control *rac)
 		ap = &ia->ap;
 		nr_pages = __readahead_batch(rac, ap->pages, nr_pages);
 		for (i = 0; i < nr_pages; i++) {
-			fuse_wait_on_page_writeback(inode,
-						    readahead_index(rac) + i);
+			folio_wait_writeback(page_folio(ap->pages[i]));
 			ap->descs[i].length = PAGE_SIZE;
 		}
 		ap->num_pages = nr_pages;
@@ -1147,7 +1083,7 @@  static ssize_t fuse_send_write_pages(struct fuse_io_args *ia,
 	int err;
 
 	for (i = 0; i < ap->num_pages; i++)
-		fuse_wait_on_page_writeback(inode, ap->pages[i]->index);
+		folio_wait_writeback(page_folio(ap->pages[i]));
 
 	fuse_write_args_fill(ia, ff, pos, count);
 	ia->write.in.flags = fuse_write_flags(iocb);
@@ -1583,7 +1519,7 @@  ssize_t fuse_direct_io(struct fuse_io_priv *io, struct iov_iter *iter,
 			return res;
 		}
 	}
-	if (!cuse && fuse_range_is_writeback(inode, idx_from, idx_to)) {
+	if (!cuse && filemap_range_has_writeback(mapping, pos, (pos + count - 1))) {
 		if (!write)
 			inode_lock(inode);
 		fuse_sync_writes(inode);
@@ -1780,13 +1716,17 @@  static ssize_t fuse_splice_write(struct pipe_inode_info *pipe, struct file *out,
 static void fuse_writepage_free(struct fuse_writepage_args *wpa)
 {
 	struct fuse_args_pages *ap = &wpa->ia.ap;
+	struct folio *folio;
 	int i;
 
 	if (wpa->bucket)
 		fuse_sync_bucket_dec(wpa->bucket);
 
-	for (i = 0; i < ap->num_pages; i++)
-		__free_page(ap->pages[i]);
+	for (i = 0; i < ap->num_pages; i++) {
+		folio = page_folio(ap->pages[i]);
+		folio_end_writeback(folio);
+		folio_put(folio);
+	}
 
 	fuse_file_put(wpa->ia.ff, false);
 
@@ -1799,7 +1739,7 @@  static void fuse_writepage_finish_stat(struct inode *inode, struct page *page)
 	struct backing_dev_info *bdi = inode_to_bdi(inode);
 
 	dec_wb_stat(&bdi->wb, WB_WRITEBACK);
-	dec_node_page_state(page, NR_WRITEBACK_TEMP);
+	dec_node_page_state(page, NR_WRITEBACK);
 	wb_writeout_inc(&bdi->wb);
 }
 
@@ -1822,7 +1762,6 @@  static void fuse_send_writepage(struct fuse_mount *fm,
 __releases(fi->lock)
 __acquires(fi->lock)
 {
-	struct fuse_writepage_args *aux, *next;
 	struct fuse_inode *fi = get_fuse_inode(wpa->inode);
 	struct fuse_write_in *inarg = &wpa->ia.write.in;
 	struct fuse_args *args = &wpa->ia.ap.args;
@@ -1858,18 +1797,8 @@  __acquires(fi->lock)
 
  out_free:
 	fi->writectr--;
-	rb_erase(&wpa->writepages_entry, &fi->writepages);
 	fuse_writepage_finish(wpa);
 	spin_unlock(&fi->lock);
-
-	/* After rb_erase() aux request list is private */
-	for (aux = wpa->next; aux; aux = next) {
-		next = aux->next;
-		aux->next = NULL;
-		fuse_writepage_finish_stat(aux->inode, aux->ia.ap.pages[0]);
-		fuse_writepage_free(aux);
-	}
-
 	fuse_writepage_free(wpa);
 	spin_lock(&fi->lock);
 }
@@ -1897,43 +1826,6 @@  __acquires(fi->lock)
 	}
 }
 
-static struct fuse_writepage_args *fuse_insert_writeback(struct rb_root *root,
-						struct fuse_writepage_args *wpa)
-{
-	pgoff_t idx_from = wpa->ia.write.in.offset >> PAGE_SHIFT;
-	pgoff_t idx_to = idx_from + wpa->ia.ap.num_pages - 1;
-	struct rb_node **p = &root->rb_node;
-	struct rb_node  *parent = NULL;
-
-	WARN_ON(!wpa->ia.ap.num_pages);
-	while (*p) {
-		struct fuse_writepage_args *curr;
-		pgoff_t curr_index;
-
-		parent = *p;
-		curr = rb_entry(parent, struct fuse_writepage_args,
-				writepages_entry);
-		WARN_ON(curr->inode != wpa->inode);
-		curr_index = curr->ia.write.in.offset >> PAGE_SHIFT;
-
-		if (idx_from >= curr_index + curr->ia.ap.num_pages)
-			p = &(*p)->rb_right;
-		else if (idx_to < curr_index)
-			p = &(*p)->rb_left;
-		else
-			return curr;
-	}
-
-	rb_link_node(&wpa->writepages_entry, parent, p);
-	rb_insert_color(&wpa->writepages_entry, root);
-	return NULL;
-}
-
-static void tree_insert(struct rb_root *root, struct fuse_writepage_args *wpa)
-{
-	WARN_ON(fuse_insert_writeback(root, wpa));
-}
-
 static void fuse_writepage_end(struct fuse_mount *fm, struct fuse_args *args,
 			       int error)
 {
@@ -1953,41 +1845,6 @@  static void fuse_writepage_end(struct fuse_mount *fm, struct fuse_args *args,
 	if (!fc->writeback_cache)
 		fuse_invalidate_attr_mask(inode, FUSE_STATX_MODIFY);
 	spin_lock(&fi->lock);
-	rb_erase(&wpa->writepages_entry, &fi->writepages);
-	while (wpa->next) {
-		struct fuse_mount *fm = get_fuse_mount(inode);
-		struct fuse_write_in *inarg = &wpa->ia.write.in;
-		struct fuse_writepage_args *next = wpa->next;
-
-		wpa->next = next->next;
-		next->next = NULL;
-		tree_insert(&fi->writepages, next);
-
-		/*
-		 * Skip fuse_flush_writepages() to make it easy to crop requests
-		 * based on primary request size.
-		 *
-		 * 1st case (trivial): there are no concurrent activities using
-		 * fuse_set/release_nowrite.  Then we're on safe side because
-		 * fuse_flush_writepages() would call fuse_send_writepage()
-		 * anyway.
-		 *
-		 * 2nd case: someone called fuse_set_nowrite and it is waiting
-		 * now for completion of all in-flight requests.  This happens
-		 * rarely and no more than once per page, so this should be
-		 * okay.
-		 *
-		 * 3rd case: someone (e.g. fuse_do_setattr()) is in the middle
-		 * of fuse_set_nowrite..fuse_release_nowrite section.  The fact
-		 * that fuse_set_nowrite returned implies that all in-flight
-		 * requests were completed along with all of their secondary
-		 * requests.  Further primary requests are blocked by negative
-		 * writectr.  Hence there cannot be any in-flight requests and
-		 * no invocations of fuse_writepage_end() while we're in
-		 * fuse_set_nowrite..fuse_release_nowrite section.
-		 */
-		fuse_send_writepage(fm, next, inarg->offset + inarg->size);
-	}
 	fi->writectr--;
 	fuse_writepage_finish(wpa);
 	spin_unlock(&fi->lock);
@@ -2074,19 +1931,18 @@  static void fuse_writepage_add_to_bucket(struct fuse_conn *fc,
 }
 
 static void fuse_writepage_args_page_fill(struct fuse_writepage_args *wpa, struct folio *folio,
-					  struct folio *tmp_folio, uint32_t page_index)
+					  uint32_t page_index)
 {
 	struct inode *inode = folio->mapping->host;
 	struct fuse_args_pages *ap = &wpa->ia.ap;
 
-	folio_copy(tmp_folio, folio);
-
-	ap->pages[page_index] = &tmp_folio->page;
+	folio_get(folio);
+	ap->pages[page_index] = &folio->page;
 	ap->descs[page_index].offset = 0;
 	ap->descs[page_index].length = PAGE_SIZE;
 
 	inc_wb_stat(&inode_to_bdi(inode)->wb, WB_WRITEBACK);
-	inc_node_page_state(&tmp_folio->page, NR_WRITEBACK_TEMP);
+	inc_node_page_state(&folio->page, NR_WRITEBACK);
 }
 
 static struct fuse_writepage_args *fuse_writepage_args_setup(struct folio *folio,
@@ -2121,18 +1977,12 @@  static int fuse_writepage_locked(struct folio *folio)
 	struct fuse_inode *fi = get_fuse_inode(inode);
 	struct fuse_writepage_args *wpa;
 	struct fuse_args_pages *ap;
-	struct folio *tmp_folio;
 	struct fuse_file *ff;
-	int error = -ENOMEM;
-
-	tmp_folio = folio_alloc(GFP_NOFS | __GFP_HIGHMEM, 0);
-	if (!tmp_folio)
-		goto err;
+	int error = -EIO;
 
-	error = -EIO;
 	ff = fuse_write_file_get(fi);
 	if (!ff)
-		goto err_nofile;
+		goto err;
 
 	wpa = fuse_writepage_args_setup(folio, ff);
 	error = -ENOMEM;
@@ -2143,22 +1993,17 @@  static int fuse_writepage_locked(struct folio *folio)
 	ap->num_pages = 1;
 
 	folio_start_writeback(folio);
-	fuse_writepage_args_page_fill(wpa, folio, tmp_folio, 0);
+	fuse_writepage_args_page_fill(wpa, folio, 0);
 
 	spin_lock(&fi->lock);
-	tree_insert(&fi->writepages, wpa);
 	list_add_tail(&wpa->queue_entry, &fi->queued_writes);
 	fuse_flush_writepages(inode);
 	spin_unlock(&fi->lock);
 
-	folio_end_writeback(folio);
-
 	return 0;
 
 err_writepage_args:
 	fuse_file_put(ff, false);
-err_nofile:
-	folio_put(tmp_folio);
 err:
 	mapping_set_error(folio->mapping, error);
 	return error;
@@ -2168,7 +2013,6 @@  struct fuse_fill_wb_data {
 	struct fuse_writepage_args *wpa;
 	struct fuse_file *ff;
 	struct inode *inode;
-	struct page **orig_pages;
 	unsigned int max_pages;
 };
 
@@ -2203,68 +2047,11 @@  static void fuse_writepages_send(struct fuse_fill_wb_data *data)
 	struct fuse_writepage_args *wpa = data->wpa;
 	struct inode *inode = data->inode;
 	struct fuse_inode *fi = get_fuse_inode(inode);
-	int num_pages = wpa->ia.ap.num_pages;
-	int i;
 
 	spin_lock(&fi->lock);
 	list_add_tail(&wpa->queue_entry, &fi->queued_writes);
 	fuse_flush_writepages(inode);
 	spin_unlock(&fi->lock);
-
-	for (i = 0; i < num_pages; i++)
-		end_page_writeback(data->orig_pages[i]);
-}
-
-/*
- * Check under fi->lock if the page is under writeback, and insert it onto the
- * rb_tree if not. Otherwise iterate auxiliary write requests, to see if there's
- * one already added for a page at this offset.  If there's none, then insert
- * this new request onto the auxiliary list, otherwise reuse the existing one by
- * swapping the new temp page with the old one.
- */
-static bool fuse_writepage_add(struct fuse_writepage_args *new_wpa,
-			       struct page *page)
-{
-	struct fuse_inode *fi = get_fuse_inode(new_wpa->inode);
-	struct fuse_writepage_args *tmp;
-	struct fuse_writepage_args *old_wpa;
-	struct fuse_args_pages *new_ap = &new_wpa->ia.ap;
-
-	WARN_ON(new_ap->num_pages != 0);
-	new_ap->num_pages = 1;
-
-	spin_lock(&fi->lock);
-	old_wpa = fuse_insert_writeback(&fi->writepages, new_wpa);
-	if (!old_wpa) {
-		spin_unlock(&fi->lock);
-		return true;
-	}
-
-	for (tmp = old_wpa->next; tmp; tmp = tmp->next) {
-		pgoff_t curr_index;
-
-		WARN_ON(tmp->inode != new_wpa->inode);
-		curr_index = tmp->ia.write.in.offset >> PAGE_SHIFT;
-		if (curr_index == page->index) {
-			WARN_ON(tmp->ia.ap.num_pages != 1);
-			swap(tmp->ia.ap.pages[0], new_ap->pages[0]);
-			break;
-		}
-	}
-
-	if (!tmp) {
-		new_wpa->next = old_wpa->next;
-		old_wpa->next = new_wpa;
-	}
-
-	spin_unlock(&fi->lock);
-
-	if (tmp) {
-		fuse_writepage_finish_stat(new_wpa->inode, new_ap->pages[0]);
-		fuse_writepage_free(new_wpa);
-	}
-
-	return false;
 }
 
 static bool fuse_writepage_need_send(struct fuse_conn *fc, struct page *page,
@@ -2273,15 +2060,6 @@  static bool fuse_writepage_need_send(struct fuse_conn *fc, struct page *page,
 {
 	WARN_ON(!ap->num_pages);
 
-	/*
-	 * Being under writeback is unlikely but possible.  For example direct
-	 * read to an mmaped fuse file will set the page dirty twice; once when
-	 * the pages are faulted with get_user_pages(), and then after the read
-	 * completed.
-	 */
-	if (fuse_page_is_writeback(data->inode, page->index))
-		return true;
-
 	/* Reached max pages */
 	if (ap->num_pages == fc->max_pages)
 		return true;
@@ -2291,7 +2069,7 @@  static bool fuse_writepage_need_send(struct fuse_conn *fc, struct page *page,
 		return true;
 
 	/* Discontinuity */
-	if (data->orig_pages[ap->num_pages - 1]->index + 1 != page->index)
+	if (ap->pages[ap->num_pages - 1]->index + 1 != page->index)
 		return true;
 
 	/* Need to grow the pages array?  If so, did the expansion fail? */
@@ -2308,9 +2086,7 @@  static int fuse_writepages_fill(struct folio *folio,
 	struct fuse_writepage_args *wpa = data->wpa;
 	struct fuse_args_pages *ap = &wpa->ia.ap;
 	struct inode *inode = data->inode;
-	struct fuse_inode *fi = get_fuse_inode(inode);
 	struct fuse_conn *fc = get_fuse_conn(inode);
-	struct folio *tmp_folio;
 	int err;
 
 	if (wpa && fuse_writepage_need_send(fc, &folio->page, ap, data)) {
@@ -2318,54 +2094,23 @@  static int fuse_writepages_fill(struct folio *folio,
 		data->wpa = NULL;
 	}
 
-	err = -ENOMEM;
-	tmp_folio = folio_alloc(GFP_NOFS | __GFP_HIGHMEM, 0);
-	if (!tmp_folio)
-		goto out_unlock;
-
-	/*
-	 * The page must not be redirtied until the writeout is completed
-	 * (i.e. userspace has sent a reply to the write request).  Otherwise
-	 * there could be more than one temporary page instance for each real
-	 * page.
-	 *
-	 * This is ensured by holding the page lock in page_mkwrite() while
-	 * checking fuse_page_is_writeback().  We already hold the page lock
-	 * since clear_page_dirty_for_io() and keep it held until we add the
-	 * request to the fi->writepages list and increment ap->num_pages.
-	 * After this fuse_page_is_writeback() will indicate that the page is
-	 * under writeback, so we can release the page lock.
-	 */
 	if (data->wpa == NULL) {
 		err = -ENOMEM;
 		wpa = fuse_writepage_args_setup(folio, data->ff);
-		if (!wpa) {
-			folio_put(tmp_folio);
+		if (!wpa)
 			goto out_unlock;
-		}
 		fuse_file_get(wpa->ia.ff);
 		data->max_pages = 1;
 		ap = &wpa->ia.ap;
 	}
 	folio_start_writeback(folio);
 
-	fuse_writepage_args_page_fill(wpa, folio, tmp_folio, ap->num_pages);
-	data->orig_pages[ap->num_pages] = &folio->page;
+	fuse_writepage_args_page_fill(wpa, folio, ap->num_pages);
 
 	err = 0;
-	if (data->wpa) {
-		/*
-		 * Protected by fi->lock against concurrent access by
-		 * fuse_page_is_writeback().
-		 */
-		spin_lock(&fi->lock);
-		ap->num_pages++;
-		spin_unlock(&fi->lock);
-	} else if (fuse_writepage_add(wpa, &folio->page)) {
+	ap->num_pages++;
+	if (!data->wpa)
 		data->wpa = wpa;
-	} else {
-		folio_end_writeback(folio);
-	}
 out_unlock:
 	folio_unlock(folio);
 
@@ -2394,21 +2139,12 @@  static int fuse_writepages(struct address_space *mapping,
 	if (!data.ff)
 		return -EIO;
 
-	err = -ENOMEM;
-	data.orig_pages = kcalloc(fc->max_pages,
-				  sizeof(struct page *),
-				  GFP_NOFS);
-	if (!data.orig_pages)
-		goto out;
-
 	err = write_cache_pages(mapping, wbc, fuse_writepages_fill, &data);
 	if (data.wpa) {
 		WARN_ON(!data.wpa->ia.ap.num_pages);
 		fuse_writepages_send(&data);
 	}
 
-	kfree(data.orig_pages);
-out:
 	fuse_file_put(data.ff, false);
 	return err;
 }
@@ -2433,7 +2169,7 @@  static int fuse_write_begin(struct file *file, struct address_space *mapping,
 	if (IS_ERR(folio))
 		goto error;
 
-	fuse_wait_on_page_writeback(mapping->host, folio->index);
+	folio_wait_writeback(folio);
 
 	if (folio_test_uptodate(folio) || len >= folio_size(folio))
 		goto success;
@@ -2497,13 +2233,11 @@  static int fuse_launder_folio(struct folio *folio)
 {
 	int err = 0;
 	if (folio_clear_dirty_for_io(folio)) {
-		struct inode *inode = folio->mapping->host;
-
 		/* Serialize with pending writeback for the same page */
-		fuse_wait_on_page_writeback(inode, folio->index);
+		folio_wait_writeback(folio);
 		err = fuse_writepage_locked(folio);
 		if (!err)
-			fuse_wait_on_page_writeback(inode, folio->index);
+			folio_wait_writeback(folio);
 	}
 	return err;
 }
@@ -2547,7 +2281,7 @@  static vm_fault_t fuse_page_mkwrite(struct vm_fault *vmf)
 		return VM_FAULT_NOPAGE;
 	}
 
-	fuse_wait_on_page_writeback(inode, page->index);
+	folio_wait_writeback(page_folio(page));
 	return VM_FAULT_LOCKED;
 }
 
@@ -3368,6 +3102,7 @@  void fuse_init_file_inode(struct inode *inode, unsigned int flags)
 
 	inode->i_fop = &fuse_file_operations;
 	inode->i_data.a_ops = &fuse_file_aops;
+	mapping_set_no_writeback_reclaim(&inode->i_data);
 
 	INIT_LIST_HEAD(&fi->write_files);
 	INIT_LIST_HEAD(&fi->queued_writes);
@@ -3375,7 +3110,6 @@  void fuse_init_file_inode(struct inode *inode, unsigned int flags)
 	fi->iocachectr = 0;
 	init_waitqueue_head(&fi->page_waitq);
 	init_waitqueue_head(&fi->direct_io_waitq);
-	fi->writepages = RB_ROOT;
 
 	if (IS_ENABLED(CONFIG_FUSE_DAX))
 		fuse_dax_inode_init(inode, flags);