Message ID | 20241014182228.1941246-3-joannelkoong@gmail.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | fuse: remove extra page copies in writeback | expand |
On Mon, 14 Oct 2024 at 20:23, Joanne Koong <joannelkoong@gmail.com> wrote: > This change sets AS_NO_WRITEBACK_RECLAIM on the inode mapping so that > FUSE folios are not reclaimed and waited on while in writeback, and > removes the temporary folio + extra copying and the internal rb tree. What about sync(2)? And page migration? Hopefully there are no other cases, but I think a careful review of places where generic code waits for writeback is needed before we can say for sure. Thanks, Miklos
On Tue, Oct 15, 2024 at 3:01 AM Miklos Szeredi <miklos@szeredi.hu> wrote: > > On Mon, 14 Oct 2024 at 20:23, Joanne Koong <joannelkoong@gmail.com> wrote: > > > This change sets AS_NO_WRITEBACK_RECLAIM on the inode mapping so that > > FUSE folios are not reclaimed and waited on while in writeback, and > > removes the temporary folio + extra copying and the internal rb tree. > > What about sync(2)? And page migration? > > Hopefully there are no other cases, but I think a careful review of > places where generic code waits for writeback is needed before we can > say for sure. Sounds good, that's a great point. I'll audit the other places in the mm code where we might call folio_wait_writeback() and report back what I find. Thanks, Joanne > > Thanks, > Miklos
On Tue, Oct 15, 2024 at 10:06:52AM GMT, Joanne Koong wrote: > On Tue, Oct 15, 2024 at 3:01 AM Miklos Szeredi <miklos@szeredi.hu> wrote: > > > > On Mon, 14 Oct 2024 at 20:23, Joanne Koong <joannelkoong@gmail.com> wrote: > > > > > This change sets AS_NO_WRITEBACK_RECLAIM on the inode mapping so that > > > FUSE folios are not reclaimed and waited on while in writeback, and > > > removes the temporary folio + extra copying and the internal rb tree. > > > > What about sync(2)? And page migration? > > > > Hopefully there are no other cases, but I think a careful review of > > places where generic code waits for writeback is needed before we can > > say for sure. > > Sounds good, that's a great point. I'll audit the other places in the > mm code where we might call folio_wait_writeback() and report back > what I find. > > So, any operation that the fuse server can do which can cause wait on writeback on the folios backed by the fuse is problematic. We know about one scenario i.e. memory allocation causing reclaim which may do the wait on unrelated folios which may be backed by the fuse server. Now there are ways fuse server can shoot itself on the foot. Like sync() syscall or accessing the folios backed by itself. The quesion is should we really need to protect fuse from such cases?
On 10/16/24 3:17 AM, Shakeel Butt wrote: > On Tue, Oct 15, 2024 at 10:06:52AM GMT, Joanne Koong wrote: >> On Tue, Oct 15, 2024 at 3:01 AM Miklos Szeredi <miklos@szeredi.hu> wrote: >>> >>> On Mon, 14 Oct 2024 at 20:23, Joanne Koong <joannelkoong@gmail.com> wrote: >>> >>>> This change sets AS_NO_WRITEBACK_RECLAIM on the inode mapping so that >>>> FUSE folios are not reclaimed and waited on while in writeback, and >>>> removes the temporary folio + extra copying and the internal rb tree. >>> >>> What about sync(2)? And page migration? >>> >>> Hopefully there are no other cases, but I think a careful review of >>> places where generic code waits for writeback is needed before we can >>> say for sure. >> >> Sounds good, that's a great point. I'll audit the other places in the >> mm code where we might call folio_wait_writeback() and report back >> what I find. >> >> > > So, any operation that the fuse server can do which can cause wait on > writeback on the folios backed by the fuse is problematic. We know about > one scenario i.e. memory allocation causing reclaim which may do the > wait on unrelated folios which may be backed by the fuse server. > > Now there are ways fuse server can shoot itself on the foot. Like sync() > syscall or accessing the folios backed by itself. The quesion is should > we really need to protect fuse from such cases? I think there are several scenarios shall be evaluated case by case: 1) a non-malicious fuse daemon wants to allocate some memory when processing a fuse request, which in turn leads to memory reclaim and thus waiting on the writeback of fuse dirty pages - a deadlock here. This is reasonable and also the target scenario that this series wants to fix. 2) a malicious fuse daemon refuses to process any request; or a buggy or not well-written fuse daemon as you described, e.g. may call sync(2) itself or access page cache backed by itself, then 2.1) any unrelated user process attempting to initiate a sync(2) itself, will hang there. This scenario is also unexpected and shall be fixed. 2.2) any direct user of fuse filesystem (e.g. access files backed by fuse filesystem) will hang in this case. IMHO this is expected, and the impact is affordable as it is controlled within certain range (the direct user of the fs).
On Tue, 15 Oct 2024 at 21:17, Shakeel Butt <shakeel.butt@linux.dev> wrote: > So, any operation that the fuse server can do which can cause wait on > writeback on the folios backed by the fuse is problematic. We know about > one scenario i.e. memory allocation causing reclaim which may do the > wait on unrelated folios which may be backed by the fuse server. > > Now there are ways fuse server can shoot itself on the foot. Like sync() > syscall or accessing the folios backed by itself. The quesion is should > we really need to protect fuse from such cases? That's not the issue with sync(2). The issue is that a fuse server can deny service to an unrelated and possibly higher privilege task by blocking writeback. We really don't want that to happen. Thanks, Miklos
On 10/15/24 6:01 PM, Miklos Szeredi wrote: > On Mon, 14 Oct 2024 at 20:23, Joanne Koong <joannelkoong@gmail.com> wrote: > >> This change sets AS_NO_WRITEBACK_RECLAIM on the inode mapping so that >> FUSE folios are not reclaimed and waited on while in writeback, and >> removes the temporary folio + extra copying and the internal rb tree. > > What about sync(2)? FYI The posix sync(2) says, "The writing, although scheduled, is not necessarily complete upon return from sync()." [1] Thus hopefully it won't break the posix semantics of sync(2) down if we skip the waiting on the writeback of fuse pages. [1] https://pubs.opengroup.org/onlinepubs/9699919799/functions/sync.html
On Wed, 16 Oct 2024 at 11:44, Jingbo Xu <jefflexu@linux.alibaba.com> wrote: > 1) a non-malicious fuse daemon wants to allocate some memory when > processing a fuse request, which in turn leads to memory reclaim and > thus waiting on the writeback of fuse dirty pages - a deadlock here. > This is reasonable and also the target scenario that this series wants > to fix. > > 2) a malicious fuse daemon refuses to process any request; or a buggy or > not well-written fuse daemon as you described, e.g. may call sync(2) > itself or access page cache backed by itself, then > 2.1) any unrelated user process attempting to initiate a sync(2) > itself, will hang there. This scenario is also unexpected and shall be > fixed. Exactly. We only care about - properly written server not deadlocking - buggy or malicious server not denying service to unrelated tasks, where unrelated means it would not otherwise be able to access the fuse mount. Typically this separation is done with a user namespace or -oallow_other. Thanks, Miklos > 2.2) any direct user of fuse filesystem (e.g. access files backed by > fuse filesystem) will hang in this case. IMHO this is expected, and the > impact is affordable as it is controlled within certain range (the > direct user of the fs). > > -- > Thanks, > Jingbo
On Wed, 16 Oct 2024 at 11:56, Jingbo Xu <jefflexu@linux.alibaba.com> wrote: > FYI The posix sync(2) says, "The writing, although scheduled, is not > necessarily complete upon return from sync()." [1] > > Thus hopefully it won't break the posix semantics of sync(2) down if we > skip the waiting on the writeback of fuse pages. Since this has been the case for fuse since day one and no one complained, this is not a worry. Thanks, Miklos
On Wed, Oct 16, 2024 at 11:51:39AM GMT, Miklos Szeredi wrote: > On Tue, 15 Oct 2024 at 21:17, Shakeel Butt <shakeel.butt@linux.dev> wrote: > > > So, any operation that the fuse server can do which can cause wait on > > writeback on the folios backed by the fuse is problematic. We know about > > one scenario i.e. memory allocation causing reclaim which may do the > > wait on unrelated folios which may be backed by the fuse server. > > > > Now there are ways fuse server can shoot itself on the foot. Like sync() > > syscall or accessing the folios backed by itself. The quesion is should > > we really need to protect fuse from such cases? > > That's not the issue with sync(2). The issue is that a fuse server > can deny service to an unrelated and possibly higher privilege task by > blocking writeback. We really don't want that to happen. If I understand you correctly, you are saying fuse server doing wrong things like accessing the files it is serving is not something we need to care about. More specifically all the operations which directly manipulates the folios it is serving (like migration) should be ignored. Is this correct?
On Wed, 16 Oct 2024 at 19:52, Shakeel Butt <shakeel.butt@linux.dev> wrote: > If I understand you correctly, you are saying fuse server doing wrong > things like accessing the files it is serving is not something we need > to care about. I don't think detecting such recursion is feasible or even generally possible. > More specifically all the operations which directly > manipulates the folios it is serving (like migration) should be ignored. > Is this correct? Um, not sure I understand. If migration can be triggered on fuse folios that results in the task that triggered the migration to wait on the fuse folio, then that's bad. Ignoring fuse folios is the only solution that I can see, and that's basically what the current temp page copy does. Sprinkling mm code with fuse specific conditionals seems the only solution if we want to get rid of the temp page thing. Hopefully there aren't too many of those. Thanks, Miklos
On Wed, Oct 16, 2024 at 08:37:12PM GMT, Miklos Szeredi wrote: > On Wed, 16 Oct 2024 at 19:52, Shakeel Butt <shakeel.butt@linux.dev> wrote: > > > If I understand you correctly, you are saying fuse server doing wrong > > things like accessing the files it is serving is not something we need > > to care about. > > I don't think detecting such recursion is feasible or even generally possible. > > > More specifically all the operations which directly > > manipulates the folios it is serving (like migration) should be ignored. > > Is this correct? > > Um, not sure I understand. If migration can be triggered on fuse > folios that results in the task that triggered the migration to wait > on the fuse folio, then that's bad. Why is it bad? I can understand fuse server getting blocked on fuse folios is bad but why it is bad for other applications/tasks? I am wondering network filesystems have to handle similar situation then why is it bad just for fuse? > Ignoring fuse folios is the only > solution that I can see, and that's basically what the current temp > page copy does. > > Sprinkling mm code with fuse specific conditionals seems the only > solution if we want to get rid of the temp page thing. Hopefully > there aren't too many of those. It might be a bit more than sprinkling. The reclaim code has to activate the folio to avoid reclaiming the folio in near future. I am not sure what we will need to do for move_pages() syscall.
On Wed, 16 Oct 2024 at 23:27, Shakeel Butt <shakeel.butt@linux.dev> wrote: > Why is it bad? I can understand fuse server getting blocked on fuse > folios is bad but why it is bad for other applications/tasks? I am > wondering network filesystems have to handle similar situation then why > is it bad just for fuse? You need privileges (physical access) to unplug the network cable. You don't need privileges (in most setups) to run a fuse server. > It might be a bit more than sprinkling. The reclaim code has to activate > the folio to avoid reclaiming the folio in near future. I am not sure > what we will need to do for move_pages() syscall. Maybe move_pages() is okay, because it is explicitly targeting a fuse mmap. Is this the only way to trigger MIGRATE_SYNC? Thanks, Miklos
On Tue, Oct 15, 2024 at 3:01 AM Miklos Szeredi <miklos@szeredi.hu> wrote: > > On Mon, 14 Oct 2024 at 20:23, Joanne Koong <joannelkoong@gmail.com> wrote: > > > This change sets AS_NO_WRITEBACK_RECLAIM on the inode mapping so that > > FUSE folios are not reclaimed and waited on while in writeback, and > > removes the temporary folio + extra copying and the internal rb tree. > > What about sync(2)? And page migration? > > Hopefully there are no other cases, but I think a careful review of > places where generic code waits for writeback is needed before we can > say for sure. The places where I see this potential deadlock still being possible are: * page migration when handling a page fault: In particular, this path: handle_mm_fault() -> __handle_mm_fault() -> handle_pte_fault() -> do_numa_page() -> migrate_misplaced_folio() -> migrate_pages() -> migrate_pages_sync() -> migrate_pages_batch() -> migrate_folio_unmap() -> folio_wait_writeback() * syscalls that trigger waits on writeback, which will lead to deadlock if a single-threaded fuse server calls this when servicing requests: - sync(), sync_file_range(), fsync(), fdatasync() - swapoff() - move_pages() I need to analyze the page fault path more to get a clearer picture of what is happening, but so far this looks like a valid case for a correctly written fuse server to run into. For the syscalls however, is it valid/safe in general (disregarding the writeback deadlock scenario for a minute) for fuse servers to be invoking these syscalls in their handlers anyways? The other places where I see a generic wait on writeback seem safe: * splice, page_cache_pipe_buf_try_steal() (fs/splice.c): We hit this in fuse when we try to move a page from the pipe buffer into the page cache (fuse_try_move_page()) for the SPLICE_F_MOVE case. This wait seems fine, since the folio that's being waited on is the folio in the pipe buffer which is not a fuse folio. * memory failure (mm/memory_failure.c): Soft offlining a page and handling page memory failure - these can be triggered asynchronously (memory_failure_work_func()), but this should be fine for the fuse use case since the server isn't blocked on servicing any writeback requests while memory failure handling is waiting on writeback * page truncation (mm/truncate.c): Same here. These cases seem fine since the server isn't blocked on servicing writeback requests while truncation waits on writeback Thanks, Joanne > > Thanks, > Miklos
On Thu, Oct 17, 2024 at 03:31:48PM GMT, Miklos Szeredi wrote: > On Wed, 16 Oct 2024 at 23:27, Shakeel Butt <shakeel.butt@linux.dev> wrote: > > > Why is it bad? I can understand fuse server getting blocked on fuse > > folios is bad but why it is bad for other applications/tasks? I am > > wondering network filesystems have to handle similar situation then why > > is it bad just for fuse? > > You need privileges (physical access) to unplug the network cable. > You don't need privileges (in most setups) to run a fuse server. I feel like this is too much restrictive and I am still not sure why blocking on fuse folios served by non-privileges fuse server is worse than blocking on folios served from the network. > > > It might be a bit more than sprinkling. The reclaim code has to activate > > the folio to avoid reclaiming the folio in near future. I am not sure > > what we will need to do for move_pages() syscall. > > Maybe move_pages() is okay, because it is explicitly targeting a fuse > mmap. Is this the only way to trigger MIGRATE_SYNC? It can happen through migrate_pages(), mbind(), compaction which can caused by hugepage allcations.
On Thu, Oct 17, 2024 at 06:30:08PM GMT, Joanne Koong wrote: > On Tue, Oct 15, 2024 at 3:01 AM Miklos Szeredi <miklos@szeredi.hu> wrote: > > > > On Mon, 14 Oct 2024 at 20:23, Joanne Koong <joannelkoong@gmail.com> wrote: > > > > > This change sets AS_NO_WRITEBACK_RECLAIM on the inode mapping so that > > > FUSE folios are not reclaimed and waited on while in writeback, and > > > removes the temporary folio + extra copying and the internal rb tree. > > > > What about sync(2)? And page migration? > > > > Hopefully there are no other cases, but I think a careful review of > > places where generic code waits for writeback is needed before we can > > say for sure. > > The places where I see this potential deadlock still being possible are: > * page migration when handling a page fault: > In particular, this path: handle_mm_fault() -> > __handle_mm_fault() -> handle_pte_fault() -> do_numa_page() -> > migrate_misplaced_folio() -> migrate_pages() -> migrate_pages_sync() > -> migrate_pages_batch() -> migrate_folio_unmap() -> > folio_wait_writeback() So, this is numa fault and if fuse server is not mapping the fuse folios which it is serving, in its address space then this is not an issue. However hugepage allocation on page fault can cause compaction which might migrate unrelated fuse folios. So, fuse server doing compaction is an issue and we need to resolve similar to reclaim codepath. (Though I think for THP it is not doing MIGRATE_SYNC but doing for gigantic hugetlb pages). > * syscalls that trigger waits on writeback, which will lead to > deadlock if a single-threaded fuse server calls this when servicing > requests: > - sync(), sync_file_range(), fsync(), fdatasync() > - swapoff() > - move_pages() > > I need to analyze the page fault path more to get a clearer picture of > what is happening, but so far this looks like a valid case for a > correctly written fuse server to run into. > For the syscalls however, is it valid/safe in general (disregarding > the writeback deadlock scenario for a minute) for fuse servers to be > invoking these syscalls in their handlers anyways? > > The other places where I see a generic wait on writeback seem safe: > * splice, page_cache_pipe_buf_try_steal() (fs/splice.c): > We hit this in fuse when we try to move a page from the pipe buffer > into the page cache (fuse_try_move_page()) for the SPLICE_F_MOVE case. > This wait seems fine, since the folio that's being waited on is the > folio in the pipe buffer which is not a fuse folio. > * memory failure (mm/memory_failure.c): > Soft offlining a page and handling page memory failure - these can > be triggered asynchronously (memory_failure_work_func()), but this > should be fine for the fuse use case since the server isn't blocked on > servicing any writeback requests while memory failure handling is > waiting on writeback > * page truncation (mm/truncate.c): > Same here. These cases seem fine since the server isn't blocked on > servicing writeback requests while truncation waits on writeback > > > Thanks, > Joanne > > > > > Thanks, > > Miklos
On 10/15/24 2:22 AM, Joanne Koong wrote: > static void fuse_writepage_args_page_fill(struct fuse_writepage_args *wpa, struct folio *folio, > - struct folio *tmp_folio, uint32_t page_index) > + uint32_t page_index) > { > struct inode *inode = folio->mapping->host; > struct fuse_args_pages *ap = &wpa->ia.ap; > > - folio_copy(tmp_folio, folio); > - > - ap->pages[page_index] = &tmp_folio->page; > + folio_get(folio); Why folio_get() is needed? > > @@ -2203,68 +2047,11 @@ static void fuse_writepages_send(struct fuse_fill_wb_data *data) > struct fuse_writepage_args *wpa = data->wpa; > struct inode *inode = data->inode; > struct fuse_inode *fi = get_fuse_inode(inode); > - int num_pages = wpa->ia.ap.num_pages; > - int i; > > spin_lock(&fi->lock); > list_add_tail(&wpa->queue_entry, &fi->queued_writes); Could we also drop fi->queued_writes list and writectr counter after we dropping the temp folio? I'm not sure. When I grep all callsites of fuse_set_nowrite(), it seems that apart from the direct IO path(fuse_direct_io -> fuse_sync_writes), the truncation related code also relies on this writectr mechanism.
On Thu, Oct 17, 2024 at 10:57 PM Shakeel Butt <shakeel.butt@linux.dev> wrote: > > On Thu, Oct 17, 2024 at 06:30:08PM GMT, Joanne Koong wrote: > > On Tue, Oct 15, 2024 at 3:01 AM Miklos Szeredi <miklos@szeredi.hu> wrote: > > > > > > On Mon, 14 Oct 2024 at 20:23, Joanne Koong <joannelkoong@gmail.com> wrote: > > > > > > > This change sets AS_NO_WRITEBACK_RECLAIM on the inode mapping so that > > > > FUSE folios are not reclaimed and waited on while in writeback, and > > > > removes the temporary folio + extra copying and the internal rb tree. > > > > > > What about sync(2)? And page migration? > > > > > > Hopefully there are no other cases, but I think a careful review of > > > places where generic code waits for writeback is needed before we can > > > say for sure. > > > > The places where I see this potential deadlock still being possible are: > > * page migration when handling a page fault: > > In particular, this path: handle_mm_fault() -> > > __handle_mm_fault() -> handle_pte_fault() -> do_numa_page() -> > > migrate_misplaced_folio() -> migrate_pages() -> migrate_pages_sync() > > -> migrate_pages_batch() -> migrate_folio_unmap() -> > > folio_wait_writeback() > > So, this is numa fault and if fuse server is not mapping the fuse folios > which it is serving, in its address space then this is not an issue. > However hugepage allocation on page fault can cause compaction which > might migrate unrelated fuse folios. So, fuse server doing compaction > is an issue and we need to resolve similar to reclaim codepath. (Though > I think for THP it is not doing MIGRATE_SYNC but doing for gigantic > hugetlb pages). Thanks for the explanation. Would you mind pointing me to the compaction function where this triggers the migrate? Is this in compact_zone() where it calls migrate_pages() on the cc->migratepages list? > > > * syscalls that trigger waits on writeback, which will lead to > > deadlock if a single-threaded fuse server calls this when servicing > > requests: > > - sync(), sync_file_range(), fsync(), fdatasync() > > - swapoff() > > - move_pages() > > > > I need to analyze the page fault path more to get a clearer picture of > > what is happening, but so far this looks like a valid case for a > > correctly written fuse server to run into. > > For the syscalls however, is it valid/safe in general (disregarding > > the writeback deadlock scenario for a minute) for fuse servers to be > > invoking these syscalls in their handlers anyways? > > > > The other places where I see a generic wait on writeback seem safe: > > * splice, page_cache_pipe_buf_try_steal() (fs/splice.c): > > We hit this in fuse when we try to move a page from the pipe buffer > > into the page cache (fuse_try_move_page()) for the SPLICE_F_MOVE case. > > This wait seems fine, since the folio that's being waited on is the > > folio in the pipe buffer which is not a fuse folio. > > * memory failure (mm/memory_failure.c): > > Soft offlining a page and handling page memory failure - these can > > be triggered asynchronously (memory_failure_work_func()), but this > > should be fine for the fuse use case since the server isn't blocked on > > servicing any writeback requests while memory failure handling is > > waiting on writeback > > * page truncation (mm/truncate.c): > > Same here. These cases seem fine since the server isn't blocked on > > servicing writeback requests while truncation waits on writeback > > > > > > Thanks, > > Joanne > > > > > > > > Thanks, > > > Miklos
On Fri, Oct 18, 2024 at 2:24 AM Jingbo Xu <jefflexu@linux.alibaba.com> wrote: > > > On 10/15/24 2:22 AM, Joanne Koong wrote: > > static void fuse_writepage_args_page_fill(struct fuse_writepage_args *wpa, struct folio *folio, > > - struct folio *tmp_folio, uint32_t page_index) > > + uint32_t page_index) > > { > > struct inode *inode = folio->mapping->host; > > struct fuse_args_pages *ap = &wpa->ia.ap; > > > > - folio_copy(tmp_folio, folio); > > - > > - ap->pages[page_index] = &tmp_folio->page; > > + folio_get(folio); > > Why folio_get() is needed? > It's not strictly needed but I added this to ensure that the folio will always be valid when we later on access it. My understanding is that it's an implementation detail rather than a guarantee that the folio can't be removed from the mapping until writeback finishes. > > > > > @@ -2203,68 +2047,11 @@ static void fuse_writepages_send(struct fuse_fill_wb_data *data) > > struct fuse_writepage_args *wpa = data->wpa; > > struct inode *inode = data->inode; > > struct fuse_inode *fi = get_fuse_inode(inode); > > - int num_pages = wpa->ia.ap.num_pages; > > - int i; > > > > spin_lock(&fi->lock); > > list_add_tail(&wpa->queue_entry, &fi->queued_writes); > > Could we also drop fi->queued_writes list and writectr counter after we > dropping the temp folio? I'm not sure. When I grep all callsites of > fuse_set_nowrite(), it seems that apart from the direct IO > path(fuse_direct_io -> fuse_sync_writes), the truncation related code > also relies on this writectr mechanism. I'm not sure either, but I think we still need it. My understanding is that the main purpose of this writectr is to detect when the inode has pending writebacks. I think even without the temporary pages, we still would need a way overall to track if the inode is writing back any folio in its mapping. Thanks, Joanne > > > > -- > Thanks, > Jingbo
On Fri, Oct 18, 2024 at 12:57:08PM GMT, Joanne Koong wrote: > On Thu, Oct 17, 2024 at 10:57 PM Shakeel Butt <shakeel.butt@linux.dev> wrote: > > > > On Thu, Oct 17, 2024 at 06:30:08PM GMT, Joanne Koong wrote: > > > On Tue, Oct 15, 2024 at 3:01 AM Miklos Szeredi <miklos@szeredi.hu> wrote: > > > > > > > > On Mon, 14 Oct 2024 at 20:23, Joanne Koong <joannelkoong@gmail.com> wrote: > > > > > > > > > This change sets AS_NO_WRITEBACK_RECLAIM on the inode mapping so that > > > > > FUSE folios are not reclaimed and waited on while in writeback, and > > > > > removes the temporary folio + extra copying and the internal rb tree. > > > > > > > > What about sync(2)? And page migration? > > > > > > > > Hopefully there are no other cases, but I think a careful review of > > > > places where generic code waits for writeback is needed before we can > > > > say for sure. > > > > > > The places where I see this potential deadlock still being possible are: > > > * page migration when handling a page fault: > > > In particular, this path: handle_mm_fault() -> > > > __handle_mm_fault() -> handle_pte_fault() -> do_numa_page() -> > > > migrate_misplaced_folio() -> migrate_pages() -> migrate_pages_sync() > > > -> migrate_pages_batch() -> migrate_folio_unmap() -> > > > folio_wait_writeback() > > > > So, this is numa fault and if fuse server is not mapping the fuse folios > > which it is serving, in its address space then this is not an issue. > > However hugepage allocation on page fault can cause compaction which > > might migrate unrelated fuse folios. So, fuse server doing compaction > > is an issue and we need to resolve similar to reclaim codepath. (Though > > I think for THP it is not doing MIGRATE_SYNC but doing for gigantic > > hugetlb pages). > > Thanks for the explanation. Would you mind pointing me to the > compaction function where this triggers the migrate? Is this in > compact_zone() where it calls migrate_pages() on the cc->migratepages > list? > Something like the following: __alloc_pages_direct_compact() -> try_to_compact_pages() -> compact_zone_order() -> /* MIGRATE_ASYNC or MIGRATE_SYNC_LIGHT */ compact_zone() -> migrate_pages() -> migrate_pages_sync() -> migrate_pages_batch() -> migrate_folio_unmap() -> folio_wait_writeback() The following is one code path from hugetlb: alloc_contig_range_noprof() -> /* MIGRATE_SYNC */ __alloc_contig_migrate_range() -> migrate_pages() -> migrate_pages_sync() -> migrate_pages_batch() -> migrate_folio_unmap() -> folio_wait_writeback()
On Fri, 18 Oct 2024 at 03:30, Joanne Koong <joannelkoong@gmail.com> wrote: > I need to analyze the page fault path more to get a clearer picture of > what is happening, but so far this looks like a valid case for a > correctly written fuse server to run into. Yes. > For the syscalls however, is it valid/safe in general (disregarding > the writeback deadlock scenario for a minute) for fuse servers to be > invoking these syscalls in their handlers anyways? Generally no. Any kind of recursion in fuse is a landmine. E.g. CVE-2019-20794 was created to track an issue with a fuse server going unkillable on namespace shutdown. It didn't occur to the reporter that this is just a special case of a plain old recursive deadlock, because it happens to be triggered by kill. But recursion is clearly there: there's a file descriptor referring to the same fuse mount that is being served. When this fd is closed at process exit the recursion is triggered and the thing deadlocks. The fix: move the recursive part of the code to a different process. But people seem to believe that recursion is okay and the kernel should deal with that :-/ > The other places where I see a generic wait on writeback seem safe: > * splice, page_cache_pipe_buf_try_steal() (fs/splice.c): > We hit this in fuse when we try to move a page from the pipe buffer > into the page cache (fuse_try_move_page()) for the SPLICE_F_MOVE case. > This wait seems fine, since the folio that's being waited on is the > folio in the pipe buffer which is not a fuse folio. > * memory failure (mm/memory_failure.c): > Soft offlining a page and handling page memory failure - these can > be triggered asynchronously (memory_failure_work_func()), but this > should be fine for the fuse use case since the server isn't blocked on > servicing any writeback requests while memory failure handling is > waiting on writeback > * page truncation (mm/truncate.c): > Same here. These cases seem fine since the server isn't blocked on > servicing writeback requests while truncation waits on writeback Right. Thanks, Miklos
On Fri, 18 Oct 2024 at 07:31, Shakeel Butt <shakeel.butt@linux.dev> wrote: > I feel like this is too much restrictive and I am still not sure why > blocking on fuse folios served by non-privileges fuse server is worse > than blocking on folios served from the network. Might be. But historically fuse had this behavior and I'd be very reluctant to change that unconditionally. With a systemwide maximal timeout for fuse requests it might make sense to allow sync(2), etc. to wait for fuse writeback. Without a timeout allowing fuse servers to block sync(2) indefinitely seems rather risky. Thanks, Miklos
On Mon, Oct 21, 2024 at 12:15:36PM GMT, Miklos Szeredi wrote: > On Fri, 18 Oct 2024 at 07:31, Shakeel Butt <shakeel.butt@linux.dev> wrote: > > > I feel like this is too much restrictive and I am still not sure why > > blocking on fuse folios served by non-privileges fuse server is worse > > than blocking on folios served from the network. > > Might be. But historically fuse had this behavior and I'd be very > reluctant to change that unconditionally. > > With a systemwide maximal timeout for fuse requests it might make > sense to allow sync(2), etc. to wait for fuse writeback. > > Without a timeout allowing fuse servers to block sync(2) indefinitely > seems rather risky. > Thanks Miklos for the response. Just to be clear on where we disagree, let me point out what I think is right and please tell me where you disagree: 1. Fuse server should never access fuse folios (and files, directories, mounts, etc) directly it is providing. 2. Fuse server should not get blocked indirectly on the fuse folios (and related objects). This series is removing one such scenario caused due to reclaim. 3. Non fuse server processes can be blocked on fuse folios (and related objects) directly and indirectly. Am I understanding correctly that we disagree on (3)? thanks, Shakeel
On Mon, Oct 21, 2024 at 3:15 AM Miklos Szeredi <miklos@szeredi.hu> wrote: > > On Fri, 18 Oct 2024 at 07:31, Shakeel Butt <shakeel.butt@linux.dev> wrote: > > > I feel like this is too much restrictive and I am still not sure why > > blocking on fuse folios served by non-privileges fuse server is worse > > than blocking on folios served from the network. > > Might be. But historically fuse had this behavior and I'd be very > reluctant to change that unconditionally. > > With a systemwide maximal timeout for fuse requests it might make > sense to allow sync(2), etc. to wait for fuse writeback. > > Without a timeout allowing fuse servers to block sync(2) indefinitely > seems rather risky. Could we skip waiting on writeback in sync(2) if it's a fuse folio? That seems in line with the sync(2) documentation Jingbo referenced earlier where it states "The writing, although scheduled, is not necessarily complete upon return from sync()." https://pubs.opengroup.org/onlinepubs/9699919799/functions/sync.html Thanks, Joanne > > Thanks, > Miklos
On Mon, 21 Oct 2024 at 19:01, Shakeel Butt <shakeel.butt@linux.dev> wrote: > Thanks Miklos for the response. Just to be clear on where we disagree, let > me point out what I think is right and please tell me where you > disagree: > > 1. Fuse server should never access fuse folios (and files, directories, > mounts, etc) directly it is providing. Correct. > 2. Fuse server should not get blocked indirectly on the fuse folios (and > related objects). This series is removing one such scenario caused > due to reclaim. Correct. > > 3. Non fuse server processes can be blocked on fuse folios (and related > objects) directly and indirectly. Agree on the direct access part, but disagree about allowing to block unrelated tasks. Accessing a fuse backed object is basically agreeing to give the fuse server extra privileges. Documentation/filesystems/fuse.rst explains this below "How do non-privileged mounts work?". Thanks, Miklos
On Mon, Oct 21, 2024 at 2:05 PM Joanne Koong <joannelkoong@gmail.com> wrote: > > On Mon, Oct 21, 2024 at 3:15 AM Miklos Szeredi <miklos@szeredi.hu> wrote: > > > > On Fri, 18 Oct 2024 at 07:31, Shakeel Butt <shakeel.butt@linux.dev> wrote: > > > > > I feel like this is too much restrictive and I am still not sure why > > > blocking on fuse folios served by non-privileges fuse server is worse > > > than blocking on folios served from the network. > > > > Might be. But historically fuse had this behavior and I'd be very > > reluctant to change that unconditionally. > > > > With a systemwide maximal timeout for fuse requests it might make > > sense to allow sync(2), etc. to wait for fuse writeback. > > > > Without a timeout allowing fuse servers to block sync(2) indefinitely > > seems rather risky. > > Could we skip waiting on writeback in sync(2) if it's a fuse folio? > That seems in line with the sync(2) documentation Jingbo referenced > earlier where it states "The writing, although scheduled, is not > necessarily complete upon return from sync()." > https://pubs.opengroup.org/onlinepubs/9699919799/functions/sync.html > So I think the answer to this is "no" for Linux. What the Linux man page for sync(2) says: "According to the standard specification (e.g., POSIX.1-2001), sync() schedules the writes, but may return before the actual writing is done. However Linux waits for I/O completions, and thus sync() or syncfs() provide the same guarantees as fsync() called on every file in the system or filesystem respectively." [1] Regardless of the compaction / page migration issue then, this blocking sync(2) is a dealbreaker. I see two workarounds around this: 1) Optimistically skip the tmp page but add a timeout where if the server doesn't reply to the writeback in X seconds, then allocate the tmp folio and clear writeback immediately on the original folio). This would address any page migration deadlocks as well. And probably we don't need the reclaim patch either then, since that could also be handled by the timeout. This would make 99% of writebacks fast but in the case of a malicious fuse server, could make sync() and page migration wait an extra X seconds. 2) Only skip the tmp folio for privileged fuse servers (we'd still need to address the page migration path) Would love to hear thoughts on this. Should we go ahead with option 1? [1] https://man7.org/linux/man-pages/man2/sync.2.html > > Thanks, > Joanne > > > > Thanks, > > Miklos
On 10/25/24 12:54 AM, Joanne Koong wrote: > On Mon, Oct 21, 2024 at 2:05 PM Joanne Koong <joannelkoong@gmail.com> wrote: >> >> On Mon, Oct 21, 2024 at 3:15 AM Miklos Szeredi <miklos@szeredi.hu> wrote: >>> >>> On Fri, 18 Oct 2024 at 07:31, Shakeel Butt <shakeel.butt@linux.dev> wrote: >>> >>>> I feel like this is too much restrictive and I am still not sure why >>>> blocking on fuse folios served by non-privileges fuse server is worse >>>> than blocking on folios served from the network. >>> >>> Might be. But historically fuse had this behavior and I'd be very >>> reluctant to change that unconditionally. >>> >>> With a systemwide maximal timeout for fuse requests it might make >>> sense to allow sync(2), etc. to wait for fuse writeback. >>> >>> Without a timeout allowing fuse servers to block sync(2) indefinitely >>> seems rather risky. >> >> Could we skip waiting on writeback in sync(2) if it's a fuse folio? >> That seems in line with the sync(2) documentation Jingbo referenced >> earlier where it states "The writing, although scheduled, is not >> necessarily complete upon return from sync()." >> https://pubs.opengroup.org/onlinepubs/9699919799/functions/sync.html >> > > So I think the answer to this is "no" for Linux. What the Linux man > page for sync(2) says: > > "According to the standard specification (e.g., POSIX.1-2001), sync() > schedules the writes, but may return before the actual writing is > done. However Linux waits for I/O completions, and thus sync() or > syncfs() provide the same guarantees as fsync() called on every file > in the system or filesystem respectively." [1] Actually as for FUSE, IIUC the writeback is not guaranteed to be completed when sync(2) returns since the temp page mechanism. When sync(2) returns, PG_writeback is indeed cleared for all original pages (in the address_space), while the real writeback work (initiated from temp page) may be still in progress. I think this is also what Miklos means in: https://lore.kernel.org/all/CAJfpegsJKD4YT5R5qfXXE=hyqKvhpTRbD4m1wsYNbGB6k4rC2A@mail.gmail.com/ Though we need special handling for AS_NO_WRITEBACK_RECLAIM marked pages in sync(2) codepath similar to what we have done for the direct reclaim in patch 1. > > Regardless of the compaction / page migration issue then, this > blocking sync(2) is a dealbreaker. I really should have figureg out the compaction / page migration mechanism and the potential impact to FUSE when we dropping the temp page. Just too busy to take some time on this though.....
On Fri, 25 Oct 2024 at 03:38, Jingbo Xu <jefflexu@linux.alibaba.com> wrote: > Actually as for FUSE, IIUC the writeback is not guaranteed to be > completed when sync(2) returns since the temp page mechanism. When > sync(2) returns, PG_writeback is indeed cleared for all original pages > (in the address_space), while the real writeback work (initiated from > temp page) may be still in progress. Correct, this is the current behavior of fuse. I'm not against changing this for the privileged server case. I.e. a server running with certain privileges (e.g. CAP_SYS_ADMIN in the global namespace) then it should be able to opt in to waiting sync(2) behavior. > I think this is also what Miklos means in: > https://lore.kernel.org/all/CAJfpegsJKD4YT5R5qfXXE=hyqKvhpTRbD4m1wsYNbGB6k4rC2A@mail.gmail.com/ > > Though we need special handling for AS_NO_WRITEBACK_RECLAIM marked pages > in sync(2) codepath similar to what we have done for the direct reclaim > in patch 1. I'd love to get rid of the tmp page thing completely if the same guarantees can be implemented in the mm. Thanks, Miklos
On Thu, Oct 24, 2024 at 6:38 PM Jingbo Xu <jefflexu@linux.alibaba.com> wrote: > > > > On 10/25/24 12:54 AM, Joanne Koong wrote: > > On Mon, Oct 21, 2024 at 2:05 PM Joanne Koong <joannelkoong@gmail.com> wrote: > >> > >> On Mon, Oct 21, 2024 at 3:15 AM Miklos Szeredi <miklos@szeredi.hu> wrote: > >>> > >>> On Fri, 18 Oct 2024 at 07:31, Shakeel Butt <shakeel.butt@linux.dev> wrote: > >>> > >>>> I feel like this is too much restrictive and I am still not sure why > >>>> blocking on fuse folios served by non-privileges fuse server is worse > >>>> than blocking on folios served from the network. > >>> > >>> Might be. But historically fuse had this behavior and I'd be very > >>> reluctant to change that unconditionally. > >>> > >>> With a systemwide maximal timeout for fuse requests it might make > >>> sense to allow sync(2), etc. to wait for fuse writeback. > >>> > >>> Without a timeout allowing fuse servers to block sync(2) indefinitely > >>> seems rather risky. > >> > >> Could we skip waiting on writeback in sync(2) if it's a fuse folio? > >> That seems in line with the sync(2) documentation Jingbo referenced > >> earlier where it states "The writing, although scheduled, is not > >> necessarily complete upon return from sync()." > >> https://pubs.opengroup.org/onlinepubs/9699919799/functions/sync.html > >> > > > > So I think the answer to this is "no" for Linux. What the Linux man > > page for sync(2) says: > > > > "According to the standard specification (e.g., POSIX.1-2001), sync() > > schedules the writes, but may return before the actual writing is > > done. However Linux waits for I/O completions, and thus sync() or > > syncfs() provide the same guarantees as fsync() called on every file > > in the system or filesystem respectively." [1] > > Actually as for FUSE, IIUC the writeback is not guaranteed to be > completed when sync(2) returns since the temp page mechanism. When > sync(2) returns, PG_writeback is indeed cleared for all original pages > (in the address_space), while the real writeback work (initiated from > temp page) may be still in progress. > That's a great point. It seems like we can just skip waiting on writeback to finish for fuse folios in sync(2) altogether then. I'll look into what's the best way to do this. > I think this is also what Miklos means in: > https://lore.kernel.org/all/CAJfpegsJKD4YT5R5qfXXE=hyqKvhpTRbD4m1wsYNbGB6k4rC2A@mail.gmail.com/ > > Though we need special handling for AS_NO_WRITEBACK_RECLAIM marked pages > in sync(2) codepath similar to what we have done for the direct reclaim > in patch 1. > > > > > > Regardless of the compaction / page migration issue then, this > > blocking sync(2) is a dealbreaker. > > I really should have figureg out the compaction / page migration > mechanism and the potential impact to FUSE when we dropping the temp > page. Just too busy to take some time on this though..... Same here, I need to look some more into the compaction / page migration paths. I'm planning to do this early next week and will report back with what I find. Thanks, Joanne > > > -- > Thanks, > Jingbo
On Fri, 25 Oct 2024 at 19:36, Joanne Koong <joannelkoong@gmail.com> wrote: > That's a great point. It seems like we can just skip waiting on > writeback to finish for fuse folios in sync(2) altogether then. I'll > look into what's the best way to do this. I just tested this, and it turns out this doesn't quite work the way I'd expected. I can trigger sync(2) being blocked by a suspended fuse server: task:kworker/u16:3 state:D stack:0 pid:172 tgid:172 ppid:2 flags:0x00004000 Workqueue: writeback wb_workfn (flush-0:30) Call Trace: __schedule+0x40b/0xad0 schedule+0x36/0x120 inode_sleep_on_writeback+0x9d/0xb0 wb_writeback+0x104/0x3d0 wb_workfn+0x325/0x490 process_one_work+0x1d8/0x520 worker_thread+0x1af/0x390 kthread+0xcc/0x100 ret_from_fork+0x2d/0x50 ret_from_fork_asm+0x1a/0x30 task:dd state:S stack:0 pid:1364 tgid:1364 ppid:1336 flags:0x00000002 Call Trace: __schedule+0x40b/0xad0 schedule+0x36/0x120 request_wait_answer+0x16b/0x200 __fuse_simple_request+0xd6/0x290 fuse_flush_times+0x119/0x140 fuse_write_inode+0x6d/0xc0 __writeback_single_inode+0x36d/0x480 writeback_single_inode+0xa8/0x170 write_inode_now+0x75/0xa0 fuse_flush+0x85/0x1c0 filp_flush+0x2c/0x70 __x64_sys_close+0x2e/0x80 do_syscall_64+0x64/0x140 entry_SYSCALL_64_after_hwframe+0x76/0x7e task:sync state:D stack:0 pid:1365 tgid:1365 ppid:1336 flags:0x00004002 Call Trace: __schedule+0x40b/0xad0 schedule+0x36/0x120 wb_wait_for_completion+0x56/0x80 sync_inodes_sb+0xc5/0x450 iterate_supers+0x69/0xd0 ksys_sync+0x40/0xa0 __do_sys_sync+0xa/0x20 do_syscall_64+0x64/0x140 entry_SYSCALL_64_after_hwframe+0x76/0x7e Maybe I'm too paranoid about this, and in practice we can just let sync(2) block in any case. But I want to understand this better first. Thanks, Miklos
On Fri, Oct 25, 2024 at 11:02 AM Miklos Szeredi <miklos@szeredi.hu> wrote: > > On Fri, 25 Oct 2024 at 19:36, Joanne Koong <joannelkoong@gmail.com> wrote: > > > That's a great point. It seems like we can just skip waiting on > > writeback to finish for fuse folios in sync(2) altogether then. I'll > > look into what's the best way to do this. > > I just tested this, and it turns out this doesn't quite work the way > I'd expected. I can trigger sync(2) being blocked by a suspended fuse > server: > > task:kworker/u16:3 state:D stack:0 pid:172 tgid:172 ppid:2 > flags:0x00004000 > Workqueue: writeback wb_workfn (flush-0:30) > Call Trace: > __schedule+0x40b/0xad0 > schedule+0x36/0x120 > inode_sleep_on_writeback+0x9d/0xb0 > wb_writeback+0x104/0x3d0 > wb_workfn+0x325/0x490 > process_one_work+0x1d8/0x520 > worker_thread+0x1af/0x390 > kthread+0xcc/0x100 > ret_from_fork+0x2d/0x50 > ret_from_fork_asm+0x1a/0x30 > > task:dd state:S stack:0 pid:1364 tgid:1364 > ppid:1336 flags:0x00000002 > Call Trace: > __schedule+0x40b/0xad0 > schedule+0x36/0x120 > request_wait_answer+0x16b/0x200 > __fuse_simple_request+0xd6/0x290 > fuse_flush_times+0x119/0x140 > fuse_write_inode+0x6d/0xc0 > __writeback_single_inode+0x36d/0x480 > writeback_single_inode+0xa8/0x170 > write_inode_now+0x75/0xa0 > fuse_flush+0x85/0x1c0 > filp_flush+0x2c/0x70 > __x64_sys_close+0x2e/0x80 > do_syscall_64+0x64/0x140 > entry_SYSCALL_64_after_hwframe+0x76/0x7e > > task:sync state:D stack:0 pid:1365 tgid:1365 > ppid:1336 flags:0x00004002 > Call Trace: > __schedule+0x40b/0xad0 > schedule+0x36/0x120 > wb_wait_for_completion+0x56/0x80 > sync_inodes_sb+0xc5/0x450 > iterate_supers+0x69/0xd0 > ksys_sync+0x40/0xa0 > __do_sys_sync+0xa/0x20 > do_syscall_64+0x64/0x140 > entry_SYSCALL_64_after_hwframe+0x76/0x7e > Thanks for the trace. If i'm understanding it correctly, this only blocks temporarily until the writeback wb_workfn is rescheduled? > Maybe I'm too paranoid about this, and in practice we can just let > sync(2) block in any case. But I want to understand this better in the case of a malicious fuse server, it could block sync(2) forever so I think this means we have to skip the wait for fuse folios altogether. Thanks, Joanne > first. > > Thanks, > Miklos
On Fri, Oct 25, 2024 at 10:36 AM Joanne Koong <joannelkoong@gmail.com> wrote: > > On Thu, Oct 24, 2024 at 6:38 PM Jingbo Xu <jefflexu@linux.alibaba.com> wrote: > > > > > > > > On 10/25/24 12:54 AM, Joanne Koong wrote: > > > On Mon, Oct 21, 2024 at 2:05 PM Joanne Koong <joannelkoong@gmail.com> wrote: > > >> > > >> On Mon, Oct 21, 2024 at 3:15 AM Miklos Szeredi <miklos@szeredi.hu> wrote: > > >>> > > >>> On Fri, 18 Oct 2024 at 07:31, Shakeel Butt <shakeel.butt@linux.dev> wrote: > > >>> > > >>>> I feel like this is too much restrictive and I am still not sure why > > >>>> blocking on fuse folios served by non-privileges fuse server is worse > > >>>> than blocking on folios served from the network. > > >>> > > >>> Might be. But historically fuse had this behavior and I'd be very > > >>> reluctant to change that unconditionally. > > >>> > > >>> With a systemwide maximal timeout for fuse requests it might make > > >>> sense to allow sync(2), etc. to wait for fuse writeback. > > >>> > > >>> Without a timeout allowing fuse servers to block sync(2) indefinitely > > >>> seems rather risky. > > >> > > >> Could we skip waiting on writeback in sync(2) if it's a fuse folio? > > >> That seems in line with the sync(2) documentation Jingbo referenced > > >> earlier where it states "The writing, although scheduled, is not > > >> necessarily complete upon return from sync()." > > >> https://pubs.opengroup.org/onlinepubs/9699919799/functions/sync.html > > >> > > > > > > So I think the answer to this is "no" for Linux. What the Linux man > > > page for sync(2) says: > > > > > > "According to the standard specification (e.g., POSIX.1-2001), sync() > > > schedules the writes, but may return before the actual writing is > > > done. However Linux waits for I/O completions, and thus sync() or > > > syncfs() provide the same guarantees as fsync() called on every file > > > in the system or filesystem respectively." [1] > > > > Actually as for FUSE, IIUC the writeback is not guaranteed to be > > completed when sync(2) returns since the temp page mechanism. When > > sync(2) returns, PG_writeback is indeed cleared for all original pages > > (in the address_space), while the real writeback work (initiated from > > temp page) may be still in progress. > > > > That's a great point. It seems like we can just skip waiting on > writeback to finish for fuse folios in sync(2) altogether then. I'll > look into what's the best way to do this. I think the most straightforward way to do this for sync(2) is to add the mapping check inside sync_bdevs(). With something like: diff --git a/block/bdev.c b/block/bdev.c index 738e3c8457e7..bcb2b6d3db94 100644 --- a/block/bdev.c +++ b/block/bdev.c @@ -1247,7 +1247,7 @@ void sync_bdevs(bool wait) mutex_lock(&bdev->bd_disk->open_mutex); if (!atomic_read(&bdev->bd_openers)) { ; /* skip */ - } else if (wait) { + } else if (wait && !mapping_no_writeback_wait(inode->i_mapping)) { /* * We keep the error status of individual mapping so * that applications can catch the writeback error using and changing AS_NO_WRITEBACK_RECLAIM to AS_NO_WRITEBACK_WAIT. Thanks, Joanne >
On Fri, Oct 25, 2024 at 10:36 AM Joanne Koong <joannelkoong@gmail.com> wrote: > > On Thu, Oct 24, 2024 at 6:38 PM Jingbo Xu <jefflexu@linux.alibaba.com> wrote: > > > > > > > > On 10/25/24 12:54 AM, Joanne Koong wrote: > > > On Mon, Oct 21, 2024 at 2:05 PM Joanne Koong <joannelkoong@gmail.com> wrote: > > >> > > >> On Mon, Oct 21, 2024 at 3:15 AM Miklos Szeredi <miklos@szeredi.hu> wrote: > > >>> > > >>> On Fri, 18 Oct 2024 at 07:31, Shakeel Butt <shakeel.butt@linux.dev> wrote: > > >>> > > >>>> I feel like this is too much restrictive and I am still not sure why > > >>>> blocking on fuse folios served by non-privileges fuse server is worse > > >>>> than blocking on folios served from the network. > > >>> > > >>> Might be. But historically fuse had this behavior and I'd be very > > >>> reluctant to change that unconditionally. > > >>> > > >>> With a systemwide maximal timeout for fuse requests it might make > > >>> sense to allow sync(2), etc. to wait for fuse writeback. > > >>> > > >>> Without a timeout allowing fuse servers to block sync(2) indefinitely > > >>> seems rather risky. > > >> > > >> Could we skip waiting on writeback in sync(2) if it's a fuse folio? > > >> That seems in line with the sync(2) documentation Jingbo referenced > > >> earlier where it states "The writing, although scheduled, is not > > >> necessarily complete upon return from sync()." > > >> https://pubs.opengroup.org/onlinepubs/9699919799/functions/sync.html > > >> > > > > > > So I think the answer to this is "no" for Linux. What the Linux man > > > page for sync(2) says: > > > > > > "According to the standard specification (e.g., POSIX.1-2001), sync() > > > schedules the writes, but may return before the actual writing is > > > done. However Linux waits for I/O completions, and thus sync() or > > > syncfs() provide the same guarantees as fsync() called on every file > > > in the system or filesystem respectively." [1] > > > > Actually as for FUSE, IIUC the writeback is not guaranteed to be > > completed when sync(2) returns since the temp page mechanism. When > > sync(2) returns, PG_writeback is indeed cleared for all original pages > > (in the address_space), while the real writeback work (initiated from > > temp page) may be still in progress. > > > > That's a great point. It seems like we can just skip waiting on > writeback to finish for fuse folios in sync(2) altogether then. I'll > look into what's the best way to do this. > > > I think this is also what Miklos means in: > > https://lore.kernel.org/all/CAJfpegsJKD4YT5R5qfXXE=hyqKvhpTRbD4m1wsYNbGB6k4rC2A@mail.gmail.com/ > > > > Though we need special handling for AS_NO_WRITEBACK_RECLAIM marked pages > > in sync(2) codepath similar to what we have done for the direct reclaim > > in patch 1. > > > > > > > > > > Regardless of the compaction / page migration issue then, this > > > blocking sync(2) is a dealbreaker. > > > > I really should have figureg out the compaction / page migration > > mechanism and the potential impact to FUSE when we dropping the temp > > page. Just too busy to take some time on this though..... > > Same here, I need to look some more into the compaction / page > migration paths. I'm planning to do this early next week and will > report back with what I find. > These are my notes so far: * We hit the folio_wait_writeback() path when callers call migrate_pages() with mode MIGRATE_SYNC ... -> migrate_pages() -> migrate_pages_sync() -> migrate_pages_batch() -> migrate_folio_unmap() -> folio_wait_writeback() * These are the places where we call migrate_pages(): 1) demote_folio_list() Can ignore this. It calls migrate_pages() in MIGRATE_ASYNC mode 2) __damon_pa_migrate_folio_list() Can ignore this. It calls migrate_pages() in MIGRATE_ASYNC mode 3) migrate_misplaced_folio() Can ignore this. It calls migrate_pages() in MIGRATE_ASYNC mode 4) do_move_pages_to_node() Can ignore this. This calls migrate_pages() in MIGRATE_SYNC mode but this path is only invoked by the move_pages() syscall. It's fine to wait on writeback for the move_pages() syscall since the user would have to deliberately invoke this on the fuse server for this to apply to the server's fuse folios 5) migrate_to_node() Can ignore this for the same reason as in 4. This path is only invoked by the migrate_pages() syscall. 6) do_mbind() Can ignore this for the same reason as 4 and 5. This path is only invoked by the mbind() syscall. 7) soft_offline_in_use_page() Can skip soft offlining fuse folios (eg folios with the AS_NO_WRITEBACK_WAIT mapping flag set). The path for this is soft_offline_page() -> soft_offline_in_use_page() -> migrate_pages(). soft_offline_page() only invokes this for in-use pages in a well-defined state (see ret value of get_hwpoison_page()). My understanding of soft offlining pages is that it's a mitigation strategy for handling pages that are experiencing errors but are not yet completely unusable, and its main purpose is to prevent future issues. It seems fine to skip this for fuse folios. 8) do_migrate_range() 9) compact_zone() 10) migrate_longterm_unpinnable_folios() 11) __alloc_contig_migrate_range() 8 to 11 needs more investigation / thinking about. I don't see a good way around these tbh. I think we have to operate under the assumption that the fuse server running is malicious or benevolently but incorrectly written and could possibly never complete writeback. So we definitely can't wait on these but it also doesn't seem like we can skip waiting on these, especially for the case where the server uses spliced pages, nor does it seem like we can just fail these with -EBUSY or something. Will continue looking more into this early next week. Thanks, Joanne > > Thanks, > Joanne > > > > > > -- > > Thanks, > > Jingbo
On 10/26/24 2:19 AM, Joanne Koong wrote: > On Fri, Oct 25, 2024 at 11:02 AM Miklos Szeredi <miklos@szeredi.hu> wrote: >> >> On Fri, 25 Oct 2024 at 19:36, Joanne Koong <joannelkoong@gmail.com> wrote: >> >>> That's a great point. It seems like we can just skip waiting on >>> writeback to finish for fuse folios in sync(2) altogether then. I'll >>> look into what's the best way to do this. >> >> I just tested this, and it turns out this doesn't quite work the way >> I'd expected. I can trigger sync(2) being blocked by a suspended fuse >> server: >> >> task:kworker/u16:3 state:D stack:0 pid:172 tgid:172 ppid:2 >> flags:0x00004000 >> Workqueue: writeback wb_workfn (flush-0:30) >> Call Trace: >> __schedule+0x40b/0xad0 >> schedule+0x36/0x120 >> inode_sleep_on_writeback+0x9d/0xb0 >> wb_writeback+0x104/0x3d0 >> wb_workfn+0x325/0x490 >> process_one_work+0x1d8/0x520 >> worker_thread+0x1af/0x390 >> kthread+0xcc/0x100 >> ret_from_fork+0x2d/0x50 >> ret_from_fork_asm+0x1a/0x30 >> >> task:dd state:S stack:0 pid:1364 tgid:1364 >> ppid:1336 flags:0x00000002 >> Call Trace: >> __schedule+0x40b/0xad0 >> schedule+0x36/0x120 >> request_wait_answer+0x16b/0x200 >> __fuse_simple_request+0xd6/0x290 >> fuse_flush_times+0x119/0x140 >> fuse_write_inode+0x6d/0xc0 >> __writeback_single_inode+0x36d/0x480 >> writeback_single_inode+0xa8/0x170 >> write_inode_now+0x75/0xa0 >> fuse_flush+0x85/0x1c0 >> filp_flush+0x2c/0x70 >> __x64_sys_close+0x2e/0x80 >> do_syscall_64+0x64/0x140 >> entry_SYSCALL_64_after_hwframe+0x76/0x7e >> >> task:sync state:D stack:0 pid:1365 tgid:1365 >> ppid:1336 flags:0x00004002 >> Call Trace: >> __schedule+0x40b/0xad0 >> schedule+0x36/0x120 >> wb_wait_for_completion+0x56/0x80 >> sync_inodes_sb+0xc5/0x450 >> iterate_supers+0x69/0xd0 >> ksys_sync+0x40/0xa0 >> __do_sys_sync+0xa/0x20 >> do_syscall_64+0x64/0x140 >> entry_SYSCALL_64_after_hwframe+0x76/0x7e >> > > Thanks for the trace. If i'm understanding it correctly, this only > blocks temporarily until the writeback wb_workfn is rescheduled? > >> Maybe I'm too paranoid about this, and in practice we can just let >> sync(2) block in any case. But I want to understand this better > > in the case of a malicious fuse server, it could block sync(2) forever > so I think this means we have to skip the wait for fuse folios > altogether. Right. In summary we need to skip waiting on the completion for the writeback in case of a malicious fuse server could block sync(2) forever, and meanwhile the change won't break the backward compatibility as the current fuse implementation also doesn't wait for the data actually being written back and persistent on the backstore in sync(2).
On 10/26/24 2:47 AM, Joanne Koong wrote: > On Fri, Oct 25, 2024 at 10:36 AM Joanne Koong <joannelkoong@gmail.com> wrote: >> >> On Thu, Oct 24, 2024 at 6:38 PM Jingbo Xu <jefflexu@linux.alibaba.com> wrote: >>> >>> >>> >>> On 10/25/24 12:54 AM, Joanne Koong wrote: >>>> On Mon, Oct 21, 2024 at 2:05 PM Joanne Koong <joannelkoong@gmail.com> wrote: >>>>> >>>>> On Mon, Oct 21, 2024 at 3:15 AM Miklos Szeredi <miklos@szeredi.hu> wrote: >>>>>> >>>>>> On Fri, 18 Oct 2024 at 07:31, Shakeel Butt <shakeel.butt@linux.dev> wrote: >>>>>> >>>>>>> I feel like this is too much restrictive and I am still not sure why >>>>>>> blocking on fuse folios served by non-privileges fuse server is worse >>>>>>> than blocking on folios served from the network. >>>>>> >>>>>> Might be. But historically fuse had this behavior and I'd be very >>>>>> reluctant to change that unconditionally. >>>>>> >>>>>> With a systemwide maximal timeout for fuse requests it might make >>>>>> sense to allow sync(2), etc. to wait for fuse writeback. >>>>>> >>>>>> Without a timeout allowing fuse servers to block sync(2) indefinitely >>>>>> seems rather risky. >>>>> >>>>> Could we skip waiting on writeback in sync(2) if it's a fuse folio? >>>>> That seems in line with the sync(2) documentation Jingbo referenced >>>>> earlier where it states "The writing, although scheduled, is not >>>>> necessarily complete upon return from sync()." >>>>> https://pubs.opengroup.org/onlinepubs/9699919799/functions/sync.html >>>>> >>>> >>>> So I think the answer to this is "no" for Linux. What the Linux man >>>> page for sync(2) says: >>>> >>>> "According to the standard specification (e.g., POSIX.1-2001), sync() >>>> schedules the writes, but may return before the actual writing is >>>> done. However Linux waits for I/O completions, and thus sync() or >>>> syncfs() provide the same guarantees as fsync() called on every file >>>> in the system or filesystem respectively." [1] >>> >>> Actually as for FUSE, IIUC the writeback is not guaranteed to be >>> completed when sync(2) returns since the temp page mechanism. When >>> sync(2) returns, PG_writeback is indeed cleared for all original pages >>> (in the address_space), while the real writeback work (initiated from >>> temp page) may be still in progress. >>> >> >> That's a great point. It seems like we can just skip waiting on >> writeback to finish for fuse folios in sync(2) altogether then. I'll >> look into what's the best way to do this. > > I think the most straightforward way to do this for sync(2) is to add > the mapping check inside sync_bdevs(). With something like: > > diff --git a/block/bdev.c b/block/bdev.c > index 738e3c8457e7..bcb2b6d3db94 100644 > --- a/block/bdev.c > +++ b/block/bdev.c > @@ -1247,7 +1247,7 @@ void sync_bdevs(bool wait) > mutex_lock(&bdev->bd_disk->open_mutex); > if (!atomic_read(&bdev->bd_openers)) { > ; /* skip */ > - } else if (wait) { > + } else if (wait && > !mapping_no_writeback_wait(inode->i_mapping)) { > /* > * We keep the error status of individual mapping so > * that applications can catch the writeback error using > > I'm afraid we are waiting in wait_sb_inodes (ksys_sync -> sync_inodes_sb -> wait_sb_inodes) rather than sync_bdevs. sync_bdevs() is used to writeback and sync the metadata residing on the block device directly such as the superblock. It is sync_inodes_one_sb() that actually writeback inodes.
On Sun, Oct 27, 2024 at 7:28 PM Jingbo Xu <jefflexu@linux.alibaba.com> wrote: > > > > On 10/26/24 2:47 AM, Joanne Koong wrote: > > On Fri, Oct 25, 2024 at 10:36 AM Joanne Koong <joannelkoong@gmail.com> wrote: > >> > >> On Thu, Oct 24, 2024 at 6:38 PM Jingbo Xu <jefflexu@linux.alibaba.com> wrote: > >>> > >>> > >>> > >>> On 10/25/24 12:54 AM, Joanne Koong wrote: > >>>> On Mon, Oct 21, 2024 at 2:05 PM Joanne Koong <joannelkoong@gmail.com> wrote: > >>>>> > >>>>> On Mon, Oct 21, 2024 at 3:15 AM Miklos Szeredi <miklos@szeredi.hu> wrote: > >>>>>> > >>>>>> On Fri, 18 Oct 2024 at 07:31, Shakeel Butt <shakeel.butt@linux.dev> wrote: > >>>>>> > >>>>>>> I feel like this is too much restrictive and I am still not sure why > >>>>>>> blocking on fuse folios served by non-privileges fuse server is worse > >>>>>>> than blocking on folios served from the network. > >>>>>> > >>>>>> Might be. But historically fuse had this behavior and I'd be very > >>>>>> reluctant to change that unconditionally. > >>>>>> > >>>>>> With a systemwide maximal timeout for fuse requests it might make > >>>>>> sense to allow sync(2), etc. to wait for fuse writeback. > >>>>>> > >>>>>> Without a timeout allowing fuse servers to block sync(2) indefinitely > >>>>>> seems rather risky. > >>>>> > >>>>> Could we skip waiting on writeback in sync(2) if it's a fuse folio? > >>>>> That seems in line with the sync(2) documentation Jingbo referenced > >>>>> earlier where it states "The writing, although scheduled, is not > >>>>> necessarily complete upon return from sync()." > >>>>> https://pubs.opengroup.org/onlinepubs/9699919799/functions/sync.html > >>>>> > >>>> > >>>> So I think the answer to this is "no" for Linux. What the Linux man > >>>> page for sync(2) says: > >>>> > >>>> "According to the standard specification (e.g., POSIX.1-2001), sync() > >>>> schedules the writes, but may return before the actual writing is > >>>> done. However Linux waits for I/O completions, and thus sync() or > >>>> syncfs() provide the same guarantees as fsync() called on every file > >>>> in the system or filesystem respectively." [1] > >>> > >>> Actually as for FUSE, IIUC the writeback is not guaranteed to be > >>> completed when sync(2) returns since the temp page mechanism. When > >>> sync(2) returns, PG_writeback is indeed cleared for all original pages > >>> (in the address_space), while the real writeback work (initiated from > >>> temp page) may be still in progress. > >>> > >> > >> That's a great point. It seems like we can just skip waiting on > >> writeback to finish for fuse folios in sync(2) altogether then. I'll > >> look into what's the best way to do this. > > > > I think the most straightforward way to do this for sync(2) is to add > > the mapping check inside sync_bdevs(). With something like: > > > > diff --git a/block/bdev.c b/block/bdev.c > > index 738e3c8457e7..bcb2b6d3db94 100644 > > --- a/block/bdev.c > > +++ b/block/bdev.c > > @@ -1247,7 +1247,7 @@ void sync_bdevs(bool wait) > > mutex_lock(&bdev->bd_disk->open_mutex); > > if (!atomic_read(&bdev->bd_openers)) { > > ; /* skip */ > > - } else if (wait) { > > + } else if (wait && > > !mapping_no_writeback_wait(inode->i_mapping)) { > > /* > > * We keep the error status of individual mapping so > > * that applications can catch the writeback error using > > > > > > I'm afraid we are waiting in wait_sb_inodes (ksys_sync -> sync_inodes_sb > -> wait_sb_inodes) rather than sync_bdevs. sync_bdevs() is used to > writeback and sync the metadata residing on the block device directly > such as the superblock. It is sync_inodes_one_sb() that actually > writeback inodes. > Great point, thanks for the info! > > -- > Thanks, > Jingbo
On Fri, Oct 25, 2024 at 3:40 PM Joanne Koong <joannelkoong@gmail.com> wrote: > > On Fri, Oct 25, 2024 at 10:36 AM Joanne Koong <joannelkoong@gmail.com> wrote: > > > > On Thu, Oct 24, 2024 at 6:38 PM Jingbo Xu <jefflexu@linux.alibaba.com> wrote: > > > > > > > > > > > > On 10/25/24 12:54 AM, Joanne Koong wrote: > > > > On Mon, Oct 21, 2024 at 2:05 PM Joanne Koong <joannelkoong@gmail.com> wrote: > > > >> > > > >> On Mon, Oct 21, 2024 at 3:15 AM Miklos Szeredi <miklos@szeredi.hu> wrote: > > > >>> > > > >>> On Fri, 18 Oct 2024 at 07:31, Shakeel Butt <shakeel.butt@linux.dev> wrote: > > > >>> > > > >>>> I feel like this is too much restrictive and I am still not sure why > > > >>>> blocking on fuse folios served by non-privileges fuse server is worse > > > >>>> than blocking on folios served from the network. > > > >>> > > > >>> Might be. But historically fuse had this behavior and I'd be very > > > >>> reluctant to change that unconditionally. > > > >>> > > > >>> With a systemwide maximal timeout for fuse requests it might make > > > >>> sense to allow sync(2), etc. to wait for fuse writeback. > > > >>> > > > >>> Without a timeout allowing fuse servers to block sync(2) indefinitely > > > >>> seems rather risky. > > > >> > > > >> Could we skip waiting on writeback in sync(2) if it's a fuse folio? > > > >> That seems in line with the sync(2) documentation Jingbo referenced > > > >> earlier where it states "The writing, although scheduled, is not > > > >> necessarily complete upon return from sync()." > > > >> https://pubs.opengroup.org/onlinepubs/9699919799/functions/sync.html > > > >> > > > > > > > > So I think the answer to this is "no" for Linux. What the Linux man > > > > page for sync(2) says: > > > > > > > > "According to the standard specification (e.g., POSIX.1-2001), sync() > > > > schedules the writes, but may return before the actual writing is > > > > done. However Linux waits for I/O completions, and thus sync() or > > > > syncfs() provide the same guarantees as fsync() called on every file > > > > in the system or filesystem respectively." [1] > > > > > > Actually as for FUSE, IIUC the writeback is not guaranteed to be > > > completed when sync(2) returns since the temp page mechanism. When > > > sync(2) returns, PG_writeback is indeed cleared for all original pages > > > (in the address_space), while the real writeback work (initiated from > > > temp page) may be still in progress. > > > > > > > That's a great point. It seems like we can just skip waiting on > > writeback to finish for fuse folios in sync(2) altogether then. I'll > > look into what's the best way to do this. > > > > > I think this is also what Miklos means in: > > > https://lore.kernel.org/all/CAJfpegsJKD4YT5R5qfXXE=hyqKvhpTRbD4m1wsYNbGB6k4rC2A@mail.gmail.com/ > > > > > > Though we need special handling for AS_NO_WRITEBACK_RECLAIM marked pages > > > in sync(2) codepath similar to what we have done for the direct reclaim > > > in patch 1. > > > > > > > > > > > > > > Regardless of the compaction / page migration issue then, this > > > > blocking sync(2) is a dealbreaker. > > > > > > I really should have figureg out the compaction / page migration > > > mechanism and the potential impact to FUSE when we dropping the temp > > > page. Just too busy to take some time on this though..... > > > > Same here, I need to look some more into the compaction / page > > migration paths. I'm planning to do this early next week and will > > report back with what I find. > > > > These are my notes so far: > > * We hit the folio_wait_writeback() path when callers call > migrate_pages() with mode MIGRATE_SYNC > ... -> migrate_pages() -> migrate_pages_sync() -> > migrate_pages_batch() -> migrate_folio_unmap() -> > folio_wait_writeback() > > * These are the places where we call migrate_pages(): > 1) demote_folio_list() > Can ignore this. It calls migrate_pages() in MIGRATE_ASYNC mode > > 2) __damon_pa_migrate_folio_list() > Can ignore this. It calls migrate_pages() in MIGRATE_ASYNC mode > > 3) migrate_misplaced_folio() > Can ignore this. It calls migrate_pages() in MIGRATE_ASYNC mode > > 4) do_move_pages_to_node() > Can ignore this. This calls migrate_pages() in MIGRATE_SYNC mode but > this path is only invoked by the move_pages() syscall. It's fine to > wait on writeback for the move_pages() syscall since the user would > have to deliberately invoke this on the fuse server for this to apply > to the server's fuse folios > > 5) migrate_to_node() > Can ignore this for the same reason as in 4. This path is only invoked > by the migrate_pages() syscall. > > 6) do_mbind() > Can ignore this for the same reason as 4 and 5. This path is only > invoked by the mbind() syscall. > > 7) soft_offline_in_use_page() > Can skip soft offlining fuse folios (eg folios with the > AS_NO_WRITEBACK_WAIT mapping flag set). > The path for this is soft_offline_page() -> soft_offline_in_use_page() > -> migrate_pages(). soft_offline_page() only invokes this for in-use > pages in a well-defined state (see ret value of get_hwpoison_page()). > My understanding of soft offlining pages is that it's a mitigation > strategy for handling pages that are experiencing errors but are not > yet completely unusable, and its main purpose is to prevent future > issues. It seems fine to skip this for fuse folios. > > 8) do_migrate_range() > 9) compact_zone() > 10) migrate_longterm_unpinnable_folios() > 11) __alloc_contig_migrate_range() > > 8 to 11 needs more investigation / thinking about. I don't see a good > way around these tbh. I think we have to operate under the assumption > that the fuse server running is malicious or benevolently but > incorrectly written and could possibly never complete writeback. So we > definitely can't wait on these but it also doesn't seem like we can > skip waiting on these, especially for the case where the server uses > spliced pages, nor does it seem like we can just fail these with > -EBUSY or something. > I'm still not seeing a good way around this. What about this then? We add a new fuse sysctl called something like "/proc/sys/fs/fuse/writeback_optimization_timeout" where if the sys admin sets this, then it opts into optimizing writeback to be as fast as possible (eg skipping the page copies) and if the server doesn't fulfill the writeback by the set timeout value, then the connection is aborted. Alternatively, we could also repurpose /proc/sys/fs/fuse/max_request_timeout from the request timeout patchset [1] but I like the additional flexibility and explicitness having the "writeback_optimization_timeout" sysctl gives. Any thoughts on this? [1] https://lore.kernel.org/linux-fsdevel/20241011191320.91592-4-joannelkoong@gmail.com/ Thanks, Joanne > Will continue looking more into this early next week. > > Thanks, > Joanne > > > > Thanks, > > Joanne > > > > > > > > > -- > > > Thanks, > > > Jingbo
On 10/17/24 15:31, Miklos Szeredi wrote: > On Wed, 16 Oct 2024 at 23:27, Shakeel Butt <shakeel.butt@linux.dev> wrote: > >> Why is it bad? I can understand fuse server getting blocked on fuse >> folios is bad but why it is bad for other applications/tasks? I am >> wondering network filesystems have to handle similar situation then why >> is it bad just for fuse? > > You need privileges (physical access) to unplug the network cable. > You don't need privileges (in most setups) to run a fuse server. Not sure if that is the perfect example, you can also run a local user space nfs server. I think the real difference is 'fusermount' which typically has the s-bit set and allows mounts from unprivileged users. I.e. I really think we should differentiate between privileged/non-privileged fuse server tasks. Thanks, Bernd
On 10/28/24 22:58, Joanne Koong wrote: > On Fri, Oct 25, 2024 at 3:40 PM Joanne Koong <joannelkoong@gmail.com> wrote: >> >>> Same here, I need to look some more into the compaction / page >>> migration paths. I'm planning to do this early next week and will >>> report back with what I find. >>> >> >> These are my notes so far: >> >> * We hit the folio_wait_writeback() path when callers call >> migrate_pages() with mode MIGRATE_SYNC >> ... -> migrate_pages() -> migrate_pages_sync() -> >> migrate_pages_batch() -> migrate_folio_unmap() -> >> folio_wait_writeback() >> >> * These are the places where we call migrate_pages(): >> 1) demote_folio_list() >> Can ignore this. It calls migrate_pages() in MIGRATE_ASYNC mode >> >> 2) __damon_pa_migrate_folio_list() >> Can ignore this. It calls migrate_pages() in MIGRATE_ASYNC mode >> >> 3) migrate_misplaced_folio() >> Can ignore this. It calls migrate_pages() in MIGRATE_ASYNC mode >> >> 4) do_move_pages_to_node() >> Can ignore this. This calls migrate_pages() in MIGRATE_SYNC mode but >> this path is only invoked by the move_pages() syscall. It's fine to >> wait on writeback for the move_pages() syscall since the user would >> have to deliberately invoke this on the fuse server for this to apply >> to the server's fuse folios >> >> 5) migrate_to_node() >> Can ignore this for the same reason as in 4. This path is only invoked >> by the migrate_pages() syscall. >> >> 6) do_mbind() >> Can ignore this for the same reason as 4 and 5. This path is only >> invoked by the mbind() syscall. >> >> 7) soft_offline_in_use_page() >> Can skip soft offlining fuse folios (eg folios with the >> AS_NO_WRITEBACK_WAIT mapping flag set). >> The path for this is soft_offline_page() -> soft_offline_in_use_page() >> -> migrate_pages(). soft_offline_page() only invokes this for in-use >> pages in a well-defined state (see ret value of get_hwpoison_page()). >> My understanding of soft offlining pages is that it's a mitigation >> strategy for handling pages that are experiencing errors but are not >> yet completely unusable, and its main purpose is to prevent future >> issues. It seems fine to skip this for fuse folios. >> >> 8) do_migrate_range() >> 9) compact_zone() >> 10) migrate_longterm_unpinnable_folios() >> 11) __alloc_contig_migrate_range() >> >> 8 to 11 needs more investigation / thinking about. I don't see a good >> way around these tbh. I think we have to operate under the assumption >> that the fuse server running is malicious or benevolently but >> incorrectly written and could possibly never complete writeback. So we >> definitely can't wait on these but it also doesn't seem like we can >> skip waiting on these, especially for the case where the server uses >> spliced pages, nor does it seem like we can just fail these with >> -EBUSY or something. I see some code paths with -EAGAIN in migration. Could you explain why we can't just fail migration for fuse write-back pages? >> > > I'm still not seeing a good way around this. > > What about this then? We add a new fuse sysctl called something like > "/proc/sys/fs/fuse/writeback_optimization_timeout" where if the sys > admin sets this, then it opts into optimizing writeback to be as fast > as possible (eg skipping the page copies) and if the server doesn't > fulfill the writeback by the set timeout value, then the connection is > aborted. > > Alternatively, we could also repurpose > /proc/sys/fs/fuse/max_request_timeout from the request timeout > patchset [1] but I like the additional flexibility and explicitness > having the "writeback_optimization_timeout" sysctl gives. > > Any thoughts on this? I'm a bit worried that we might lock up the system until time out is reached - not ideal. Especially as timeouts are in minutes now. But even a slightly stuttering video system not be great. I think we should give users/admin the choice then, if they prefer slow page copies or fast, but possibly shortly unresponsive system. Thank, Bernd
On Wed, Oct 30, 2024 at 2:32 AM Bernd Schubert <bernd.schubert@fastmail.fm> wrote: > > On 10/28/24 22:58, Joanne Koong wrote: > > On Fri, Oct 25, 2024 at 3:40 PM Joanne Koong <joannelkoong@gmail.com> wrote: > >> > >>> Same here, I need to look some more into the compaction / page > >>> migration paths. I'm planning to do this early next week and will > >>> report back with what I find. > >>> > >> > >> These are my notes so far: > >> > >> * We hit the folio_wait_writeback() path when callers call > >> migrate_pages() with mode MIGRATE_SYNC > >> ... -> migrate_pages() -> migrate_pages_sync() -> > >> migrate_pages_batch() -> migrate_folio_unmap() -> > >> folio_wait_writeback() > >> > >> * These are the places where we call migrate_pages(): > >> 1) demote_folio_list() > >> Can ignore this. It calls migrate_pages() in MIGRATE_ASYNC mode > >> > >> 2) __damon_pa_migrate_folio_list() > >> Can ignore this. It calls migrate_pages() in MIGRATE_ASYNC mode > >> > >> 3) migrate_misplaced_folio() > >> Can ignore this. It calls migrate_pages() in MIGRATE_ASYNC mode > >> > >> 4) do_move_pages_to_node() > >> Can ignore this. This calls migrate_pages() in MIGRATE_SYNC mode but > >> this path is only invoked by the move_pages() syscall. It's fine to > >> wait on writeback for the move_pages() syscall since the user would > >> have to deliberately invoke this on the fuse server for this to apply > >> to the server's fuse folios > >> > >> 5) migrate_to_node() > >> Can ignore this for the same reason as in 4. This path is only invoked > >> by the migrate_pages() syscall. > >> > >> 6) do_mbind() > >> Can ignore this for the same reason as 4 and 5. This path is only > >> invoked by the mbind() syscall. > >> > >> 7) soft_offline_in_use_page() > >> Can skip soft offlining fuse folios (eg folios with the > >> AS_NO_WRITEBACK_WAIT mapping flag set). > >> The path for this is soft_offline_page() -> soft_offline_in_use_page() > >> -> migrate_pages(). soft_offline_page() only invokes this for in-use > >> pages in a well-defined state (see ret value of get_hwpoison_page()). > >> My understanding of soft offlining pages is that it's a mitigation > >> strategy for handling pages that are experiencing errors but are not > >> yet completely unusable, and its main purpose is to prevent future > >> issues. It seems fine to skip this for fuse folios. > >> > >> 8) do_migrate_range() > >> 9) compact_zone() > >> 10) migrate_longterm_unpinnable_folios() > >> 11) __alloc_contig_migrate_range() > >> > >> 8 to 11 needs more investigation / thinking about. I don't see a good > >> way around these tbh. I think we have to operate under the assumption > >> that the fuse server running is malicious or benevolently but > >> incorrectly written and could possibly never complete writeback. So we > >> definitely can't wait on these but it also doesn't seem like we can > >> skip waiting on these, especially for the case where the server uses > >> spliced pages, nor does it seem like we can just fail these with > >> -EBUSY or something. > > I see some code paths with -EAGAIN in migration. Could you explain why > we can't just fail migration for fuse write-back pages? > My understanding (and please correct me here Shakeel if I'm wrong) is that this could block system optimizations, especially since if an unprivileged malicious fuse server never replies to the writeback request, then this completely stalls progress. In the best case scenario, -EAGAIN could be used because the server might just be slow in serving the writeback, but I think we need to also account for servers that never complete the writeback. For __alloc_contig_migrate_range() for example, my understanding is that this is used to migrate pages so that there are more physically contiguous ranges of memory freed up. If fuse writeback blocks that, then that hurts system health overall. > >> > > > > I'm still not seeing a good way around this. > > > > What about this then? We add a new fuse sysctl called something like > > "/proc/sys/fs/fuse/writeback_optimization_timeout" where if the sys > > admin sets this, then it opts into optimizing writeback to be as fast > > as possible (eg skipping the page copies) and if the server doesn't > > fulfill the writeback by the set timeout value, then the connection is > > aborted. > > > > Alternatively, we could also repurpose > > /proc/sys/fs/fuse/max_request_timeout from the request timeout > > patchset [1] but I like the additional flexibility and explicitness > > having the "writeback_optimization_timeout" sysctl gives. > > > > Any thoughts on this? > > > I'm a bit worried that we might lock up the system until time out is > reached - not ideal. Especially as timeouts are in minutes now. But > even a slightly stuttering video system not be great. I think we > should give users/admin the choice then, if they prefer slow page > copies or fast, but possibly shortly unresponsive system. > I was thinking the /proc/sys/fs/fuse/writeback_optimization_timeout would be in seconds, where the sys admin would probably set something more reasonable like 5 seconds or so. If this syctl value is set, then servers who want writebacks to be fast can opt into it at mount time (and by doing so agree that they will service writeback requests by the timeout or their connection will be aborted). Thanks, Joanne > > Thank, > Bernd
On 10/30/24 17:04, Joanne Koong wrote: > On Wed, Oct 30, 2024 at 2:32 AM Bernd Schubert > <bernd.schubert@fastmail.fm> wrote: >> >> On 10/28/24 22:58, Joanne Koong wrote: >>> On Fri, Oct 25, 2024 at 3:40 PM Joanne Koong <joannelkoong@gmail.com> wrote: >>>> >>>>> Same here, I need to look some more into the compaction / page >>>>> migration paths. I'm planning to do this early next week and will >>>>> report back with what I find. >>>>> >>>> >>>> These are my notes so far: >>>> >>>> * We hit the folio_wait_writeback() path when callers call >>>> migrate_pages() with mode MIGRATE_SYNC >>>> ... -> migrate_pages() -> migrate_pages_sync() -> >>>> migrate_pages_batch() -> migrate_folio_unmap() -> >>>> folio_wait_writeback() >>>> >>>> * These are the places where we call migrate_pages(): >>>> 1) demote_folio_list() >>>> Can ignore this. It calls migrate_pages() in MIGRATE_ASYNC mode >>>> >>>> 2) __damon_pa_migrate_folio_list() >>>> Can ignore this. It calls migrate_pages() in MIGRATE_ASYNC mode >>>> >>>> 3) migrate_misplaced_folio() >>>> Can ignore this. It calls migrate_pages() in MIGRATE_ASYNC mode >>>> >>>> 4) do_move_pages_to_node() >>>> Can ignore this. This calls migrate_pages() in MIGRATE_SYNC mode but >>>> this path is only invoked by the move_pages() syscall. It's fine to >>>> wait on writeback for the move_pages() syscall since the user would >>>> have to deliberately invoke this on the fuse server for this to apply >>>> to the server's fuse folios >>>> >>>> 5) migrate_to_node() >>>> Can ignore this for the same reason as in 4. This path is only invoked >>>> by the migrate_pages() syscall. >>>> >>>> 6) do_mbind() >>>> Can ignore this for the same reason as 4 and 5. This path is only >>>> invoked by the mbind() syscall. >>>> >>>> 7) soft_offline_in_use_page() >>>> Can skip soft offlining fuse folios (eg folios with the >>>> AS_NO_WRITEBACK_WAIT mapping flag set). >>>> The path for this is soft_offline_page() -> soft_offline_in_use_page() >>>> -> migrate_pages(). soft_offline_page() only invokes this for in-use >>>> pages in a well-defined state (see ret value of get_hwpoison_page()). >>>> My understanding of soft offlining pages is that it's a mitigation >>>> strategy for handling pages that are experiencing errors but are not >>>> yet completely unusable, and its main purpose is to prevent future >>>> issues. It seems fine to skip this for fuse folios. >>>> >>>> 8) do_migrate_range() >>>> 9) compact_zone() >>>> 10) migrate_longterm_unpinnable_folios() >>>> 11) __alloc_contig_migrate_range() >>>> >>>> 8 to 11 needs more investigation / thinking about. I don't see a good >>>> way around these tbh. I think we have to operate under the assumption >>>> that the fuse server running is malicious or benevolently but >>>> incorrectly written and could possibly never complete writeback. So we >>>> definitely can't wait on these but it also doesn't seem like we can >>>> skip waiting on these, especially for the case where the server uses >>>> spliced pages, nor does it seem like we can just fail these with >>>> -EBUSY or something. >> >> I see some code paths with -EAGAIN in migration. Could you explain why >> we can't just fail migration for fuse write-back pages? >> Hi Joanne, thanks a lot for your quick reply (especially as my reviews come in very late). > > My understanding (and please correct me here Shakeel if I'm wrong) is > that this could block system optimizations, especially since if an > unprivileged malicious fuse server never replies to the writeback > request, then this completely stalls progress. In the best case > scenario, -EAGAIN could be used because the server might just be slow > in serving the writeback, but I think we need to also account for > servers that never complete the writeback. For > __alloc_contig_migrate_range() for example, my understanding is that > this is used to migrate pages so that there are more physically > contiguous ranges of memory freed up. If fuse writeback blocks that, > then that hurts system health overall. Hmm, I wonder what is worse - tmp page copies or missing compaction. Especially if we expect a low range of in-writeback pages/folios. One could argue that an evil user might spawn many fuse server processes to work around the default low fuse write-back limits, but does that make any difference with tmp pages? And these cannot be compacted either? And with timeouts that would be so far totally uncritical, I think. You also mentioned > especially for the case where the server uses spliced pages could you provide more details for that? > >>>> >>> >>> I'm still not seeing a good way around this. >>> >>> What about this then? We add a new fuse sysctl called something like >>> "/proc/sys/fs/fuse/writeback_optimization_timeout" where if the sys >>> admin sets this, then it opts into optimizing writeback to be as fast >>> as possible (eg skipping the page copies) and if the server doesn't >>> fulfill the writeback by the set timeout value, then the connection is >>> aborted. >>> >>> Alternatively, we could also repurpose >>> /proc/sys/fs/fuse/max_request_timeout from the request timeout >>> patchset [1] but I like the additional flexibility and explicitness >>> having the "writeback_optimization_timeout" sysctl gives. >>> >>> Any thoughts on this? >> >> >> I'm a bit worried that we might lock up the system until time out is >> reached - not ideal. Especially as timeouts are in minutes now. But >> even a slightly stuttering video system not be great. I think we >> should give users/admin the choice then, if they prefer slow page >> copies or fast, but possibly shortly unresponsive system. >> > I was thinking the /proc/sys/fs/fuse/writeback_optimization_timeout > would be in seconds, where the sys admin would probably set something > more reasonable like 5 seconds or so. > If this syctl value is set, then servers who want writebacks to be > fast can opt into it at mount time (and by doing so agree that they > will service writeback requests by the timeout or their connection > will be aborted). I think your current patch set has it in minutes? (Should be easy enough to change that.) Though I'm more worried about the impact of _frequent_ timeout scanning through the different fuse lists on performance, than about missing compaction for folios that are currently in write-back. Thanks, Bernd
On Wed, Oct 30, 2024 at 9:21 AM Bernd Schubert <bernd.schubert@fastmail.fm> wrote: > > On 10/30/24 17:04, Joanne Koong wrote: > > On Wed, Oct 30, 2024 at 2:32 AM Bernd Schubert > > <bernd.schubert@fastmail.fm> wrote: > >> > >> On 10/28/24 22:58, Joanne Koong wrote: > >>> On Fri, Oct 25, 2024 at 3:40 PM Joanne Koong <joannelkoong@gmail.com> wrote: > >>>> > >>>>> Same here, I need to look some more into the compaction / page > >>>>> migration paths. I'm planning to do this early next week and will > >>>>> report back with what I find. > >>>>> > >>>> > >>>> These are my notes so far: > >>>> > >>>> * We hit the folio_wait_writeback() path when callers call > >>>> migrate_pages() with mode MIGRATE_SYNC > >>>> ... -> migrate_pages() -> migrate_pages_sync() -> > >>>> migrate_pages_batch() -> migrate_folio_unmap() -> > >>>> folio_wait_writeback() > >>>> > >>>> * These are the places where we call migrate_pages(): > >>>> 1) demote_folio_list() > >>>> Can ignore this. It calls migrate_pages() in MIGRATE_ASYNC mode > >>>> > >>>> 2) __damon_pa_migrate_folio_list() > >>>> Can ignore this. It calls migrate_pages() in MIGRATE_ASYNC mode > >>>> > >>>> 3) migrate_misplaced_folio() > >>>> Can ignore this. It calls migrate_pages() in MIGRATE_ASYNC mode > >>>> > >>>> 4) do_move_pages_to_node() > >>>> Can ignore this. This calls migrate_pages() in MIGRATE_SYNC mode but > >>>> this path is only invoked by the move_pages() syscall. It's fine to > >>>> wait on writeback for the move_pages() syscall since the user would > >>>> have to deliberately invoke this on the fuse server for this to apply > >>>> to the server's fuse folios > >>>> > >>>> 5) migrate_to_node() > >>>> Can ignore this for the same reason as in 4. This path is only invoked > >>>> by the migrate_pages() syscall. > >>>> > >>>> 6) do_mbind() > >>>> Can ignore this for the same reason as 4 and 5. This path is only > >>>> invoked by the mbind() syscall. > >>>> > >>>> 7) soft_offline_in_use_page() > >>>> Can skip soft offlining fuse folios (eg folios with the > >>>> AS_NO_WRITEBACK_WAIT mapping flag set). > >>>> The path for this is soft_offline_page() -> soft_offline_in_use_page() > >>>> -> migrate_pages(). soft_offline_page() only invokes this for in-use > >>>> pages in a well-defined state (see ret value of get_hwpoison_page()). > >>>> My understanding of soft offlining pages is that it's a mitigation > >>>> strategy for handling pages that are experiencing errors but are not > >>>> yet completely unusable, and its main purpose is to prevent future > >>>> issues. It seems fine to skip this for fuse folios. > >>>> > >>>> 8) do_migrate_range() > >>>> 9) compact_zone() > >>>> 10) migrate_longterm_unpinnable_folios() > >>>> 11) __alloc_contig_migrate_range() > >>>> > >>>> 8 to 11 needs more investigation / thinking about. I don't see a good > >>>> way around these tbh. I think we have to operate under the assumption > >>>> that the fuse server running is malicious or benevolently but > >>>> incorrectly written and could possibly never complete writeback. So we > >>>> definitely can't wait on these but it also doesn't seem like we can > >>>> skip waiting on these, especially for the case where the server uses > >>>> spliced pages, nor does it seem like we can just fail these with > >>>> -EBUSY or something. > >> > >> I see some code paths with -EAGAIN in migration. Could you explain why > >> we can't just fail migration for fuse write-back pages? > >> > > Hi Joanne, > > thanks a lot for your quick reply (especially as my reviews come in very > late). > Thanks for your comments/reviews, Bernd! I always appreciate them. > > > > My understanding (and please correct me here Shakeel if I'm wrong) is > > that this could block system optimizations, especially since if an > > unprivileged malicious fuse server never replies to the writeback > > request, then this completely stalls progress. In the best case > > scenario, -EAGAIN could be used because the server might just be slow > > in serving the writeback, but I think we need to also account for > > servers that never complete the writeback. For > > __alloc_contig_migrate_range() for example, my understanding is that > > this is used to migrate pages so that there are more physically > > contiguous ranges of memory freed up. If fuse writeback blocks that, > > then that hurts system health overall. > > Hmm, I wonder what is worse - tmp page copies or missing compaction. > Especially if we expect a low range of in-writeback pages/folios. > One could argue that an evil user might spawn many fuse server > processes to work around the default low fuse write-back limits, but > does that make any difference with tmp pages? And these cannot be > compacted either? My understanding (and Shakeel please jump in here if this isn't right) is that tmp pages can be migrated/compacted. I think it's only pages marked as under writeback that are considered to be non-movable. > > And with timeouts that would be so far totally uncritical, I > think. > > > You also mentioned > > > especially for the case where the server uses spliced pages > > could you provide more details for that? > For the page migration / compaction paths, I don't think we can do the workaround we could do for sync where we skip waiting on writeback for fuse folios and continue on with the operation, because the migration / compaction paths operate on the pages. For the splice case, we assign the page to the pipebuffer (fuse_ref_page()), so if the migration/compaction happens on the page before the server has read this page from the pipebuffer, it'll be incorrect data or maybe crash the kernel. > > > > > >>>> > >>> > >>> I'm still not seeing a good way around this. > >>> > >>> What about this then? We add a new fuse sysctl called something like > >>> "/proc/sys/fs/fuse/writeback_optimization_timeout" where if the sys > >>> admin sets this, then it opts into optimizing writeback to be as fast > >>> as possible (eg skipping the page copies) and if the server doesn't > >>> fulfill the writeback by the set timeout value, then the connection is > >>> aborted. > >>> > >>> Alternatively, we could also repurpose > >>> /proc/sys/fs/fuse/max_request_timeout from the request timeout > >>> patchset [1] but I like the additional flexibility and explicitness > >>> having the "writeback_optimization_timeout" sysctl gives. > >>> > >>> Any thoughts on this? > >> > >> > >> I'm a bit worried that we might lock up the system until time out is > >> reached - not ideal. Especially as timeouts are in minutes now. But > >> even a slightly stuttering video system not be great. I think we > >> should give users/admin the choice then, if they prefer slow page > >> copies or fast, but possibly shortly unresponsive system. > >> > > I was thinking the /proc/sys/fs/fuse/writeback_optimization_timeout > > would be in seconds, where the sys admin would probably set something > > more reasonable like 5 seconds or so. > > If this syctl value is set, then servers who want writebacks to be > > fast can opt into it at mount time (and by doing so agree that they > > will service writeback requests by the timeout or their connection > > will be aborted). > > > I think your current patch set has it in minutes? (Should be easy > enough to change that.) Though I'm more worried about the impact > of _frequent_ timeout scanning through the different fuse lists > on performance, than about missing compaction for folios that are > currently in write-back. > Ah, for this the " /proc/sys/fs/fuse/writeback_optimization_timeout" would be a separate thing from the "/proc/sys/fs/fuse/max_request_timeout". The "/proc/sys/fs/fuse/writeback_optimization_timeout" would only apply for writeback requests. I was thinking implementation-wise, for writebacks we could just have a timer associated with each request (instead of having to grab locks with the fuse lists), since they won't be super common. Thanks, Joanne > > Thanks, > Bernd
On 10/30/24 18:02, Joanne Koong wrote: > On Wed, Oct 30, 2024 at 9:21 AM Bernd Schubert > <bernd.schubert@fastmail.fm> wrote: >> >> On 10/30/24 17:04, Joanne Koong wrote: >>> On Wed, Oct 30, 2024 at 2:32 AM Bernd Schubert >>> <bernd.schubert@fastmail.fm> wrote: >>>> >>>> On 10/28/24 22:58, Joanne Koong wrote: >>>>> On Fri, Oct 25, 2024 at 3:40 PM Joanne Koong <joannelkoong@gmail.com> wrote: >>>>>> >>>>>>> Same here, I need to look some more into the compaction / page >>>>>>> migration paths. I'm planning to do this early next week and will >>>>>>> report back with what I find. >>>>>>> >>>>>> >>>>>> These are my notes so far: >>>>>> >>>>>> * We hit the folio_wait_writeback() path when callers call >>>>>> migrate_pages() with mode MIGRATE_SYNC >>>>>> ... -> migrate_pages() -> migrate_pages_sync() -> >>>>>> migrate_pages_batch() -> migrate_folio_unmap() -> >>>>>> folio_wait_writeback() >>>>>> >>>>>> * These are the places where we call migrate_pages(): >>>>>> 1) demote_folio_list() >>>>>> Can ignore this. It calls migrate_pages() in MIGRATE_ASYNC mode >>>>>> >>>>>> 2) __damon_pa_migrate_folio_list() >>>>>> Can ignore this. It calls migrate_pages() in MIGRATE_ASYNC mode >>>>>> >>>>>> 3) migrate_misplaced_folio() >>>>>> Can ignore this. It calls migrate_pages() in MIGRATE_ASYNC mode >>>>>> >>>>>> 4) do_move_pages_to_node() >>>>>> Can ignore this. This calls migrate_pages() in MIGRATE_SYNC mode but >>>>>> this path is only invoked by the move_pages() syscall. It's fine to >>>>>> wait on writeback for the move_pages() syscall since the user would >>>>>> have to deliberately invoke this on the fuse server for this to apply >>>>>> to the server's fuse folios >>>>>> >>>>>> 5) migrate_to_node() >>>>>> Can ignore this for the same reason as in 4. This path is only invoked >>>>>> by the migrate_pages() syscall. >>>>>> >>>>>> 6) do_mbind() >>>>>> Can ignore this for the same reason as 4 and 5. This path is only >>>>>> invoked by the mbind() syscall. >>>>>> >>>>>> 7) soft_offline_in_use_page() >>>>>> Can skip soft offlining fuse folios (eg folios with the >>>>>> AS_NO_WRITEBACK_WAIT mapping flag set). >>>>>> The path for this is soft_offline_page() -> soft_offline_in_use_page() >>>>>> -> migrate_pages(). soft_offline_page() only invokes this for in-use >>>>>> pages in a well-defined state (see ret value of get_hwpoison_page()). >>>>>> My understanding of soft offlining pages is that it's a mitigation >>>>>> strategy for handling pages that are experiencing errors but are not >>>>>> yet completely unusable, and its main purpose is to prevent future >>>>>> issues. It seems fine to skip this for fuse folios. >>>>>> >>>>>> 8) do_migrate_range() >>>>>> 9) compact_zone() >>>>>> 10) migrate_longterm_unpinnable_folios() >>>>>> 11) __alloc_contig_migrate_range() >>>>>> >>>>>> 8 to 11 needs more investigation / thinking about. I don't see a good >>>>>> way around these tbh. I think we have to operate under the assumption >>>>>> that the fuse server running is malicious or benevolently but >>>>>> incorrectly written and could possibly never complete writeback. So we >>>>>> definitely can't wait on these but it also doesn't seem like we can >>>>>> skip waiting on these, especially for the case where the server uses >>>>>> spliced pages, nor does it seem like we can just fail these with >>>>>> -EBUSY or something. >>>> >>>> I see some code paths with -EAGAIN in migration. Could you explain why >>>> we can't just fail migration for fuse write-back pages? >>>> >> >> Hi Joanne, >> >> thanks a lot for your quick reply (especially as my reviews come in very >> late). >> > > Thanks for your comments/reviews, Bernd! I always appreciate them. > >>> >>> My understanding (and please correct me here Shakeel if I'm wrong) is >>> that this could block system optimizations, especially since if an >>> unprivileged malicious fuse server never replies to the writeback >>> request, then this completely stalls progress. In the best case >>> scenario, -EAGAIN could be used because the server might just be slow >>> in serving the writeback, but I think we need to also account for >>> servers that never complete the writeback. For >>> __alloc_contig_migrate_range() for example, my understanding is that >>> this is used to migrate pages so that there are more physically >>> contiguous ranges of memory freed up. If fuse writeback blocks that, >>> then that hurts system health overall. >> >> Hmm, I wonder what is worse - tmp page copies or missing compaction. >> Especially if we expect a low range of in-writeback pages/folios. >> One could argue that an evil user might spawn many fuse server >> processes to work around the default low fuse write-back limits, but >> does that make any difference with tmp pages? And these cannot be >> compacted either? > > My understanding (and Shakeel please jump in here if this isn't right) > is that tmp pages can be migrated/compacted. I think it's only pages > marked as under writeback that are considered to be non-movable. > >> >> And with timeouts that would be so far totally uncritical, I >> think. >> >> >> You also mentioned >> >>> especially for the case where the server uses spliced pages >> >> could you provide more details for that? >> 7> > For the page migration / compaction paths, I don't think we can do the > workaround we could do for sync where we skip waiting on writeback for > fuse folios and continue on with the operation, because the migration > / compaction paths operate on the pages. For the splice case, we > assign the page to the pipebuffer (fuse_ref_page()), so if the > migration/compaction happens on the page before the server has read > this page from the pipebuffer, it'll be incorrect data or maybe crash > the kernel. > >> >> >>> >>>>>> >>>>> >>>>> I'm still not seeing a good way around this. >>>>> >>>>> What about this then? We add a new fuse sysctl called something like >>>>> "/proc/sys/fs/fuse/writeback_optimization_timeout" where if the sys >>>>> admin sets this, then it opts into optimizing writeback to be as fast >>>>> as possible (eg skipping the page copies) and if the server doesn't >>>>> fulfill the writeback by the set timeout value, then the connection is >>>>> aborted. >>>>> >>>>> Alternatively, we could also repurpose >>>>> /proc/sys/fs/fuse/max_request_timeout from the request timeout >>>>> patchset [1] but I like the additional flexibility and explicitness >>>>> having the "writeback_optimization_timeout" sysctl gives. >>>>> >>>>> Any thoughts on this? >>>> >>>> >>>> I'm a bit worried that we might lock up the system until time out is >>>> reached - not ideal. Especially as timeouts are in minutes now. But >>>> even a slightly stuttering video system not be great. I think we >>>> should give users/admin the choice then, if they prefer slow page >>>> copies or fast, but possibly shortly unresponsive system. >>>> >>> I was thinking the /proc/sys/fs/fuse/writeback_optimization_timeout >>> would be in seconds, where the sys admin would probably set something >>> more reasonable like 5 seconds or so. >>> If this syctl value is set, then servers who want writebacks to be >>> fast can opt into it at mount time (and by doing so agree that they >>> will service writeback requests by the timeout or their connection >>> will be aborted). >> >> >> I think your current patch set has it in minutes? (Should be easy >> enough to change that.) Though I'm more worried about the impact >> of _frequent_ timeout scanning through the different fuse lists >> on performance, than about missing compaction for folios that are >> currently in write-back. Hmm, if tmp pages can be compacted, isn't that a problem for splice? I.e. I don't understand what the difference between tmp page and write-back page for migration. >> > > Ah, for this the " /proc/sys/fs/fuse/writeback_optimization_timeout" > would be a separate thing from the > "/proc/sys/fs/fuse/max_request_timeout". The > "/proc/sys/fs/fuse/writeback_optimization_timeout" would only apply > for writeback requests. I was thinking implementation-wise, for > writebacks we could just have a timer associated with each request > (instead of having to grab locks with the fuse lists), since they > won't be super common. Ah, thank you! I had missed that this is another variable. Issue with too short timeouts would probably be network hick-up that would immediately kill fuse-server. I.e. if it just the missing page compaction/migration, maybe larger time outs would be acceptable. Thanks, Bernd
On Wed, Oct 30, 2024 at 10:27 AM Bernd Schubert <bernd.schubert@fastmail.fm> wrote: > > > > On 10/30/24 18:02, Joanne Koong wrote: > > On Wed, Oct 30, 2024 at 9:21 AM Bernd Schubert > > <bernd.schubert@fastmail.fm> wrote: > >> > >> On 10/30/24 17:04, Joanne Koong wrote: > >>> On Wed, Oct 30, 2024 at 2:32 AM Bernd Schubert > >>> <bernd.schubert@fastmail.fm> wrote: > >>>> > >>>> On 10/28/24 22:58, Joanne Koong wrote: > >>>>> On Fri, Oct 25, 2024 at 3:40 PM Joanne Koong <joannelkoong@gmail.com> wrote: > >>>>>> > >>>>>>> Same here, I need to look some more into the compaction / page > >>>>>>> migration paths. I'm planning to do this early next week and will > >>>>>>> report back with what I find. > >>>>>>> > >>>>>> > >>>>>> These are my notes so far: > >>>>>> > >>>>>> * We hit the folio_wait_writeback() path when callers call > >>>>>> migrate_pages() with mode MIGRATE_SYNC > >>>>>> ... -> migrate_pages() -> migrate_pages_sync() -> > >>>>>> migrate_pages_batch() -> migrate_folio_unmap() -> > >>>>>> folio_wait_writeback() > >>>>>> > >>>>>> * These are the places where we call migrate_pages(): > >>>>>> 1) demote_folio_list() > >>>>>> Can ignore this. It calls migrate_pages() in MIGRATE_ASYNC mode > >>>>>> > >>>>>> 2) __damon_pa_migrate_folio_list() > >>>>>> Can ignore this. It calls migrate_pages() in MIGRATE_ASYNC mode > >>>>>> > >>>>>> 3) migrate_misplaced_folio() > >>>>>> Can ignore this. It calls migrate_pages() in MIGRATE_ASYNC mode > >>>>>> > >>>>>> 4) do_move_pages_to_node() > >>>>>> Can ignore this. This calls migrate_pages() in MIGRATE_SYNC mode but > >>>>>> this path is only invoked by the move_pages() syscall. It's fine to > >>>>>> wait on writeback for the move_pages() syscall since the user would > >>>>>> have to deliberately invoke this on the fuse server for this to apply > >>>>>> to the server's fuse folios > >>>>>> > >>>>>> 5) migrate_to_node() > >>>>>> Can ignore this for the same reason as in 4. This path is only invoked > >>>>>> by the migrate_pages() syscall. > >>>>>> > >>>>>> 6) do_mbind() > >>>>>> Can ignore this for the same reason as 4 and 5. This path is only > >>>>>> invoked by the mbind() syscall. > >>>>>> > >>>>>> 7) soft_offline_in_use_page() > >>>>>> Can skip soft offlining fuse folios (eg folios with the > >>>>>> AS_NO_WRITEBACK_WAIT mapping flag set). > >>>>>> The path for this is soft_offline_page() -> soft_offline_in_use_page() > >>>>>> -> migrate_pages(). soft_offline_page() only invokes this for in-use > >>>>>> pages in a well-defined state (see ret value of get_hwpoison_page()). > >>>>>> My understanding of soft offlining pages is that it's a mitigation > >>>>>> strategy for handling pages that are experiencing errors but are not > >>>>>> yet completely unusable, and its main purpose is to prevent future > >>>>>> issues. It seems fine to skip this for fuse folios. > >>>>>> > >>>>>> 8) do_migrate_range() > >>>>>> 9) compact_zone() > >>>>>> 10) migrate_longterm_unpinnable_folios() > >>>>>> 11) __alloc_contig_migrate_range() > >>>>>> > >>>>>> 8 to 11 needs more investigation / thinking about. I don't see a good > >>>>>> way around these tbh. I think we have to operate under the assumption > >>>>>> that the fuse server running is malicious or benevolently but > >>>>>> incorrectly written and could possibly never complete writeback. So we > >>>>>> definitely can't wait on these but it also doesn't seem like we can > >>>>>> skip waiting on these, especially for the case where the server uses > >>>>>> spliced pages, nor does it seem like we can just fail these with > >>>>>> -EBUSY or something. > >>>> > >>>> I see some code paths with -EAGAIN in migration. Could you explain why > >>>> we can't just fail migration for fuse write-back pages? > >>>> > >> > >> Hi Joanne, > >> > >> thanks a lot for your quick reply (especially as my reviews come in very > >> late). > >> > > > > Thanks for your comments/reviews, Bernd! I always appreciate them. > > > >>> > >>> My understanding (and please correct me here Shakeel if I'm wrong) is > >>> that this could block system optimizations, especially since if an > >>> unprivileged malicious fuse server never replies to the writeback > >>> request, then this completely stalls progress. In the best case > >>> scenario, -EAGAIN could be used because the server might just be slow > >>> in serving the writeback, but I think we need to also account for > >>> servers that never complete the writeback. For > >>> __alloc_contig_migrate_range() for example, my understanding is that > >>> this is used to migrate pages so that there are more physically > >>> contiguous ranges of memory freed up. If fuse writeback blocks that, > >>> then that hurts system health overall. > >> > >> Hmm, I wonder what is worse - tmp page copies or missing compaction. > >> Especially if we expect a low range of in-writeback pages/folios. > >> One could argue that an evil user might spawn many fuse server > >> processes to work around the default low fuse write-back limits, but > >> does that make any difference with tmp pages? And these cannot be > >> compacted either? > > > > My understanding (and Shakeel please jump in here if this isn't right) > > is that tmp pages can be migrated/compacted. I think it's only pages > > marked as under writeback that are considered to be non-movable. > > > >> > >> And with timeouts that would be so far totally uncritical, I > >> think. > >> > >> > >> You also mentioned > >> > >>> especially for the case where the server uses spliced pages > >> > >> could you provide more details for that? > >> > 7> > > For the page migration / compaction paths, I don't think we can do the > > workaround we could do for sync where we skip waiting on writeback for > > fuse folios and continue on with the operation, because the migration > > / compaction paths operate on the pages. For the splice case, we > > assign the page to the pipebuffer (fuse_ref_page()), so if the > > migration/compaction happens on the page before the server has read > > this page from the pipebuffer, it'll be incorrect data or maybe crash > > the kernel. > > > >> > >> > >>> > >>>>>> > >>>>> > >>>>> I'm still not seeing a good way around this. > >>>>> > >>>>> What about this then? We add a new fuse sysctl called something like > >>>>> "/proc/sys/fs/fuse/writeback_optimization_timeout" where if the sys > >>>>> admin sets this, then it opts into optimizing writeback to be as fast > >>>>> as possible (eg skipping the page copies) and if the server doesn't > >>>>> fulfill the writeback by the set timeout value, then the connection is > >>>>> aborted. > >>>>> > >>>>> Alternatively, we could also repurpose > >>>>> /proc/sys/fs/fuse/max_request_timeout from the request timeout > >>>>> patchset [1] but I like the additional flexibility and explicitness > >>>>> having the "writeback_optimization_timeout" sysctl gives. > >>>>> > >>>>> Any thoughts on this? > >>>> > >>>> > >>>> I'm a bit worried that we might lock up the system until time out is > >>>> reached - not ideal. Especially as timeouts are in minutes now. But > >>>> even a slightly stuttering video system not be great. I think we > >>>> should give users/admin the choice then, if they prefer slow page > >>>> copies or fast, but possibly shortly unresponsive system. > >>>> > >>> I was thinking the /proc/sys/fs/fuse/writeback_optimization_timeout > >>> would be in seconds, where the sys admin would probably set something > >>> more reasonable like 5 seconds or so. > >>> If this syctl value is set, then servers who want writebacks to be > >>> fast can opt into it at mount time (and by doing so agree that they > >>> will service writeback requests by the timeout or their connection > >>> will be aborted). > >> > >> > >> I think your current patch set has it in minutes? (Should be easy > >> enough to change that.) Though I'm more worried about the impact > >> of _frequent_ timeout scanning through the different fuse lists > >> on performance, than about missing compaction for folios that are > >> currently in write-back. > > Hmm, if tmp pages can be compacted, isn't that a problem for splice? > I.e. I don't understand what the difference between tmp page and > write-back page for migration. > That's a great question! I have no idea how compaction works for pages being used in splice. Shakeel, do you know the answer to this? Thanks, Joanne > > >> > > > > Ah, for this the " /proc/sys/fs/fuse/writeback_optimization_timeout" > > would be a separate thing from the > > "/proc/sys/fs/fuse/max_request_timeout". The > > "/proc/sys/fs/fuse/writeback_optimization_timeout" would only apply > > for writeback requests. I was thinking implementation-wise, for > > writebacks we could just have a timer associated with each request > > (instead of having to grab locks with the fuse lists), since they > > won't be super common. > > Ah, thank you! I had missed that this is another variable. Issue > with too short timeouts would probably be network hick-up that > would immediately kill fuse-server. I.e. if it just the missing > page compaction/migration, maybe larger time outs would be > acceptable. > > > Thanks, > Bernd
On Wed, Oct 30, 2024 at 10:35:47AM GMT, Joanne Koong wrote: > On Wed, Oct 30, 2024 at 10:27 AM Bernd Schubert > <bernd.schubert@fastmail.fm> wrote: > > > > > > Hmm, if tmp pages can be compacted, isn't that a problem for splice? > > I.e. I don't understand what the difference between tmp page and > > write-back page for migration. > > > > That's a great question! I have no idea how compaction works for pages > being used in splice. Shakeel, do you know the answer to this? > Sorry for the late response. I still have to go through other unanswered questions but let me answer this one quickly. From the way the tmp pages are allocated, it does not seem like they are movable and thus are not target for migration/compaction. The page with the writeback bit set is actually just a user memory page cache which is moveable but due to, at the moment, under writeback it temporarily becomes unmovable to not cause corruption. Shakeel
On 10/30/24 22:56, Shakeel Butt wrote: > On Wed, Oct 30, 2024 at 10:35:47AM GMT, Joanne Koong wrote: >> On Wed, Oct 30, 2024 at 10:27 AM Bernd Schubert >> <bernd.schubert@fastmail.fm> wrote: >>> >>> >>> Hmm, if tmp pages can be compacted, isn't that a problem for splice? >>> I.e. I don't understand what the difference between tmp page and >>> write-back page for migration. >>> >> >> That's a great question! I have no idea how compaction works for pages >> being used in splice. Shakeel, do you know the answer to this? >> > > Sorry for the late response. I still have to go through other unanswered > questions but let me answer this one quickly. From the way the tmp pages > are allocated, it does not seem like they are movable and thus are not > target for migration/compaction. > > The page with the writeback bit set is actually just a user memory page > cache which is moveable but due to, at the moment, under writeback it > temporarily becomes unmovable to not cause corruption. Thanks a lot for your quick reply Shakeel! (Actually very fast!). With that, it confirms what I wrote earlier - removing tmp and ignoring fuse writeback pages in migration should not make any difference regarding overall system performance. Unless I miss something, more on the contrary as additional memory pressure expensive page copying is being removed. Thanks, Bernd
On Wed, Oct 30, 2024 at 3:17 PM Bernd Schubert <bernd.schubert@fastmail.fm> wrote: > > > > On 10/30/24 22:56, Shakeel Butt wrote: > > On Wed, Oct 30, 2024 at 10:35:47AM GMT, Joanne Koong wrote: > >> On Wed, Oct 30, 2024 at 10:27 AM Bernd Schubert > >> <bernd.schubert@fastmail.fm> wrote: > >>> > >>> > >>> Hmm, if tmp pages can be compacted, isn't that a problem for splice? > >>> I.e. I don't understand what the difference between tmp page and > >>> write-back page for migration. > >>> > >> > >> That's a great question! I have no idea how compaction works for pages > >> being used in splice. Shakeel, do you know the answer to this? > >> > > > > Sorry for the late response. I still have to go through other unanswered > > questions but let me answer this one quickly. From the way the tmp pages > > are allocated, it does not seem like they are movable and thus are not > > target for migration/compaction. > > > > The page with the writeback bit set is actually just a user memory page > > cache which is moveable but due to, at the moment, under writeback it > > temporarily becomes unmovable to not cause corruption. > > Thanks a lot for your quick reply Shakeel! (Actually very fast!). > > With that, it confirms what I wrote earlier - removing tmp and ignoring > fuse writeback pages in migration should not make any difference > regarding overall system performance. Unless I miss something, > more on the contrary as additional memory pressure expensive page > copying is being removed. > Thanks for the information Shakeel, and thanks Bernd for bringing up this point of discussion. Before I celebrate too prematurely, a few additional questions: Are tmp pages (eg from folio_alloc(GFP_NOFS | __GFP_HIGHMEM, 0)) and page cache pages allocated from the same memory pool? Or are tmp pages allocated from a special memory pool that isn't meant to be compacted/optimized? If they are allocated from the same memory pool, then it seems like there's no difference between tmp pages blocking a memory range from being compacted vs. a page cache page blocking a memory range from being compacted (by not clearing writeback). But if they are not allocated from the same pool, then it seems like the page cache page blocking migration could adversely affect general system performance in a way that the tmp page doesn't? Thanks, Joanne > > Thanks, > Bernd
On Wed, Oct 30, 2024 at 03:51:08PM GMT, Joanne Koong wrote: > On Wed, Oct 30, 2024 at 3:17 PM Bernd Schubert > <bernd.schubert@fastmail.fm> wrote: > > > > > > > > On 10/30/24 22:56, Shakeel Butt wrote: > > > On Wed, Oct 30, 2024 at 10:35:47AM GMT, Joanne Koong wrote: > > >> On Wed, Oct 30, 2024 at 10:27 AM Bernd Schubert > > >> <bernd.schubert@fastmail.fm> wrote: > > >>> > > >>> > > >>> Hmm, if tmp pages can be compacted, isn't that a problem for splice? > > >>> I.e. I don't understand what the difference between tmp page and > > >>> write-back page for migration. > > >>> > > >> > > >> That's a great question! I have no idea how compaction works for pages > > >> being used in splice. Shakeel, do you know the answer to this? > > >> > > > > > > Sorry for the late response. I still have to go through other unanswered > > > questions but let me answer this one quickly. From the way the tmp pages > > > are allocated, it does not seem like they are movable and thus are not > > > target for migration/compaction. > > > > > > The page with the writeback bit set is actually just a user memory page > > > cache which is moveable but due to, at the moment, under writeback it > > > temporarily becomes unmovable to not cause corruption. > > > > Thanks a lot for your quick reply Shakeel! (Actually very fast!). > > > > With that, it confirms what I wrote earlier - removing tmp and ignoring > > fuse writeback pages in migration should not make any difference > > regarding overall system performance. Unless I miss something, > > more on the contrary as additional memory pressure expensive page > > copying is being removed. > > > > Thanks for the information Shakeel, and thanks Bernd for bringing up > this point of discussion. > > Before I celebrate too prematurely, a few additional questions: You are asking hard questions, so CCed couple more folks to correct me if I am wrong. > > Are tmp pages (eg from folio_alloc(GFP_NOFS | __GFP_HIGHMEM, 0)) and > page cache pages allocated from the same memory pool? Or are tmp pages > allocated from a special memory pool that isn't meant to be > compacted/optimized? Memory pool is a bit confusing term here. Most probably you are asking about the migrate type of the page block from which tmp page is allocated from. In a normal system, tmp page would be allocated from page block with MIGRATE_UNMOVABLE migrate type while the page cache page, it depends on what gfp flag was used for its allocation. What does fuse fs use? GFP_HIGHUSER_MOVABLE or something else? Under low memory situation allocations can get mixed up with different migrate types. > > If they are allocated from the same memory pool, then it seems like > there's no difference between tmp pages blocking a memory range from > being compacted vs. a page cache page blocking a memory range from > being compacted (by not clearing writeback). But if they are not > allocated from the same pool, then it seems like the page cache page > blocking migration could adversely affect general system performance > in a way that the tmp page doesn't? I think irrespective of where the page is coming from, the page under writeback is non-movable and can fragment the memory. The question that is that worse than a tmp page fragmenting the memory, I am not sure.
On Wed, Oct 30, 2024 at 5:30 PM Shakeel Butt <shakeel.butt@linux.dev> wrote: > > On Wed, Oct 30, 2024 at 03:51:08PM GMT, Joanne Koong wrote: > > On Wed, Oct 30, 2024 at 3:17 PM Bernd Schubert > > <bernd.schubert@fastmail.fm> wrote: > > > > > > > > > > > > On 10/30/24 22:56, Shakeel Butt wrote: > > > > On Wed, Oct 30, 2024 at 10:35:47AM GMT, Joanne Koong wrote: > > > >> On Wed, Oct 30, 2024 at 10:27 AM Bernd Schubert > > > >> <bernd.schubert@fastmail.fm> wrote: > > > >>> > > > >>> > > > >>> Hmm, if tmp pages can be compacted, isn't that a problem for splice? > > > >>> I.e. I don't understand what the difference between tmp page and > > > >>> write-back page for migration. > > > >>> > > > >> > > > >> That's a great question! I have no idea how compaction works for pages > > > >> being used in splice. Shakeel, do you know the answer to this? > > > >> > > > > > > > > Sorry for the late response. I still have to go through other unanswered > > > > questions but let me answer this one quickly. From the way the tmp pages > > > > are allocated, it does not seem like they are movable and thus are not > > > > target for migration/compaction. > > > > > > > > The page with the writeback bit set is actually just a user memory page > > > > cache which is moveable but due to, at the moment, under writeback it > > > > temporarily becomes unmovable to not cause corruption. > > > > > > Thanks a lot for your quick reply Shakeel! (Actually very fast!). > > > > > > With that, it confirms what I wrote earlier - removing tmp and ignoring > > > fuse writeback pages in migration should not make any difference > > > regarding overall system performance. Unless I miss something, > > > more on the contrary as additional memory pressure expensive page > > > copying is being removed. > > > > > > > Thanks for the information Shakeel, and thanks Bernd for bringing up > > this point of discussion. > > > > Before I celebrate too prematurely, a few additional questions: > > You are asking hard questions, so CCed couple more folks to correct me > if I am wrong. > > > > > Are tmp pages (eg from folio_alloc(GFP_NOFS | __GFP_HIGHMEM, 0)) and > > page cache pages allocated from the same memory pool? Or are tmp pages > > allocated from a special memory pool that isn't meant to be > > compacted/optimized? > > Memory pool is a bit confusing term here. Most probably you are asking > about the migrate type of the page block from which tmp page is > allocated from. In a normal system, tmp page would be allocated from page > block with MIGRATE_UNMOVABLE migrate type while the page cache page, it > depends on what gfp flag was used for its allocation. What does fuse fs > use? GFP_HIGHUSER_MOVABLE or something else? Under low memory situation > allocations can get mixed up with different migrate types. > I believe it's GFP_HIGHUSER_MOVABLE for the page cache pages since fuse doesn't set any additional gfp masks on the inode mapping. Could we just allocate the fuse writeback pages with GFP_HIGHUSER instead of GFP_HIGHUSER_MOVABLE? That would be in fuse_write_begin() where we pass in the gfp mask to __filemap_get_folio(). I think this would give us the same behavior memory-wise as what the tmp pages currently do, and would solve all our headaches regarding writeback potentially blocking page migration/compaction. Thanks, Joanne > > > > If they are allocated from the same memory pool, then it seems like > > there's no difference between tmp pages blocking a memory range from > > being compacted vs. a page cache page blocking a memory range from > > being compacted (by not clearing writeback). But if they are not > > allocated from the same pool, then it seems like the page cache page > > blocking migration could adversely affect general system performance > > in a way that the tmp page doesn't? > > I think irrespective of where the page is coming from, the page under > writeback is non-movable and can fragment the memory. The question that > is that worse than a tmp page fragmenting the memory, I am not sure. >
On Thu, Oct 31, 2024 at 12:06:49PM GMT, Joanne Koong wrote: > On Wed, Oct 30, 2024 at 5:30 PM Shakeel Butt <shakeel.butt@linux.dev> wrote: [...] > > > > Memory pool is a bit confusing term here. Most probably you are asking > > about the migrate type of the page block from which tmp page is > > allocated from. In a normal system, tmp page would be allocated from page > > block with MIGRATE_UNMOVABLE migrate type while the page cache page, it > > depends on what gfp flag was used for its allocation. What does fuse fs > > use? GFP_HIGHUSER_MOVABLE or something else? Under low memory situation > > allocations can get mixed up with different migrate types. > > > > I believe it's GFP_HIGHUSER_MOVABLE for the page cache pages since > fuse doesn't set any additional gfp masks on the inode mapping. > > Could we just allocate the fuse writeback pages with GFP_HIGHUSER > instead of GFP_HIGHUSER_MOVABLE? That would be in fuse_write_begin() > where we pass in the gfp mask to __filemap_get_folio(). I think this > would give us the same behavior memory-wise as what the tmp pages > currently do, I don't think it would be the same behavior. From what I understand the liftime of the tmp page is from the start of the writeback till the ack from the fuse server that writeback is done. While the lifetime of the page of the page cache can be arbitrarily large. We should just make it unmovable for its lifetime. I think it is fine to make the page unmovable during the writeback. We should not try to optimize for the bad or buggy behavior of fuse server. Regarding the avoidance of wait on writeback for fuse folios, I think we can handle the migration similar to how you are handling reclaim and in addition we can add a WARN() in folio_wait_writeback() if the kernel ever sees a fuse folio in that function.
On Thu, Oct 31, 2024 at 1:06 PM Shakeel Butt <shakeel.butt@linux.dev> wrote: > > On Thu, Oct 31, 2024 at 12:06:49PM GMT, Joanne Koong wrote: > > On Wed, Oct 30, 2024 at 5:30 PM Shakeel Butt <shakeel.butt@linux.dev> wrote: > [...] > > > > > > Memory pool is a bit confusing term here. Most probably you are asking > > > about the migrate type of the page block from which tmp page is > > > allocated from. In a normal system, tmp page would be allocated from page > > > block with MIGRATE_UNMOVABLE migrate type while the page cache page, it > > > depends on what gfp flag was used for its allocation. What does fuse fs > > > use? GFP_HIGHUSER_MOVABLE or something else? Under low memory situation > > > allocations can get mixed up with different migrate types. > > > > > > > I believe it's GFP_HIGHUSER_MOVABLE for the page cache pages since > > fuse doesn't set any additional gfp masks on the inode mapping. > > > > Could we just allocate the fuse writeback pages with GFP_HIGHUSER > > instead of GFP_HIGHUSER_MOVABLE? That would be in fuse_write_begin() > > where we pass in the gfp mask to __filemap_get_folio(). I think this > > would give us the same behavior memory-wise as what the tmp pages > > currently do, > > I don't think it would be the same behavior. From what I understand the > liftime of the tmp page is from the start of the writeback till the ack > from the fuse server that writeback is done. While the lifetime of the > page of the page cache can be arbitrarily large. We should just make it > unmovable for its lifetime. I think it is fine to make the page > unmovable during the writeback. We should not try to optimize for the > bad or buggy behavior of fuse server. > > Regarding the avoidance of wait on writeback for fuse folios, I think we > can handle the migration similar to how you are handling reclaim and in > addition we can add a WARN() in folio_wait_writeback() if the kernel ever > sees a fuse folio in that function. Awesome, this is what I'm planning to do in v3 to address migration then: 1) in migrate_folio_unmap(), only call "folio_wait_writeback(src);" if src->mapping does not have the AS_NO_WRITEBACK_WAIT bit set on it (eg fuse folios will have that AS_NO_WRITEBACK_WAIT bit set) 2) in the fuse filesystem's implementation of the mapping->a_ops->migrate_folio callback, return -EAGAIN if the folio is under writeback. Does this sound good? Thanks, Joanne
On Thu, Oct 31, 2024 at 02:52:57PM GMT, Joanne Koong wrote: > On Thu, Oct 31, 2024 at 1:06 PM Shakeel Butt <shakeel.butt@linux.dev> wrote: > > > > On Thu, Oct 31, 2024 at 12:06:49PM GMT, Joanne Koong wrote: > > > On Wed, Oct 30, 2024 at 5:30 PM Shakeel Butt <shakeel.butt@linux.dev> wrote: > > [...] > > > > > > > > Memory pool is a bit confusing term here. Most probably you are asking > > > > about the migrate type of the page block from which tmp page is > > > > allocated from. In a normal system, tmp page would be allocated from page > > > > block with MIGRATE_UNMOVABLE migrate type while the page cache page, it > > > > depends on what gfp flag was used for its allocation. What does fuse fs > > > > use? GFP_HIGHUSER_MOVABLE or something else? Under low memory situation > > > > allocations can get mixed up with different migrate types. > > > > > > > > > > I believe it's GFP_HIGHUSER_MOVABLE for the page cache pages since > > > fuse doesn't set any additional gfp masks on the inode mapping. > > > > > > Could we just allocate the fuse writeback pages with GFP_HIGHUSER > > > instead of GFP_HIGHUSER_MOVABLE? That would be in fuse_write_begin() > > > where we pass in the gfp mask to __filemap_get_folio(). I think this > > > would give us the same behavior memory-wise as what the tmp pages > > > currently do, > > > > I don't think it would be the same behavior. From what I understand the > > liftime of the tmp page is from the start of the writeback till the ack > > from the fuse server that writeback is done. While the lifetime of the > > page of the page cache can be arbitrarily large. We should just make it > > unmovable for its lifetime. I think it is fine to make the page > > unmovable during the writeback. We should not try to optimize for the > > bad or buggy behavior of fuse server. > > > > Regarding the avoidance of wait on writeback for fuse folios, I think we > > can handle the migration similar to how you are handling reclaim and in > > addition we can add a WARN() in folio_wait_writeback() if the kernel ever > > sees a fuse folio in that function. > > Awesome, this is what I'm planning to do in v3 to address migration then: > > 1) in migrate_folio_unmap(), only call "folio_wait_writeback(src);" if > src->mapping does not have the AS_NO_WRITEBACK_WAIT bit set on it (eg > fuse folios will have that AS_NO_WRITEBACK_WAIT bit set) > > 2) in the fuse filesystem's implementation of the > mapping->a_ops->migrate_folio callback, return -EAGAIN if the folio is > under writeback. 3) Add WARN_ONCE() in folio_wait_writeback() if folio->mapping has AS_NO_WRITEBACK_WAIT set and return without waiting. > > Does this sound good? Yes.
Hi Joanne, Thanks for keeping pushing this forward. On 11/1/24 5:52 AM, Joanne Koong wrote: > On Thu, Oct 31, 2024 at 1:06 PM Shakeel Butt <shakeel.butt@linux.dev> wrote: >> >> On Thu, Oct 31, 2024 at 12:06:49PM GMT, Joanne Koong wrote: >>> On Wed, Oct 30, 2024 at 5:30 PM Shakeel Butt <shakeel.butt@linux.dev> wrote: >> [...] >>>> >>>> Memory pool is a bit confusing term here. Most probably you are asking >>>> about the migrate type of the page block from which tmp page is >>>> allocated from. In a normal system, tmp page would be allocated from page >>>> block with MIGRATE_UNMOVABLE migrate type while the page cache page, it >>>> depends on what gfp flag was used for its allocation. What does fuse fs >>>> use? GFP_HIGHUSER_MOVABLE or something else? Under low memory situation >>>> allocations can get mixed up with different migrate types. >>>> >>> >>> I believe it's GFP_HIGHUSER_MOVABLE for the page cache pages since >>> fuse doesn't set any additional gfp masks on the inode mapping. >>> >>> Could we just allocate the fuse writeback pages with GFP_HIGHUSER >>> instead of GFP_HIGHUSER_MOVABLE? That would be in fuse_write_begin() >>> where we pass in the gfp mask to __filemap_get_folio(). I think this >>> would give us the same behavior memory-wise as what the tmp pages >>> currently do, >> >> I don't think it would be the same behavior. From what I understand the >> liftime of the tmp page is from the start of the writeback till the ack >> from the fuse server that writeback is done. While the lifetime of the >> page of the page cache can be arbitrarily large. We should just make it >> unmovable for its lifetime. I think it is fine to make the page >> unmovable during the writeback. We should not try to optimize for the >> bad or buggy behavior of fuse server. >> >> Regarding the avoidance of wait on writeback for fuse folios, I think we >> can handle the migration similar to how you are handling reclaim and in >> addition we can add a WARN() in folio_wait_writeback() if the kernel ever >> sees a fuse folio in that function. > > Awesome, this is what I'm planning to do in v3 to address migration then: > > 1) in migrate_folio_unmap(), only call "folio_wait_writeback(src);" if > src->mapping does not have the AS_NO_WRITEBACK_WAIT bit set on it (eg > fuse folios will have that AS_NO_WRITEBACK_WAIT bit set) I think it's generally okay to skip FUSE pages under writeback when the sync migrate_pages() is called in low memory context, which only tries to migrate as many pages as possible (i.e. best effort). While more caution may be needed when the sync migrate_pages() is called with an implicit hint that the migration can not fail. For example, ``` offline_pages while { scan_movable_pages do_migrate_range } ``` If the malicious server never completes the writeback IO, no progress will be made in the above while loop, and I'm afraid it will be a dead loop then. > > 2) in the fuse filesystem's implementation of the > mapping->a_ops->migrate_folio callback, return -EAGAIN if the folio is > under writeback. Is there any possibility that a_ops->migrate_folio() may be called with the folio under writeback? - for most pages without AS_NO_WRITEBACK_WAIT, a_ops->migrate_folio() will be called only when Page_writeback is cleared; - for AS_NO_WRITEBACK_WAIT pages, they are skipped if they are under writeback
On Fri, Nov 1, 2024 at 4:44 AM Jingbo Xu <jefflexu@linux.alibaba.com> wrote: > > Hi Joanne, > > Thanks for keeping pushing this forward. > > On 11/1/24 5:52 AM, Joanne Koong wrote: > > On Thu, Oct 31, 2024 at 1:06 PM Shakeel Butt <shakeel.butt@linux.dev> wrote: > >> > >> On Thu, Oct 31, 2024 at 12:06:49PM GMT, Joanne Koong wrote: > >>> On Wed, Oct 30, 2024 at 5:30 PM Shakeel Butt <shakeel.butt@linux.dev> wrote: > >> [...] > >>>> > >>>> Memory pool is a bit confusing term here. Most probably you are asking > >>>> about the migrate type of the page block from which tmp page is > >>>> allocated from. In a normal system, tmp page would be allocated from page > >>>> block with MIGRATE_UNMOVABLE migrate type while the page cache page, it > >>>> depends on what gfp flag was used for its allocation. What does fuse fs > >>>> use? GFP_HIGHUSER_MOVABLE or something else? Under low memory situation > >>>> allocations can get mixed up with different migrate types. > >>>> > >>> > >>> I believe it's GFP_HIGHUSER_MOVABLE for the page cache pages since > >>> fuse doesn't set any additional gfp masks on the inode mapping. > >>> > >>> Could we just allocate the fuse writeback pages with GFP_HIGHUSER > >>> instead of GFP_HIGHUSER_MOVABLE? That would be in fuse_write_begin() > >>> where we pass in the gfp mask to __filemap_get_folio(). I think this > >>> would give us the same behavior memory-wise as what the tmp pages > >>> currently do, > >> > >> I don't think it would be the same behavior. From what I understand the > >> liftime of the tmp page is from the start of the writeback till the ack > >> from the fuse server that writeback is done. While the lifetime of the > >> page of the page cache can be arbitrarily large. We should just make it > >> unmovable for its lifetime. I think it is fine to make the page > >> unmovable during the writeback. We should not try to optimize for the > >> bad or buggy behavior of fuse server. > >> > >> Regarding the avoidance of wait on writeback for fuse folios, I think we > >> can handle the migration similar to how you are handling reclaim and in > >> addition we can add a WARN() in folio_wait_writeback() if the kernel ever > >> sees a fuse folio in that function. > > > > Awesome, this is what I'm planning to do in v3 to address migration then: > > > > 1) in migrate_folio_unmap(), only call "folio_wait_writeback(src);" if > > src->mapping does not have the AS_NO_WRITEBACK_WAIT bit set on it (eg > > fuse folios will have that AS_NO_WRITEBACK_WAIT bit set) > > I think it's generally okay to skip FUSE pages under writeback when the > sync migrate_pages() is called in low memory context, which only tries > to migrate as many pages as possible (i.e. best effort). > > While more caution may be needed when the sync migrate_pages() is called > with an implicit hint that the migration can not fail. For example, > > ``` > offline_pages > while { > scan_movable_pages > do_migrate_range > } > ``` > > If the malicious server never completes the writeback IO, no progress > will be made in the above while loop, and I'm afraid it will be a dead > loop then. > Thanks for taking a look and sharing your thoughts. I agree. I think for this offline_pages() path, we need to handle this "TODO: fatal migration failures should bail out". For v3 I'm thinking of handling this by having some number of retries where we try do_migrate_range() but if it still doesn't succeed, to skip those pages and move onto the next. > > > > > 2) in the fuse filesystem's implementation of the > > mapping->a_ops->migrate_folio callback, return -EAGAIN if the folio is > > under writeback. > > Is there any possibility that a_ops->migrate_folio() may be called with > the folio under writeback? > > - for most pages without AS_NO_WRITEBACK_WAIT, a_ops->migrate_folio() > will be called only when Page_writeback is cleared; > - for AS_NO_WRITEBACK_WAIT pages, they are skipped if they are under > writeback > For AS_NO_WRITEBACK_WAIT_PAGES, if we skip waiting on them if they are under writeback, I think the a_ops->migrate_folio() will still get called (by migrate_pages_batch() -> migrate_folio_move() -> move_to_new_folio()). Looking at migrate_folio_unmap() some more though, I don't think we can just skip the wait call like we can for the sync(2) case. I think we need to error out here instead since after the wait call, migrate_folio_unmap() will replace the folio's page table mappings (try_to_migrate()). If we error out here, then there's no hitting a_ops->migrate_folio() when the folio is under writeback. Thanks, Joanne > -- > Thanks, > Jingbo
On 11/2/24 4:54 AM, Joanne Koong wrote: > On Fri, Nov 1, 2024 at 4:44 AM Jingbo Xu <jefflexu@linux.alibaba.com> wrote: >> >> Hi Joanne, >> >> Thanks for keeping pushing this forward. >> >> On 11/1/24 5:52 AM, Joanne Koong wrote: >>> On Thu, Oct 31, 2024 at 1:06 PM Shakeel Butt <shakeel.butt@linux.dev> wrote: >>>> >>>> On Thu, Oct 31, 2024 at 12:06:49PM GMT, Joanne Koong wrote: >>>>> On Wed, Oct 30, 2024 at 5:30 PM Shakeel Butt <shakeel.butt@linux.dev> wrote: >>>> [...] >>>>>> >>>>>> Memory pool is a bit confusing term here. Most probably you are asking >>>>>> about the migrate type of the page block from which tmp page is >>>>>> allocated from. In a normal system, tmp page would be allocated from page >>>>>> block with MIGRATE_UNMOVABLE migrate type while the page cache page, it >>>>>> depends on what gfp flag was used for its allocation. What does fuse fs >>>>>> use? GFP_HIGHUSER_MOVABLE or something else? Under low memory situation >>>>>> allocations can get mixed up with different migrate types. >>>>>> >>>>> >>>>> I believe it's GFP_HIGHUSER_MOVABLE for the page cache pages since >>>>> fuse doesn't set any additional gfp masks on the inode mapping. >>>>> >>>>> Could we just allocate the fuse writeback pages with GFP_HIGHUSER >>>>> instead of GFP_HIGHUSER_MOVABLE? That would be in fuse_write_begin() >>>>> where we pass in the gfp mask to __filemap_get_folio(). I think this >>>>> would give us the same behavior memory-wise as what the tmp pages >>>>> currently do, >>>> >>>> I don't think it would be the same behavior. From what I understand the >>>> liftime of the tmp page is from the start of the writeback till the ack >>>> from the fuse server that writeback is done. While the lifetime of the >>>> page of the page cache can be arbitrarily large. We should just make it >>>> unmovable for its lifetime. I think it is fine to make the page >>>> unmovable during the writeback. We should not try to optimize for the >>>> bad or buggy behavior of fuse server. >>>> >>>> Regarding the avoidance of wait on writeback for fuse folios, I think we >>>> can handle the migration similar to how you are handling reclaim and in >>>> addition we can add a WARN() in folio_wait_writeback() if the kernel ever >>>> sees a fuse folio in that function. >>> >>> Awesome, this is what I'm planning to do in v3 to address migration then: >>> >>> 1) in migrate_folio_unmap(), only call "folio_wait_writeback(src);" if >>> src->mapping does not have the AS_NO_WRITEBACK_WAIT bit set on it (eg >>> fuse folios will have that AS_NO_WRITEBACK_WAIT bit set) >> >> I think it's generally okay to skip FUSE pages under writeback when the >> sync migrate_pages() is called in low memory context, which only tries >> to migrate as many pages as possible (i.e. best effort). >> >> While more caution may be needed when the sync migrate_pages() is called >> with an implicit hint that the migration can not fail. For example, >> >> ``` >> offline_pages >> while { >> scan_movable_pages >> do_migrate_range >> } >> ``` >> >> If the malicious server never completes the writeback IO, no progress >> will be made in the above while loop, and I'm afraid it will be a dead >> loop then. >> > > Thanks for taking a look and sharing your thoughts. > I agree. I think for this offline_pages() path, we need to handle this > "TODO: fatal migration failures should bail out". For v3 I'm thinking > of handling this by having some number of retries where we try > do_migrate_range() but if it still doesn't succeed, to skip those > pages and move onto the next. > >> >>> >>> 2) in the fuse filesystem's implementation of the >>> mapping->a_ops->migrate_folio callback, return -EAGAIN if the folio is >>> under writeback. >> >> Is there any possibility that a_ops->migrate_folio() may be called with >> the folio under writeback? >> >> - for most pages without AS_NO_WRITEBACK_WAIT, a_ops->migrate_folio() >> will be called only when Page_writeback is cleared; >> - for AS_NO_WRITEBACK_WAIT pages, they are skipped if they are under >> writeback >> > > For AS_NO_WRITEBACK_WAIT_PAGES, if we skip waiting on them if they are > under writeback, I think the a_ops->migrate_folio() will still get > called (by migrate_pages_batch() -> migrate_folio_move() -> > move_to_new_folio()). > > Looking at migrate_folio_unmap() some more though, I don't think we > can just skip the wait call like we can for the sync(2) case. I think > we need to error out here instead since after the wait call, > migrate_folio_unmap() will replace the folio's page table mappings > (try_to_migrate()). If we error out here, then there's no hitting > a_ops->migrate_folio() when the folio is under writeback. > Right, we need to bail out to skip this page (under writeback), just like how MIGRATE_SYNC does.
On Thu, Oct 31, 2024 at 3:38 PM Shakeel Butt <shakeel.butt@linux.dev> wrote: > > On Thu, Oct 31, 2024 at 02:52:57PM GMT, Joanne Koong wrote: > > On Thu, Oct 31, 2024 at 1:06 PM Shakeel Butt <shakeel.butt@linux.dev> wrote: > > > > > > On Thu, Oct 31, 2024 at 12:06:49PM GMT, Joanne Koong wrote: > > > > On Wed, Oct 30, 2024 at 5:30 PM Shakeel Butt <shakeel.butt@linux.dev> wrote: > > > [...] > > > > > > > > > > Memory pool is a bit confusing term here. Most probably you are asking > > > > > about the migrate type of the page block from which tmp page is > > > > > allocated from. In a normal system, tmp page would be allocated from page > > > > > block with MIGRATE_UNMOVABLE migrate type while the page cache page, it > > > > > depends on what gfp flag was used for its allocation. What does fuse fs > > > > > use? GFP_HIGHUSER_MOVABLE or something else? Under low memory situation > > > > > allocations can get mixed up with different migrate types. > > > > > > > > > > > > > I believe it's GFP_HIGHUSER_MOVABLE for the page cache pages since > > > > fuse doesn't set any additional gfp masks on the inode mapping. > > > > > > > > Could we just allocate the fuse writeback pages with GFP_HIGHUSER > > > > instead of GFP_HIGHUSER_MOVABLE? That would be in fuse_write_begin() > > > > where we pass in the gfp mask to __filemap_get_folio(). I think this > > > > would give us the same behavior memory-wise as what the tmp pages > > > > currently do, > > > > > > I don't think it would be the same behavior. From what I understand the > > > liftime of the tmp page is from the start of the writeback till the ack > > > from the fuse server that writeback is done. While the lifetime of the > > > page of the page cache can be arbitrarily large. We should just make it > > > unmovable for its lifetime. I think it is fine to make the page > > > unmovable during the writeback. We should not try to optimize for the > > > bad or buggy behavior of fuse server. > > > > > > Regarding the avoidance of wait on writeback for fuse folios, I think we > > > can handle the migration similar to how you are handling reclaim and in > > > addition we can add a WARN() in folio_wait_writeback() if the kernel ever > > > sees a fuse folio in that function. > > > > Awesome, this is what I'm planning to do in v3 to address migration then: > > > > 1) in migrate_folio_unmap(), only call "folio_wait_writeback(src);" if > > src->mapping does not have the AS_NO_WRITEBACK_WAIT bit set on it (eg > > fuse folios will have that AS_NO_WRITEBACK_WAIT bit set) > > > > 2) in the fuse filesystem's implementation of the > > mapping->a_ops->migrate_folio callback, return -EAGAIN if the folio is > > under writeback. > > 3) Add WARN_ONCE() in folio_wait_writeback() if folio->mapping has > AS_NO_WRITEBACK_WAIT set and return without waiting. For v3, I'm going to change AS_NO_WRITEBACK_RECLAIM to AS_WRITEBACK_MAY_BLOCK and skip 3) because 3) may be too restrictive. For example, for the sync_file_range() syscall, we do want to wait on writeback - it's ok in this case to call folio_wait_writeback() on a fuse folio since the caller would have intentionally passed in a fuse fd to sync_file_range(). Thanks, Joanne > > > > > Does this sound good? > > Yes.
On Wed, Nov 06, 2024 at 03:37:11PM -0800, Joanne Koong wrote: > On Thu, Oct 31, 2024 at 3:38 PM Shakeel Butt <shakeel.butt@linux.dev> wrote: > > > > On Thu, Oct 31, 2024 at 02:52:57PM GMT, Joanne Koong wrote: > > > On Thu, Oct 31, 2024 at 1:06 PM Shakeel Butt <shakeel.butt@linux.dev> wrote: > > > > > > > > On Thu, Oct 31, 2024 at 12:06:49PM GMT, Joanne Koong wrote: > > > > > On Wed, Oct 30, 2024 at 5:30 PM Shakeel Butt <shakeel.butt@linux.dev> wrote: > > > > [...] > > > > > > > > > > > > Memory pool is a bit confusing term here. Most probably you are asking > > > > > > about the migrate type of the page block from which tmp page is > > > > > > allocated from. In a normal system, tmp page would be allocated from page > > > > > > block with MIGRATE_UNMOVABLE migrate type while the page cache page, it > > > > > > depends on what gfp flag was used for its allocation. What does fuse fs > > > > > > use? GFP_HIGHUSER_MOVABLE or something else? Under low memory situation > > > > > > allocations can get mixed up with different migrate types. > > > > > > > > > > > > > > > > I believe it's GFP_HIGHUSER_MOVABLE for the page cache pages since > > > > > fuse doesn't set any additional gfp masks on the inode mapping. > > > > > > > > > > Could we just allocate the fuse writeback pages with GFP_HIGHUSER > > > > > instead of GFP_HIGHUSER_MOVABLE? That would be in fuse_write_begin() > > > > > where we pass in the gfp mask to __filemap_get_folio(). I think this > > > > > would give us the same behavior memory-wise as what the tmp pages > > > > > currently do, > > > > > > > > I don't think it would be the same behavior. From what I understand the > > > > liftime of the tmp page is from the start of the writeback till the ack > > > > from the fuse server that writeback is done. While the lifetime of the > > > > page of the page cache can be arbitrarily large. We should just make it > > > > unmovable for its lifetime. I think it is fine to make the page > > > > unmovable during the writeback. We should not try to optimize for the > > > > bad or buggy behavior of fuse server. > > > > > > > > Regarding the avoidance of wait on writeback for fuse folios, I think we > > > > can handle the migration similar to how you are handling reclaim and in > > > > addition we can add a WARN() in folio_wait_writeback() if the kernel ever > > > > sees a fuse folio in that function. > > > > > > Awesome, this is what I'm planning to do in v3 to address migration then: > > > > > > 1) in migrate_folio_unmap(), only call "folio_wait_writeback(src);" if > > > src->mapping does not have the AS_NO_WRITEBACK_WAIT bit set on it (eg > > > fuse folios will have that AS_NO_WRITEBACK_WAIT bit set) > > > > > > 2) in the fuse filesystem's implementation of the > > > mapping->a_ops->migrate_folio callback, return -EAGAIN if the folio is > > > under writeback. > > > > 3) Add WARN_ONCE() in folio_wait_writeback() if folio->mapping has > > AS_NO_WRITEBACK_WAIT set and return without waiting. > > For v3, I'm going to change AS_NO_WRITEBACK_RECLAIM to > AS_WRITEBACK_MAY_BLOCK and skip 3) because 3) may be too restrictive. > For example, for the sync_file_range() syscall, we do want to wait on > writeback - it's ok in this case to call folio_wait_writeback() on a > fuse folio since the caller would have intentionally passed in a fuse > fd to sync_file_range(). > Sounds good.
diff --git a/fs/fuse/file.c b/fs/fuse/file.c index 4304e44f32e6..c3c5ddb4f66b 100644 --- a/fs/fuse/file.c +++ b/fs/fuse/file.c @@ -415,74 +415,11 @@ u64 fuse_lock_owner_id(struct fuse_conn *fc, fl_owner_t id) struct fuse_writepage_args { struct fuse_io_args ia; - struct rb_node writepages_entry; struct list_head queue_entry; - struct fuse_writepage_args *next; struct inode *inode; struct fuse_sync_bucket *bucket; }; -static struct fuse_writepage_args *fuse_find_writeback(struct fuse_inode *fi, - pgoff_t idx_from, pgoff_t idx_to) -{ - struct rb_node *n; - - n = fi->writepages.rb_node; - - while (n) { - struct fuse_writepage_args *wpa; - pgoff_t curr_index; - - wpa = rb_entry(n, struct fuse_writepage_args, writepages_entry); - WARN_ON(get_fuse_inode(wpa->inode) != fi); - curr_index = wpa->ia.write.in.offset >> PAGE_SHIFT; - if (idx_from >= curr_index + wpa->ia.ap.num_pages) - n = n->rb_right; - else if (idx_to < curr_index) - n = n->rb_left; - else - return wpa; - } - return NULL; -} - -/* - * Check if any page in a range is under writeback - */ -static bool fuse_range_is_writeback(struct inode *inode, pgoff_t idx_from, - pgoff_t idx_to) -{ - struct fuse_inode *fi = get_fuse_inode(inode); - bool found; - - if (RB_EMPTY_ROOT(&fi->writepages)) - return false; - - spin_lock(&fi->lock); - found = fuse_find_writeback(fi, idx_from, idx_to); - spin_unlock(&fi->lock); - - return found; -} - -static inline bool fuse_page_is_writeback(struct inode *inode, pgoff_t index) -{ - return fuse_range_is_writeback(inode, index, index); -} - -/* - * Wait for page writeback to be completed. - * - * Since fuse doesn't rely on the VM writeback tracking, this has to - * use some other means. - */ -static void fuse_wait_on_page_writeback(struct inode *inode, pgoff_t index) -{ - struct fuse_inode *fi = get_fuse_inode(inode); - - wait_event(fi->page_waitq, !fuse_page_is_writeback(inode, index)); -} - /* * Wait for all pending writepages on the inode to finish. * @@ -876,7 +813,7 @@ static int fuse_do_readpage(struct file *file, struct page *page) * page-cache page, so make sure we read a properly synced * page. */ - fuse_wait_on_page_writeback(inode, page->index); + folio_wait_writeback(page_folio(page)); attr_ver = fuse_get_attr_version(fm->fc); @@ -1024,8 +961,7 @@ static void fuse_readahead(struct readahead_control *rac) ap = &ia->ap; nr_pages = __readahead_batch(rac, ap->pages, nr_pages); for (i = 0; i < nr_pages; i++) { - fuse_wait_on_page_writeback(inode, - readahead_index(rac) + i); + folio_wait_writeback(page_folio(ap->pages[i])); ap->descs[i].length = PAGE_SIZE; } ap->num_pages = nr_pages; @@ -1147,7 +1083,7 @@ static ssize_t fuse_send_write_pages(struct fuse_io_args *ia, int err; for (i = 0; i < ap->num_pages; i++) - fuse_wait_on_page_writeback(inode, ap->pages[i]->index); + folio_wait_writeback(page_folio(ap->pages[i])); fuse_write_args_fill(ia, ff, pos, count); ia->write.in.flags = fuse_write_flags(iocb); @@ -1583,7 +1519,7 @@ ssize_t fuse_direct_io(struct fuse_io_priv *io, struct iov_iter *iter, return res; } } - if (!cuse && fuse_range_is_writeback(inode, idx_from, idx_to)) { + if (!cuse && filemap_range_has_writeback(mapping, pos, (pos + count - 1))) { if (!write) inode_lock(inode); fuse_sync_writes(inode); @@ -1780,13 +1716,17 @@ static ssize_t fuse_splice_write(struct pipe_inode_info *pipe, struct file *out, static void fuse_writepage_free(struct fuse_writepage_args *wpa) { struct fuse_args_pages *ap = &wpa->ia.ap; + struct folio *folio; int i; if (wpa->bucket) fuse_sync_bucket_dec(wpa->bucket); - for (i = 0; i < ap->num_pages; i++) - __free_page(ap->pages[i]); + for (i = 0; i < ap->num_pages; i++) { + folio = page_folio(ap->pages[i]); + folio_end_writeback(folio); + folio_put(folio); + } fuse_file_put(wpa->ia.ff, false); @@ -1799,7 +1739,7 @@ static void fuse_writepage_finish_stat(struct inode *inode, struct page *page) struct backing_dev_info *bdi = inode_to_bdi(inode); dec_wb_stat(&bdi->wb, WB_WRITEBACK); - dec_node_page_state(page, NR_WRITEBACK_TEMP); + dec_node_page_state(page, NR_WRITEBACK); wb_writeout_inc(&bdi->wb); } @@ -1822,7 +1762,6 @@ static void fuse_send_writepage(struct fuse_mount *fm, __releases(fi->lock) __acquires(fi->lock) { - struct fuse_writepage_args *aux, *next; struct fuse_inode *fi = get_fuse_inode(wpa->inode); struct fuse_write_in *inarg = &wpa->ia.write.in; struct fuse_args *args = &wpa->ia.ap.args; @@ -1858,18 +1797,8 @@ __acquires(fi->lock) out_free: fi->writectr--; - rb_erase(&wpa->writepages_entry, &fi->writepages); fuse_writepage_finish(wpa); spin_unlock(&fi->lock); - - /* After rb_erase() aux request list is private */ - for (aux = wpa->next; aux; aux = next) { - next = aux->next; - aux->next = NULL; - fuse_writepage_finish_stat(aux->inode, aux->ia.ap.pages[0]); - fuse_writepage_free(aux); - } - fuse_writepage_free(wpa); spin_lock(&fi->lock); } @@ -1897,43 +1826,6 @@ __acquires(fi->lock) } } -static struct fuse_writepage_args *fuse_insert_writeback(struct rb_root *root, - struct fuse_writepage_args *wpa) -{ - pgoff_t idx_from = wpa->ia.write.in.offset >> PAGE_SHIFT; - pgoff_t idx_to = idx_from + wpa->ia.ap.num_pages - 1; - struct rb_node **p = &root->rb_node; - struct rb_node *parent = NULL; - - WARN_ON(!wpa->ia.ap.num_pages); - while (*p) { - struct fuse_writepage_args *curr; - pgoff_t curr_index; - - parent = *p; - curr = rb_entry(parent, struct fuse_writepage_args, - writepages_entry); - WARN_ON(curr->inode != wpa->inode); - curr_index = curr->ia.write.in.offset >> PAGE_SHIFT; - - if (idx_from >= curr_index + curr->ia.ap.num_pages) - p = &(*p)->rb_right; - else if (idx_to < curr_index) - p = &(*p)->rb_left; - else - return curr; - } - - rb_link_node(&wpa->writepages_entry, parent, p); - rb_insert_color(&wpa->writepages_entry, root); - return NULL; -} - -static void tree_insert(struct rb_root *root, struct fuse_writepage_args *wpa) -{ - WARN_ON(fuse_insert_writeback(root, wpa)); -} - static void fuse_writepage_end(struct fuse_mount *fm, struct fuse_args *args, int error) { @@ -1953,41 +1845,6 @@ static void fuse_writepage_end(struct fuse_mount *fm, struct fuse_args *args, if (!fc->writeback_cache) fuse_invalidate_attr_mask(inode, FUSE_STATX_MODIFY); spin_lock(&fi->lock); - rb_erase(&wpa->writepages_entry, &fi->writepages); - while (wpa->next) { - struct fuse_mount *fm = get_fuse_mount(inode); - struct fuse_write_in *inarg = &wpa->ia.write.in; - struct fuse_writepage_args *next = wpa->next; - - wpa->next = next->next; - next->next = NULL; - tree_insert(&fi->writepages, next); - - /* - * Skip fuse_flush_writepages() to make it easy to crop requests - * based on primary request size. - * - * 1st case (trivial): there are no concurrent activities using - * fuse_set/release_nowrite. Then we're on safe side because - * fuse_flush_writepages() would call fuse_send_writepage() - * anyway. - * - * 2nd case: someone called fuse_set_nowrite and it is waiting - * now for completion of all in-flight requests. This happens - * rarely and no more than once per page, so this should be - * okay. - * - * 3rd case: someone (e.g. fuse_do_setattr()) is in the middle - * of fuse_set_nowrite..fuse_release_nowrite section. The fact - * that fuse_set_nowrite returned implies that all in-flight - * requests were completed along with all of their secondary - * requests. Further primary requests are blocked by negative - * writectr. Hence there cannot be any in-flight requests and - * no invocations of fuse_writepage_end() while we're in - * fuse_set_nowrite..fuse_release_nowrite section. - */ - fuse_send_writepage(fm, next, inarg->offset + inarg->size); - } fi->writectr--; fuse_writepage_finish(wpa); spin_unlock(&fi->lock); @@ -2074,19 +1931,18 @@ static void fuse_writepage_add_to_bucket(struct fuse_conn *fc, } static void fuse_writepage_args_page_fill(struct fuse_writepage_args *wpa, struct folio *folio, - struct folio *tmp_folio, uint32_t page_index) + uint32_t page_index) { struct inode *inode = folio->mapping->host; struct fuse_args_pages *ap = &wpa->ia.ap; - folio_copy(tmp_folio, folio); - - ap->pages[page_index] = &tmp_folio->page; + folio_get(folio); + ap->pages[page_index] = &folio->page; ap->descs[page_index].offset = 0; ap->descs[page_index].length = PAGE_SIZE; inc_wb_stat(&inode_to_bdi(inode)->wb, WB_WRITEBACK); - inc_node_page_state(&tmp_folio->page, NR_WRITEBACK_TEMP); + inc_node_page_state(&folio->page, NR_WRITEBACK); } static struct fuse_writepage_args *fuse_writepage_args_setup(struct folio *folio, @@ -2121,18 +1977,12 @@ static int fuse_writepage_locked(struct folio *folio) struct fuse_inode *fi = get_fuse_inode(inode); struct fuse_writepage_args *wpa; struct fuse_args_pages *ap; - struct folio *tmp_folio; struct fuse_file *ff; - int error = -ENOMEM; - - tmp_folio = folio_alloc(GFP_NOFS | __GFP_HIGHMEM, 0); - if (!tmp_folio) - goto err; + int error = -EIO; - error = -EIO; ff = fuse_write_file_get(fi); if (!ff) - goto err_nofile; + goto err; wpa = fuse_writepage_args_setup(folio, ff); error = -ENOMEM; @@ -2143,22 +1993,17 @@ static int fuse_writepage_locked(struct folio *folio) ap->num_pages = 1; folio_start_writeback(folio); - fuse_writepage_args_page_fill(wpa, folio, tmp_folio, 0); + fuse_writepage_args_page_fill(wpa, folio, 0); spin_lock(&fi->lock); - tree_insert(&fi->writepages, wpa); list_add_tail(&wpa->queue_entry, &fi->queued_writes); fuse_flush_writepages(inode); spin_unlock(&fi->lock); - folio_end_writeback(folio); - return 0; err_writepage_args: fuse_file_put(ff, false); -err_nofile: - folio_put(tmp_folio); err: mapping_set_error(folio->mapping, error); return error; @@ -2168,7 +2013,6 @@ struct fuse_fill_wb_data { struct fuse_writepage_args *wpa; struct fuse_file *ff; struct inode *inode; - struct page **orig_pages; unsigned int max_pages; }; @@ -2203,68 +2047,11 @@ static void fuse_writepages_send(struct fuse_fill_wb_data *data) struct fuse_writepage_args *wpa = data->wpa; struct inode *inode = data->inode; struct fuse_inode *fi = get_fuse_inode(inode); - int num_pages = wpa->ia.ap.num_pages; - int i; spin_lock(&fi->lock); list_add_tail(&wpa->queue_entry, &fi->queued_writes); fuse_flush_writepages(inode); spin_unlock(&fi->lock); - - for (i = 0; i < num_pages; i++) - end_page_writeback(data->orig_pages[i]); -} - -/* - * Check under fi->lock if the page is under writeback, and insert it onto the - * rb_tree if not. Otherwise iterate auxiliary write requests, to see if there's - * one already added for a page at this offset. If there's none, then insert - * this new request onto the auxiliary list, otherwise reuse the existing one by - * swapping the new temp page with the old one. - */ -static bool fuse_writepage_add(struct fuse_writepage_args *new_wpa, - struct page *page) -{ - struct fuse_inode *fi = get_fuse_inode(new_wpa->inode); - struct fuse_writepage_args *tmp; - struct fuse_writepage_args *old_wpa; - struct fuse_args_pages *new_ap = &new_wpa->ia.ap; - - WARN_ON(new_ap->num_pages != 0); - new_ap->num_pages = 1; - - spin_lock(&fi->lock); - old_wpa = fuse_insert_writeback(&fi->writepages, new_wpa); - if (!old_wpa) { - spin_unlock(&fi->lock); - return true; - } - - for (tmp = old_wpa->next; tmp; tmp = tmp->next) { - pgoff_t curr_index; - - WARN_ON(tmp->inode != new_wpa->inode); - curr_index = tmp->ia.write.in.offset >> PAGE_SHIFT; - if (curr_index == page->index) { - WARN_ON(tmp->ia.ap.num_pages != 1); - swap(tmp->ia.ap.pages[0], new_ap->pages[0]); - break; - } - } - - if (!tmp) { - new_wpa->next = old_wpa->next; - old_wpa->next = new_wpa; - } - - spin_unlock(&fi->lock); - - if (tmp) { - fuse_writepage_finish_stat(new_wpa->inode, new_ap->pages[0]); - fuse_writepage_free(new_wpa); - } - - return false; } static bool fuse_writepage_need_send(struct fuse_conn *fc, struct page *page, @@ -2273,15 +2060,6 @@ static bool fuse_writepage_need_send(struct fuse_conn *fc, struct page *page, { WARN_ON(!ap->num_pages); - /* - * Being under writeback is unlikely but possible. For example direct - * read to an mmaped fuse file will set the page dirty twice; once when - * the pages are faulted with get_user_pages(), and then after the read - * completed. - */ - if (fuse_page_is_writeback(data->inode, page->index)) - return true; - /* Reached max pages */ if (ap->num_pages == fc->max_pages) return true; @@ -2291,7 +2069,7 @@ static bool fuse_writepage_need_send(struct fuse_conn *fc, struct page *page, return true; /* Discontinuity */ - if (data->orig_pages[ap->num_pages - 1]->index + 1 != page->index) + if (ap->pages[ap->num_pages - 1]->index + 1 != page->index) return true; /* Need to grow the pages array? If so, did the expansion fail? */ @@ -2308,9 +2086,7 @@ static int fuse_writepages_fill(struct folio *folio, struct fuse_writepage_args *wpa = data->wpa; struct fuse_args_pages *ap = &wpa->ia.ap; struct inode *inode = data->inode; - struct fuse_inode *fi = get_fuse_inode(inode); struct fuse_conn *fc = get_fuse_conn(inode); - struct folio *tmp_folio; int err; if (wpa && fuse_writepage_need_send(fc, &folio->page, ap, data)) { @@ -2318,54 +2094,23 @@ static int fuse_writepages_fill(struct folio *folio, data->wpa = NULL; } - err = -ENOMEM; - tmp_folio = folio_alloc(GFP_NOFS | __GFP_HIGHMEM, 0); - if (!tmp_folio) - goto out_unlock; - - /* - * The page must not be redirtied until the writeout is completed - * (i.e. userspace has sent a reply to the write request). Otherwise - * there could be more than one temporary page instance for each real - * page. - * - * This is ensured by holding the page lock in page_mkwrite() while - * checking fuse_page_is_writeback(). We already hold the page lock - * since clear_page_dirty_for_io() and keep it held until we add the - * request to the fi->writepages list and increment ap->num_pages. - * After this fuse_page_is_writeback() will indicate that the page is - * under writeback, so we can release the page lock. - */ if (data->wpa == NULL) { err = -ENOMEM; wpa = fuse_writepage_args_setup(folio, data->ff); - if (!wpa) { - folio_put(tmp_folio); + if (!wpa) goto out_unlock; - } fuse_file_get(wpa->ia.ff); data->max_pages = 1; ap = &wpa->ia.ap; } folio_start_writeback(folio); - fuse_writepage_args_page_fill(wpa, folio, tmp_folio, ap->num_pages); - data->orig_pages[ap->num_pages] = &folio->page; + fuse_writepage_args_page_fill(wpa, folio, ap->num_pages); err = 0; - if (data->wpa) { - /* - * Protected by fi->lock against concurrent access by - * fuse_page_is_writeback(). - */ - spin_lock(&fi->lock); - ap->num_pages++; - spin_unlock(&fi->lock); - } else if (fuse_writepage_add(wpa, &folio->page)) { + ap->num_pages++; + if (!data->wpa) data->wpa = wpa; - } else { - folio_end_writeback(folio); - } out_unlock: folio_unlock(folio); @@ -2394,21 +2139,12 @@ static int fuse_writepages(struct address_space *mapping, if (!data.ff) return -EIO; - err = -ENOMEM; - data.orig_pages = kcalloc(fc->max_pages, - sizeof(struct page *), - GFP_NOFS); - if (!data.orig_pages) - goto out; - err = write_cache_pages(mapping, wbc, fuse_writepages_fill, &data); if (data.wpa) { WARN_ON(!data.wpa->ia.ap.num_pages); fuse_writepages_send(&data); } - kfree(data.orig_pages); -out: fuse_file_put(data.ff, false); return err; } @@ -2433,7 +2169,7 @@ static int fuse_write_begin(struct file *file, struct address_space *mapping, if (IS_ERR(folio)) goto error; - fuse_wait_on_page_writeback(mapping->host, folio->index); + folio_wait_writeback(folio); if (folio_test_uptodate(folio) || len >= folio_size(folio)) goto success; @@ -2497,13 +2233,11 @@ static int fuse_launder_folio(struct folio *folio) { int err = 0; if (folio_clear_dirty_for_io(folio)) { - struct inode *inode = folio->mapping->host; - /* Serialize with pending writeback for the same page */ - fuse_wait_on_page_writeback(inode, folio->index); + folio_wait_writeback(folio); err = fuse_writepage_locked(folio); if (!err) - fuse_wait_on_page_writeback(inode, folio->index); + folio_wait_writeback(folio); } return err; } @@ -2547,7 +2281,7 @@ static vm_fault_t fuse_page_mkwrite(struct vm_fault *vmf) return VM_FAULT_NOPAGE; } - fuse_wait_on_page_writeback(inode, page->index); + folio_wait_writeback(page_folio(page)); return VM_FAULT_LOCKED; } @@ -3368,6 +3102,7 @@ void fuse_init_file_inode(struct inode *inode, unsigned int flags) inode->i_fop = &fuse_file_operations; inode->i_data.a_ops = &fuse_file_aops; + mapping_set_no_writeback_reclaim(&inode->i_data); INIT_LIST_HEAD(&fi->write_files); INIT_LIST_HEAD(&fi->queued_writes); @@ -3375,7 +3110,6 @@ void fuse_init_file_inode(struct inode *inode, unsigned int flags) fi->iocachectr = 0; init_waitqueue_head(&fi->page_waitq); init_waitqueue_head(&fi->direct_io_waitq); - fi->writepages = RB_ROOT; if (IS_ENABLED(CONFIG_FUSE_DAX)) fuse_dax_inode_init(inode, flags);
Currently, we allocate and copy data to a temporary folio when handling writeback in order to mitigate the following deadlock scenario that may arise if reclaim waits on writeback to complete: * single-threaded FUSE server is in the middle of handling a request that needs a memory allocation * memory allocation triggers direct reclaim * direct reclaim waits on a folio under writeback * the FUSE server can't write back the folio since it's stuck in direct reclaim To work around this, we allocate a temporary folio and copy over the original folio to the temporary folio so that writeback can be immediately cleared on the original folio. This additionally requires us to maintain an internal rb tree to keep track of writeback state on the temporary folios. This change sets AS_NO_WRITEBACK_RECLAIM on the inode mapping so that FUSE folios are not reclaimed and waited on while in writeback, and removes the temporary folio + extra copying and the internal rb tree. fio benchmarks -- (using averages observed from 10 runs, throwing away outliers) Setup: sudo mount -t tmpfs -o size=30G tmpfs ~/tmp_mount ./libfuse/build/example/passthrough_ll -o writeback -o max_threads=4 -o source=~/tmp_mount ~/fuse_mount fio --name=writeback --ioengine=sync --rw=write --bs={1k,4k,1M} --size=2G --numjobs=2 --ramp_time=30 --group_reporting=1 --directory=/root/fuse_mount bs = 1k 4k 1M Before 351 MiB/s 1818 MiB/s 1851 MiB/s After 341 MiB/s 2246 MiB/s 2685 MiB/s % diff -3% 23% 45% Signed-off-by: Joanne Koong <joannelkoong@gmail.com> --- fs/fuse/file.c | 322 +++++-------------------------------------------- 1 file changed, 28 insertions(+), 294 deletions(-)