Message ID | 20230209123206.3548-4-jack@suse.cz (mailing list archive)
---|---
State | New
Series | Writeback handling of pinned pages
Eww.  The block bounce code really needs to go away, so a new user makes
me very unhappy.

But independent of that I don't think this is enough anyway.  Just copying
the data out into a new page in the block layer doesn't solve the problem
that this page needs to be tracked as dirtied for fs accounting, e.g.
every time we write this copy it needs space allocated for COW file
systems.

Which brings me back to if and when we do writeback for pinned pages.
I don't think doing any I/O for short term pins like direct I/O makes
sense.  These pins are defined to be unpinned after I/O completes, so we
might as well just wait for the unpin instead of doing anything
complicated.

Long term pins are more troublesome, but I really wonder what the defined
semantics for data integrity writeback like fsync on them is to start
with, as the content is very much undefined.  Should an fsync on a
(partially) long term pinned file simply fail?  It's not like we can win
in that scenario.
On Mon 13-02-23 01:59:28, Christoph Hellwig wrote:
> Eww.  The block bounce code really needs to go away, so a new user
> makes me very unhappy.
>
> But independent of that I don't think this is enough anyway.  Just
> copying the data out into a new page in the block layer doesn't solve
> the problem that this page needs to be tracked as dirtied for fs
> accounting, e.g. every time we write this copy it needs space allocated
> for COW file systems.

Right, I forgot about this in my RFC.  My original plan was to not clear
the dirty bit in clear_page_dirty_for_io() even for WB_SYNC_ALL writeback
when we do writeback the page, and perhaps indicate this in the return
value of clear_page_dirty_for_io() so that the COW filesystem can keep
tracking this page as dirty.

> Which brings me back to if and when we do writeback for pinned pages.
> I don't think doing any I/O for short term pins like direct I/O
> makes sense.  These pins are defined to be unpinned after I/O
> completes, so we might as well just wait for the unpin instead of doing
> anything complicated.

Agreed.  For short term pins we could just wait, which should be quite
simple (although there's some DoS potential in this behavior if somebody
runs multiple processes that keep pinning some page with short term pins).

> Long term pins are more troublesome, but I really wonder what the
> defined semantics for data integrity writeback like fsync on them
> is to start with, as the content is very much undefined.  Should
> an fsync on a (partially) long term pinned file simply fail?  It's
> not like we can win in that scenario.

Well, we also have cases like sync(2), so one would have to be careful
with error propagation, and I'm afraid there are enough programs out there
that treat any error return from fsync(2) as catastrophic, so I suspect
this could lead to some surprises.

The case I'm most worried about is an application that sets up RDMA to an
mmapped file, runs the transfer and waits for it to complete, doesn't
bother to unpin the pages (keeps them for future transfers), and calls
fsync(2) to make the data stable on local storage.  That does seem like
quite a sensible use, and so far it works just fine.  Not writing pages
with fsync(2) would break such uses.

								Honza
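To make that idea concrete, here is a minimal, purely hypothetical sketch
of what such modified clear-dirty semantics could look like; the enum and
the example_* names are made up for illustration and this is not code from
the series:

	#include <linux/mm.h>
	#include <linux/pagemap.h>
	#include <linux/writeback.h>

	enum example_clear_dirty_result {
		EXAMPLE_FOLIO_CLEAN,		/* nothing to write */
		EXAMPLE_FOLIO_WRITE,		/* write out, folio becomes clean */
		EXAMPLE_FOLIO_WRITE_KEEP_DIRTY,	/* write out, keep tracking as dirty */
	};

	static enum example_clear_dirty_result
	example_clear_dirty_for_io(struct folio *folio, struct writeback_control *wbc)
	{
		if (wbc->sync_mode == WB_SYNC_ALL && folio_maybe_dma_pinned(folio)) {
			/*
			 * DMA may keep modifying the folio, so this writeback
			 * cannot make it clean; the caller (e.g. a COW
			 * filesystem) keeps its dirty tracking and space
			 * reservation in place.
			 */
			return EXAMPLE_FOLIO_WRITE_KEEP_DIRTY;
		}
		return folio_clear_dirty_for_io(folio) ? EXAMPLE_FOLIO_WRITE :
							 EXAMPLE_FOLIO_CLEAN;
	}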
On Tue, Feb 14, 2023 at 02:56:04PM +0100, Jan Kara wrote:
> On Mon 13-02-23 01:59:28, Christoph Hellwig wrote:
> > Eww.  The block bounce code really needs to go away, so a new user
> > makes me very unhappy.
> >
> > But independent of that I don't think this is enough anyway.  Just
> > copying the data out into a new page in the block layer doesn't solve
> > the problem that this page needs to be tracked as dirtied for fs
> > accounting, e.g. every time we write this copy it needs space
> > allocated for COW file systems.
>
> Right, I forgot about this in my RFC.  My original plan was to not clear
> the dirty bit in clear_page_dirty_for_io() even for WB_SYNC_ALL
> writeback when we do writeback the page, and perhaps indicate this in
> the return value of clear_page_dirty_for_io() so that the COW filesystem
> can keep tracking this page as dirty.

I don't think this works, especially if the COW mechanism relies on
delayed allocation to prevent ENOSPC during writeback.  That is, we need a
write() or page fault (to run ->page_mkwrite()) after every call to
folio_clear_dirty_for_io() in the writeback path to ensure that new space
is reserved for the allocation that will occur during a future writeback
of that page.

Hence we can't just leave the page dirty on COW filesystems - it has to go
through a clean state so that the clean->dirty event can be gated on
gaining the space reservation that allows it to be written back again.

Cheers,

Dave.
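For readers less familiar with the mechanism Dave describes, a simplified
sketch of how a delalloc/COW filesystem can tie its space reservation to
the clean->dirty transition in ->page_mkwrite(); example_reserve_delalloc()
is a made-up placeholder for the filesystem's real reservation call and the
error handling is reduced to the bare minimum:

	#include <linux/fs.h>
	#include <linux/mm.h>
	#include <linux/pagemap.h>

	/* made-up helper standing in for the filesystem's delalloc reservation */
	int example_reserve_delalloc(struct inode *inode, loff_t pos, size_t len);

	static vm_fault_t example_page_mkwrite(struct vm_fault *vmf)
	{
		struct folio *folio = page_folio(vmf->page);
		struct inode *inode = file_inode(vmf->vma->vm_file);

		folio_lock(folio);
		/*
		 * Space is reserved here, on the clean->dirty transition, so
		 * that a later writeback of this folio cannot fail with
		 * ENOSPC.  If the folio never passes through a clean state,
		 * this hook never runs again and no new reservation is made -
		 * which is the problem with keeping pinned pages permanently
		 * dirty.
		 */
		if (example_reserve_delalloc(inode, folio_pos(folio),
					     folio_size(folio))) {
			folio_unlock(folio);
			return VM_FAULT_SIGBUS;
		}
		folio_mark_dirty(folio);
		/* return with the folio locked, as ->page_mkwrite() allows */
		return VM_FAULT_LOCKED;
	}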
On Wed, Feb 15, 2023 at 03:59:52PM +1100, Dave Chinner wrote:
> I don't think this works, especially if the COW mechanism relies on
> delayed allocation to prevent ENOSPC during writeback.  That is, we
> need a write() or page fault (to run ->page_mkwrite()) after every
> call to folio_clear_dirty_for_io() in the writeback path to ensure
> that new space is reserved for the allocation that will occur
> during a future writeback of that page.
>
> Hence we can't just leave the page dirty on COW filesystems - it has
> to go through a clean state so that the clean->dirty event can be
> gated on gaining the space reservation that allows it to be written
> back again.

Exactly.  Although if we really want we could do the redirtying without
formally moving to a clean state, it certainly would require special new
code to perform the same steps as if we were redirtying.

Which is another reason why I'd prefer to avoid all that if we can.  For
that we probably need an inventory of what long term pins we have in the
kernel tree that can and do operate on shared file mappings, and what kind
of I/O semantics they expect.
On Tue 14-02-23 22:24:33, Christoph Hellwig wrote:
> On Wed, Feb 15, 2023 at 03:59:52PM +1100, Dave Chinner wrote:
> > I don't think this works, especially if the COW mechanism relies on
> > delayed allocation to prevent ENOSPC during writeback.  That is, we
> > need a write() or page fault (to run ->page_mkwrite()) after every
> > call to folio_clear_dirty_for_io() in the writeback path to ensure
> > that new space is reserved for the allocation that will occur
> > during a future writeback of that page.
> >
> > Hence we can't just leave the page dirty on COW filesystems - it has
> > to go through a clean state so that the clean->dirty event can be
> > gated on gaining the space reservation that allows it to be written
> > back again.
>
> Exactly.  Although if we really want we could do the redirtying without
> formally moving to a clean state, it certainly would require special
> new code to perform the same steps as if we were redirtying.

Yes.

> Which is another reason why I'd prefer to avoid all that if we can.
> For that we probably need an inventory of what long term pins we have
> in the kernel tree that can and do operate on shared file mappings,
> and what kind of I/O semantics they expect.

I'm a bit skeptical we can reasonably assess that (as much as I would love
to just not write these pages and be done with it) because a lot of
FOLL_LONGTERM users just pin a passed userspace address range, then allow
userspace to manipulate it with other operations, and finally unpin it
with another call.  Who knows whether shared pagecache pages are passed in
and what userspace is doing with them while they are pinned?

We have stuff like io_uring using FOLL_LONGTERM for IO buffers passed from
userspace (e.g. the IORING_REGISTER_BUFFERS operation), we have V4L2 which
similarly pins buffers for video processing (and I vaguely remember a bug
report caused by some phone passing shared file pages there to acquire
screenshots from a webcam), and we have various infiniband drivers doing
this (not all of them use FOLL_LONGTERM but they should AFAICS).  We even
have vmsplice(2), which should arguably be using pinning with
FOLL_LONGTERM (at least that's the plan AFAIK), and not writing such pages
would IMO provide an interesting attack vector...

								Honza
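For context, the pattern shared by these FOLL_LONGTERM users looks roughly
like the following simplified sketch (not taken from any particular
driver); note that nothing in this interface tells the kernel whether the
pinned range is anonymous memory or a shared file mapping:

	#include <linux/errno.h>
	#include <linux/mm.h>

	/*
	 * Typical FOLL_LONGTERM life cycle: pin a user buffer, let the device
	 * DMA into it for as long as the registration lives, then mark the
	 * pages dirty and unpin.
	 */
	static int example_register_buffer(unsigned long uaddr, int nr_pages,
					   struct page **pages)
	{
		int pinned = pin_user_pages_fast(uaddr, nr_pages,
						 FOLL_WRITE | FOLL_LONGTERM, pages);

		if (pinned == nr_pages)
			return 0;
		if (pinned > 0)
			unpin_user_pages(pages, pinned);
		return pinned < 0 ? pinned : -EFAULT;
	}

	static void example_unregister_buffer(struct page **pages, int nr_pages)
	{
		/* mark dirty because the device may have written to the pages */
		unpin_user_pages_dirty_lock(pages, nr_pages, true);
	}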
On Thu, Feb 16, 2023 at 01:33:16PM +0100, Jan Kara wrote:
> I'm a bit skeptical we can reasonably assess that (as much as I would
> love to just not write these pages and be done with it) because a lot
> of FOLL_LONGTERM users just pin a passed userspace address range, then
> allow userspace to manipulate it with other operations, and finally
> unpin it with another call.  Who knows whether shared pagecache pages
> are passed in and what userspace is doing with them while they are
> pinned?

True.

So what other sensible thing could we do at a higher level?

Treat MAP_SHARED buffers that are long term registered as MAP_PRIVATE
while registered, and just do writeback using in-kernel O_DIRECT on fsync?
On Sun 19-02-23 22:22:32, Christoph Hellwig wrote:
> On Thu, Feb 16, 2023 at 01:33:16PM +0100, Jan Kara wrote:
> > I'm a bit skeptical we can reasonably assess that (as much as I would
> > love to just not write these pages and be done with it) because a lot
> > of FOLL_LONGTERM users just pin a passed userspace address range, then
> > allow userspace to manipulate it with other operations, and finally
> > unpin it with another call.  Who knows whether shared pagecache pages
> > are passed in and what userspace is doing with them while they are
> > pinned?
>
> True.
>
> So what other sensible thing could we do at a higher level?
>
> Treat MAP_SHARED buffers that are long term registered as MAP_PRIVATE
> while registered, and just do writeback using in-kernel O_DIRECT on
> fsync?

Do you mean to copy these pages on fsync(2) to a newly allocated buffer
and then submit it via direct IO?  That looks sensible to me.  We could
then make the writeback path just completely ignore these long term
pinned pages and add this copying logic into filemap_fdatawrite() or
something like that.

								Honza
On Mon, Feb 27, 2023 at 12:39:26PM +0100, Jan Kara wrote:
> Do you mean to copy these pages on fsync(2) to a newly allocated buffer
> and then submit it via direct IO?  That looks sensible to me.  We could
> then make the writeback path just completely ignore these long term
> pinned pages and add this copying logic into filemap_fdatawrite() or
> something like that.

I don't think we'd even have to copy them on fsync.  Just do an in-kernel
ITER_BVEC direct I/O on them.  The only hard part would be to come up with
a way to skip the pagecache invalidation and writeout for these I/Os.
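As a rough illustration of what an in-kernel ITER_BVEC direct write could
look like, modeled loosely on how the loop driver issues I/O;
example_write_pinned() is hypothetical and deliberately ignores the
pagecache invalidation problem mentioned above:

	#include <linux/bvec.h>
	#include <linux/fs.h>
	#include <linux/uio.h>

	/*
	 * Write the data of one pinned pagecache page straight to the backing
	 * file with IOCB_DIRECT, instead of sending the (pinned, still
	 * changing) page through the normal writeback path.
	 */
	static ssize_t example_write_pinned(struct file *file, struct page *page,
					    loff_t pos, size_t len)
	{
		struct bio_vec bvec = {
			.bv_page	= page,
			.bv_len		= len,
			.bv_offset	= 0,
		};
		struct iov_iter iter;
		struct kiocb kiocb;

		iov_iter_bvec(&iter, ITER_SOURCE, &bvec, 1, len);
		init_sync_kiocb(&kiocb, file);
		kiocb.ki_pos = pos;
		kiocb.ki_flags |= IOCB_DIRECT;

		/*
		 * The direct I/O path normally invalidates the pagecache over
		 * this range; that is exactly the part that would have to be
		 * skipped here, since the pinned page must stay where it is.
		 */
		return call_write_iter(file, &kiocb, &iter);
	}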
diff --git a/block/blk.h b/block/blk.h
index 4c3b3325219a..def7ab8379bc 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -384,10 +384,18 @@ static inline bool blk_queue_may_bounce(struct request_queue *q)
 		max_low_pfn >= max_pfn;
 }
 
+static inline bool bio_need_pin_bounce(struct bio *bio,
+		struct request_queue *q)
+{
+	return IS_ENABLED(CONFIG_BOUNCE) &&
+		bio->bi_flags & (1 << BIO_NEED_PIN_BOUNCE);
+}
+
 static inline struct bio *blk_queue_bounce(struct bio *bio,
 		struct request_queue *q)
 {
-	if (unlikely(blk_queue_may_bounce(q) && bio_has_data(bio)))
+	if (unlikely((blk_queue_may_bounce(q) || bio_need_pin_bounce(bio, q)) &&
+		     bio_has_data(bio)))
 		return __blk_queue_bounce(bio, q);
 	return bio;
 }
diff --git a/block/bounce.c b/block/bounce.c
index 7cfcb242f9a1..ebda95953d58 100644
--- a/block/bounce.c
+++ b/block/bounce.c
@@ -207,12 +207,16 @@ struct bio *__blk_queue_bounce(struct bio *bio_orig, struct request_queue *q)
 	struct bvec_iter iter;
 	unsigned i = 0, bytes = 0;
 	bool bounce = false;
+	bool pinned_bounce = bio_orig->bi_flags & (1 << BIO_NEED_PIN_BOUNCE);
+	bool highmem_bounce = blk_queue_may_bounce(q);
 	int sectors;
 
 	bio_for_each_segment(from, bio_orig, iter) {
 		if (i++ < BIO_MAX_VECS)
 			bytes += from.bv_len;
-		if (PageHighMem(from.bv_page))
+		if (highmem_bounce && PageHighMem(from.bv_page))
+			bounce = true;
+		if (pinned_bounce && page_maybe_dma_pinned(from.bv_page))
 			bounce = true;
 	}
 	if (!bounce)
@@ -241,7 +245,8 @@ struct bio *__blk_queue_bounce(struct bio *bio_orig, struct request_queue *q)
 	for (i = 0, to = bio->bi_io_vec; i < bio->bi_vcnt; to++, i++) {
 		struct page *bounce_page;
 
-		if (!PageHighMem(to->bv_page))
+		if (!((highmem_bounce && PageHighMem(to->bv_page)) ||
+		      (pinned_bounce && page_maybe_dma_pinned(to->bv_page))))
 			continue;
 
 		bounce_page = mempool_alloc(&page_pool, GFP_NOIO);
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 99be590f952f..3aa1dc5d8dc6 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -321,6 +321,7 @@ enum {
 	BIO_NO_PAGE_REF,	/* don't put release vec pages */
 	BIO_CLONED,		/* doesn't own data */
 	BIO_BOUNCED,		/* bio is a bounce bio */
+	BIO_NEED_PIN_BOUNCE,	/* bio needs to bounce pinned pages */
 	BIO_QUIET,		/* Make BIO Quiet */
 	BIO_CHAIN,		/* chained bio, ->bi_remaining in effect */
 	BIO_REFFED,		/* bio has elevated ->bi_cnt */
diff --git a/mm/Kconfig b/mm/Kconfig
index ff7b209dec05..eba075e959e8 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -659,11 +659,11 @@ config PHYS_ADDR_T_64BIT
 config BOUNCE
 	bool "Enable bounce buffers"
 	default y
-	depends on BLOCK && MMU && HIGHMEM
+	depends on BLOCK && MMU
 	help
-	  Enable bounce buffers for devices that cannot access the full range of
-	  memory available to the CPU. Enabled by default when HIGHMEM is
-	  selected, but you may say n to override this.
+	  Enable bounce buffers. This is used for devices that cannot access
+	  the full range of memory available to the CPU or when DMA can be
+	  modifying pages while they are submitted for writeback.
 
 config MMU_NOTIFIER
 	bool
When there is direct IO (or another DMA write) running into a page, it is
not generally safe to submit this page for another IO (such as writeback)
because this can cause checksum failures or similar issues.  However,
sometimes we cannot avoid writing out the contents of these pages because
pages can stay pinned for an extended amount of time (e.g. for RDMA).  For
these cases we need to bounce the pages if we really need to write them
out.  Add support for this type of bouncing into the block layer
infrastructure.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 block/blk.h               | 10 +++++++++-
 block/bounce.c            |  9 +++++++--
 include/linux/blk_types.h |  1 +
 mm/Kconfig                |  8 ++++----
 4 files changed, 21 insertions(+), 7 deletions(-)
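Note that the patch only adds the bouncing machinery itself; the new flag
still has to be set by whoever submits writeback I/O for a possibly-pinned
page.  A hypothetical caller-side sketch (not part of this patch) could
look like:

	#include <linux/bio.h>
	#include <linux/mm.h>

	/*
	 * Tag a writeback bio whose folio may be under DMA so that
	 * __blk_queue_bounce() copies the data into a bounce page before the
	 * device sees it, instead of doing I/O on a page whose contents can
	 * still change.
	 */
	static void example_submit_writeback_bio(struct bio *bio,
						 struct folio *folio)
	{
		if (folio_maybe_dma_pinned(folio))
			bio_set_flag(bio, BIO_NEED_PIN_BOUNCE);
		submit_bio(bio);
	}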