Message ID | 20230123173007.325544-11-dhowells@redhat.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | None | expand |
On Mon, Jan 23, 2023 at 05:30:07PM +0000, David Howells wrote: > Renumber FOLL_PIN and FOLL_GET down to bit 0 and 1 respectively so that > they are coincidentally the same as BIO_PAGE_PINNED and BIO_PAGE_REFFED and > also so that they can be stored in the bottom two bits of a page pointer > (something I'm looking at for zerocopy socket fragments). > > (Note that BIO_PAGE_REFFED should probably be got rid of at some point, > hence why FOLL_PIN is at 0.) > > Also renumber down the other FOLL_* flags to close the gaps. Taking the risk of having a long bikeshed: I much prefer (1U << bitnr) style definition that make it obvious where there are holes, but otherwise: Reviewed-by: Christoph Hellwig <hch@lst.de>
On 1/23/23 09:30, David Howells wrote: > Renumber FOLL_PIN and FOLL_GET down to bit 0 and 1 respectively so that > they are coincidentally the same as BIO_PAGE_PINNED and BIO_PAGE_REFFED and > also so that they can be stored in the bottom two bits of a page pointer > (something I'm looking at for zerocopy socket fragments). > > (Note that BIO_PAGE_REFFED should probably be got rid of at some point, > hence why FOLL_PIN is at 0.) > > Also renumber down the other FOLL_* flags to close the gaps. Should we also get these sorted into internal-to-mm and public sets? Because Jason (+Cc) again was about to split them apart into mm/internal.h [1] and that might make that a little cleaner. Also, I don't think that there is any large readability difference either way between 0x and <<1, so whatever you and Christophe settle on there seems fine. So either way with those points, Reviewed-by: John Hubbard <jhubbard@nvidia.com> thanks,
On 1/23/23 19:08, John Hubbard wrote: > On 1/23/23 09:30, David Howells wrote: >> Renumber FOLL_PIN and FOLL_GET down to bit 0 and 1 respectively so that >> they are coincidentally the same as BIO_PAGE_PINNED and BIO_PAGE_REFFED and >> also so that they can be stored in the bottom two bits of a page pointer >> (something I'm looking at for zerocopy socket fragments). >> >> (Note that BIO_PAGE_REFFED should probably be got rid of at some point, >> hence why FOLL_PIN is at 0.) >> >> Also renumber down the other FOLL_* flags to close the gaps. > > Should we also get these sorted into internal-to-mm and public sets? > Because Jason (+Cc) again was about to split them apart into > mm/internal.h [1] and that might make that a little cleaner. I see that I omitted both Jason's Cc and the reference, so I'll resend. Sorry for the extra noise. [1] https://lore.kernel.org/all/fcdb3294-3740-a0e0-b115-12842eb0696d@nvidia.com/ > > Also, I don't think that there is any large readability difference > either way between 0x and <<1, so whatever you and Christophe settle on > there seems fine. > > So either way with those points, > > Reviewed-by: John Hubbard <jhubbard@nvidia.com> > > thanks, thanks,
John Hubbard <jhubbard@nvidia.com> wrote: > > Renumber FOLL_PIN and FOLL_GET down to bit 0 and 1 respectively so that > > they are coincidentally the same as BIO_PAGE_PINNED and BIO_PAGE_REFFED and > > also so that they can be stored in the bottom two bits of a page pointer > > (something I'm looking at for zerocopy socket fragments). > > (Note that BIO_PAGE_REFFED should probably be got rid of at some point, > > hence why FOLL_PIN is at 0.) > > Also renumber down the other FOLL_* flags to close the gaps. > > Should we also get these sorted into internal-to-mm and public sets? > Because Jason (+Cc) again was about to split them apart into > mm/internal.h [1] and that might make that a little cleaner. My plan was to push this patch by itself through akpm since it's only an optimisation and not necessary to the rest of the patches here. David
On Mon, Jan 23, 2023 at 07:11:42PM -0800, John Hubbard wrote: > On 1/23/23 19:08, John Hubbard wrote: > > On 1/23/23 09:30, David Howells wrote: > > > Renumber FOLL_PIN and FOLL_GET down to bit 0 and 1 respectively so that > > > they are coincidentally the same as BIO_PAGE_PINNED and BIO_PAGE_REFFED and > > > also so that they can be stored in the bottom two bits of a page pointer > > > (something I'm looking at for zerocopy socket fragments). > > > > > > (Note that BIO_PAGE_REFFED should probably be got rid of at some point, > > > hence why FOLL_PIN is at 0.) > > > > > > Also renumber down the other FOLL_* flags to close the gaps. > > > > Should we also get these sorted into internal-to-mm and public sets? > > Because Jason (+Cc) again was about to split them apart into > > mm/internal.h [1] and that might make that a little cleaner. > > I see that I omitted both Jason's Cc and the reference, so I'll resend. > Sorry for the extra noise. > > [1] https://lore.kernel.org/all/fcdb3294-3740-a0e0-b115-12842eb0696d@nvidia.com/ Yeah, I already wrote a similar patch, using the 1<< notation, splitting the internal/external, and rebasing on the move to mm_types.. I can certainly drop that patch if we'd rather do this. Though, I'm not so keen on using FOLL_ internal flags inside the block layer.. Can you stick with the BIO versions of these? Jason
On Tue, Jan 24, 2023 at 09:13:30AM -0400, Jason Gunthorpe wrote: > Yeah, I already wrote a similar patch, using the 1<< notation, > splitting the internal/external, and rebasing on the move to > mm_types.. I can certainly drop that patch if we'd rather do this. Given that you are doing more work in that area it might be best to drop this patch from this series. > Though, I'm not so keen on using FOLL_ internal flags inside the block > layer.. Can you stick with the BIO versions of these? The block layer doesn't really use it - the new helper in iov_iter.c returns it, and the block layer instantly turns it into an internal flag. But maybe it should just return a bool pinned (by reference) now?
Jason Gunthorpe <jgg@nvidia.com> wrote: > Though, I'm not so keen on using FOLL_ internal flags inside the block > layer.. Can you stick with the BIO versions of these? It's used in a new iov_iter function and will be used in a bunch of filesystems including network filesystems. I'm also looking at using it in the sk_buff handling also. David
On Tue, Jan 24, 2023 at 05:18:07AM -0800, Christoph Hellwig wrote: > On Tue, Jan 24, 2023 at 09:13:30AM -0400, Jason Gunthorpe wrote: > > Yeah, I already wrote a similar patch, using the 1<< notation, > > splitting the internal/external, and rebasing on the move to > > mm_types.. I can certainly drop that patch if we'd rather do this. > > Given that you are doing more work in that area it might be best > to drop this patch from this series. > > > Though, I'm not so keen on using FOLL_ internal flags inside the block > > layer.. Can you stick with the BIO versions of these? > > The block layer doesn't really use it - the new helper in iov_iter.c > returns it, and the block layer instantly turns it into an internal > flag. But maybe it should just return a bool pinned (by reference) > now? Yes, a bool flag to the new mm helper instead of a FOLL_* seems good to me Thanks, Jason
Jason Gunthorpe <jgg@nvidia.com> wrote: > Yeah, I already wrote a similar patch, using the 1<< notation, > splitting the internal/external, and rebasing on the move to > mm_types.. I can certainly drop that patch if we'd rather do this. Note that I've already given Andrew a patch to move FOLL_* to mm_types.h. I'm happy to go with your patch on top of that if you can just renumber FOLL_PIN to 0 and FOLL_GET to 1. David
On Tue, Jan 24, 2023 at 01:46:23PM +0000, David Howells wrote: > Jason Gunthorpe <jgg@nvidia.com> wrote: > > > Yeah, I already wrote a similar patch, using the 1<< notation, > > splitting the internal/external, and rebasing on the move to > > mm_types.. I can certainly drop that patch if we'd rather do this. > > Note that I've already given Andrew a patch to move FOLL_* to mm_types.h. > > I'm happy to go with your patch on top of that if you can just renumber > FOLL_PIN to 0 and FOLL_GET to 1. I moved FOLL_PIN to internal.h because it is not supposed to be used outside mm/* I would prefer to keep it that way. Jason
Jason Gunthorpe <jgg@nvidia.com> wrote: > I moved FOLL_PIN to internal.h because it is not supposed to be used > outside mm/* > > I would prefer to keep it that way. I'll need to do something else, then, such as creating a new pair of 'cleanup' flags: [include/linux/mm_types.h] #define PAGE_CLEANUP_UNPIN (1U << 0) #define PAGE_CLEANUP_PUT (1U << 0) [mm/gup.h] void folio_put_unpin(struct folio *folio, unsigned int cleanup_flags) { unsigned int gup_flags = 0; cleanup_flags &= PAGE_CLEANUP_UNPIN | PAGE_CLEANUP_PUT; if (!cleanup_flags) return; gup_flags |= cleanup_flags & PAGE_CLEANUP_UNPIN ? FOLL_PIN : 0; gup_flags |= cleanup_flags & PAGE_CLEANUP_PUT ? FOLL_GET : 0; gup_put_folio(folio, 1, flags); } EXPORT_SYMBOL_GPL(folio_put_unpin); David
On Tue, Jan 24, 2023 at 01:57:08PM +0000, David Howells wrote: > Jason Gunthorpe <jgg@nvidia.com> wrote: > > > I moved FOLL_PIN to internal.h because it is not supposed to be used > > outside mm/* > > > > I would prefer to keep it that way. > > I'll need to do something else, then, such as creating a new pair of 'cleanup' > flags: > > [include/linux/mm_types.h] > #define PAGE_CLEANUP_UNPIN (1U << 0) > #define PAGE_CLEANUP_PUT (1U << 0) > > [mm/gup.h] > void folio_put_unpin(struct folio *folio, unsigned int cleanup_flags) > { > unsigned int gup_flags = 0; > > cleanup_flags &= PAGE_CLEANUP_UNPIN | PAGE_CLEANUP_PUT; > if (!cleanup_flags) > return; > gup_flags |= cleanup_flags & PAGE_CLEANUP_UNPIN ? FOLL_PIN : 0; > gup_flags |= cleanup_flags & PAGE_CLEANUP_PUT ? FOLL_GET : 0; > gup_put_folio(folio, 1, flags); > } > EXPORT_SYMBOL_GPL(folio_put_unpin); I suggest: if (cleanup_flags & PAGE_CLEANUP_UNPIN) gup_put_folio(folio, 1, true); else if (cleanup_flags & PAGE_CLEANUP_PUT) gup_put_folio(folio, 1, false); or if you are really counting cycles if (cleanup_flags & PAGE_CLEANUP_NEEDED) gup_put_folio(folio, 1, cleanup_flags & PAGE_CLEANUP_UNPIN) Jason
On Tue, Jan 24, 2023 at 01:57:08PM +0000, David Howells wrote: > [include/linux/mm_types.h] > #define PAGE_CLEANUP_UNPIN (1U << 0) > #define PAGE_CLEANUP_PUT (1U << 0) With the latest series we don't need PAGE_CLEANUP_PUT at all.
Jason Gunthorpe <jgg@nvidia.com> wrote: > if (cleanup_flags & PAGE_CLEANUP_UNPIN) > gup_put_folio(folio, 1, true); > else if (cleanup_flags & PAGE_CLEANUP_PUT) > gup_put_folio(folio, 1, false); gup_put_folio() doesn't take a bool - it takes FOLL_PIN and FOLL_GET: static void gup_put_folio(struct folio *folio, int refs, unsigned int flags) though I could do: if (cleanup_flags & PAGE_CLEANUP_UNPIN) gup_put_folio(folio, 1, FOLL_PIN); else if (cleanup_flags & PAGE_CLEANUP_PUT) gup_put_folio(folio, 1, FOLL_GET); But the reason I wrote it like this: gup_flags |= cleanup_flags & PAGE_CLEANUP_UNPIN ? FOLL_PIN : 0; gup_flags |= cleanup_flags & PAGE_CLEANUP_PUT ? FOLL_GET : 0; is that if PAGE_CLEANUP_UNPIN == FOLL_PIN and PAGE_CLEANUP_PUT == FOLL_GET, the C compiler optimiser should be able to turn all of that into a single AND instruction. David
Christoph Hellwig <hch@infradead.org> wrote: > On Tue, Jan 24, 2023 at 01:57:08PM +0000, David Howells wrote: > > [include/linux/mm_types.h] > > #define PAGE_CLEANUP_UNPIN (1U << 0) > > #define PAGE_CLEANUP_PUT (1U << 0) > > With the latest series we don't need PAGE_CLEANUP_PUT at all. We will when it comes to skbuffs. David
On Tue, Jan 24, 2023 at 02:12:25PM +0000, David Howells wrote: > > With the latest series we don't need PAGE_CLEANUP_PUT at all. > > We will when it comes to skbuffs. I'm a little doubtful of that. But IFF skbuffs need it, we can just change the interface at that point.
On Tue, Jan 24, 2023 at 02:11:50PM +0000, David Howells wrote: > Jason Gunthorpe <jgg@nvidia.com> wrote: > > > if (cleanup_flags & PAGE_CLEANUP_UNPIN) > > gup_put_folio(folio, 1, true); > > else if (cleanup_flags & PAGE_CLEANUP_PUT) > > gup_put_folio(folio, 1, false); > > gup_put_folio() doesn't take a bool - it takes FOLL_PIN and FOLL_GET: My point was to change that to take a 'bool unpin' because FOLL_PIN is not to be used outside gup.c Jason
Christoph Hellwig <hch@infradead.org> wrote: > > > With the latest series we don't need PAGE_CLEANUP_PUT at all. > > > > We will when it comes to skbuffs. > > I'm a little doubtful of that. skbuff fragments will be and will have to be a mixture of allocated pages that need putting and pinned or non-ref'd and non-pinned zerocopy stuff. I have posted a patch that works for the limited amount of driverage that I use on my test machine. Think network filesystem messages where you have a mixture of protocol bits generated by the kernel and data provided by direct I/O being sent by zerocopy (libceph, for example). David
Jason Gunthorpe <jgg@nvidia.com> wrote: > My point was to change that to take a 'bool unpin' because FOLL_PIN is > not to be used outside gup.c I need a 3-state wrapper. But if I can't do it in gup.c, then I'll have to do it elsewhere. As Christoph says, most of the places will only be pinned or not-pinned. The whole point was to avoid generating new constants when existing constants would do. David.
On Tue, Jan 24, 2023 at 02:27:53PM +0000, David Howells wrote: > Jason Gunthorpe <jgg@nvidia.com> wrote: > > > My point was to change that to take a 'bool unpin' because FOLL_PIN is > > not to be used outside gup.c > > I need a 3-state wrapper. But if I can't do it in gup.c, then I'll have to do > it elsewhere. As Christoph says, most of the places will only be pinned or > not-pinned. The whole point was to avoid generating new constants when > existing constants would do. What is the 3rd state? Jason
Jason Gunthorpe <jgg@nvidia.com> wrote:
> What is the 3rd state?
Consider a network filesystem message generated for a direct I/O that the
network filesystem does zerocopy on. You may have an sk_buff that has
fragments from one or more of three different sources:
(1) Fragments consisting of specifically allocated pages, such as the
IP/UDP/TCP headers that have refs taken on them.
(2) Fragments consisting of zerocopy kernel buffers that has neither refs nor
pins belonging to the sk_buff.
iov_iter_extract_pages() will not take pins when extracting from, say, an
XARRAY-type or KVEC-type iterator. iov_iter_extract_mode() will return
0.
(3) Fragments consisting of zerocopy user buffers that have pins taken on
them belonging to the sk_buff.
iov_iter_extract_pages() will take pins when extracting from, say, a
UBUF-type or IOVEC-type iterator. iov_iter_extract_mode() will return
FOLL_PIN (at the moment).
So you have three states: Ref'd, pinned and no-retention.
David
On Tue, Jan 24, 2023 at 02:59:25PM +0000, David Howells wrote: > Jason Gunthorpe <jgg@nvidia.com> wrote: > > > What is the 3rd state? > > Consider a network filesystem message generated for a direct I/O that the > network filesystem does zerocopy on. You may have an sk_buff that has > fragments from one or more of three different sources: > > (1) Fragments consisting of specifically allocated pages, such as the > IP/UDP/TCP headers that have refs taken on them. > > (2) Fragments consisting of zerocopy kernel buffers that has neither refs nor > pins belonging to the sk_buff. > > iov_iter_extract_pages() will not take pins when extracting from, say, an > XARRAY-type or KVEC-type iterator. iov_iter_extract_mode() will return > 0. > > (3) Fragments consisting of zerocopy user buffers that have pins taken on > them belonging to the sk_buff. > > iov_iter_extract_pages() will take pins when extracting from, say, a > UBUF-type or IOVEC-type iterator. iov_iter_extract_mode() will return > FOLL_PIN (at the moment). > > So you have three states: Ref'd, pinned and no-retention. Isn't that this: if (cleanup_flags & PAGE_CLEANUP_NEEDED) gup_put_folio(folio, 1, cleanup_flags & PAGE_CLEANUP_UNPIN) ? Three states - decr get, decr pinned, do nothing? Jason
Jason Gunthorpe <jgg@nvidia.com> wrote: > Isn't that this: > > if (cleanup_flags & PAGE_CLEANUP_NEEDED) > gup_put_folio(folio, 1, cleanup_flags & PAGE_CLEANUP_UNPIN) > > > ? Yes. As claimed, that would be three states. > Three states - decr get, decr pinned, do nothing? Yes. Don't worry about it. I'm not going to do it in gup.c. David
diff --git a/include/linux/mm.h b/include/linux/mm.h index 3de9d88f8524..c95bc4f77e8f 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -3074,26 +3074,28 @@ static inline vm_fault_t vmf_error(int err) struct page *follow_page(struct vm_area_struct *vma, unsigned long address, unsigned int foll_flags); -#define FOLL_WRITE 0x01 /* check pte is writable */ -#define FOLL_TOUCH 0x02 /* mark page accessed */ -#define FOLL_GET 0x04 /* do get_page on page */ -#define FOLL_DUMP 0x08 /* give error on hole if it would be zero */ -#define FOLL_FORCE 0x10 /* get_user_pages read/write w/o permission */ -#define FOLL_NOWAIT 0x20 /* if a disk transfer is needed, start the IO +#define FOLL_PIN 0x01 /* pages must be released via unpin_user_page */ +#define FOLL_GET 0x02 /* do get_page on page (equivalent to BIO_FOLL_GET) */ +#define FOLL_WRITE 0x04 /* check pte is writable */ +#define FOLL_TOUCH 0x08 /* mark page accessed */ +#define FOLL_DUMP 0x10 /* give error on hole if it would be zero */ +#define FOLL_FORCE 0x20 /* get_user_pages read/write w/o permission */ +#define FOLL_NOWAIT 0x40 /* if a disk transfer is needed, start the IO * and return without waiting upon it */ #define FOLL_NOFAULT 0x80 /* do not fault in pages */ #define FOLL_HWPOISON 0x100 /* check page is hwpoisoned */ -#define FOLL_TRIED 0x800 /* a retry, previous pass started an IO */ -#define FOLL_REMOTE 0x2000 /* we are working on non-current tsk/mm */ -#define FOLL_ANON 0x8000 /* don't do file mappings */ -#define FOLL_LONGTERM 0x10000 /* mapping lifetime is indefinite: see below */ -#define FOLL_SPLIT_PMD 0x20000 /* split huge pmd before returning */ -#define FOLL_PIN 0x40000 /* pages must be released via unpin_user_page */ -#define FOLL_FAST_ONLY 0x80000 /* gup_fast: prevent fall-back to slow gup */ -#define FOLL_PCI_P2PDMA 0x100000 /* allow returning PCI P2PDMA pages */ -#define FOLL_INTERRUPTIBLE 0x200000 /* allow interrupts from generic signals */ +#define FOLL_TRIED 0x200 /* a retry, previous pass started an IO */ +#define FOLL_REMOTE 0x400 /* we are working on non-current tsk/mm */ +#define FOLL_ANON 0x800 /* don't do file mappings */ +#define FOLL_LONGTERM 0x1000 /* mapping lifetime is indefinite: see below */ +#define FOLL_SPLIT_PMD 0x2000 /* split huge pmd before returning */ +#define FOLL_FAST_ONLY 0x4000 /* gup_fast: prevent fall-back to slow gup */ +#define FOLL_PCI_P2PDMA 0x8000 /* allow returning PCI P2PDMA pages */ +#define FOLL_INTERRUPTIBLE 0x10000 /* allow interrupts from generic signals */ /* + * Note that FOLL_PIN is sorted to bit 0 to be coincident with BIO_PAGE_PINNED. + * * FOLL_PIN and FOLL_LONGTERM may be used in various combinations with each * other. Here is what they mean, and how to use them: *
Renumber FOLL_PIN and FOLL_GET down to bit 0 and 1 respectively so that they are coincidentally the same as BIO_PAGE_PINNED and BIO_PAGE_REFFED and also so that they can be stored in the bottom two bits of a page pointer (something I'm looking at for zerocopy socket fragments). (Note that BIO_PAGE_REFFED should probably be got rid of at some point, hence why FOLL_PIN is at 0.) Also renumber down the other FOLL_* flags to close the gaps. Signed-off-by: David Howells <dhowells@redhat.com> cc: Al Viro <viro@zeniv.linux.org.uk> cc: Christoph Hellwig <hch@lst.de> cc: Matthew Wilcox <willy@infradead.org> cc: linux-fsdevel@vger.kernel.org cc: linux-mm@kvack.org --- Notes: ver #8) - Put FOLL_PIN at bit 0 and FOLL_GET at bit 1 to match BIO_PAGE_*. - Renumber the remaining flags down to fill in the gap. include/linux/mm.h | 32 +++++++++++++++++--------------- 1 file changed, 17 insertions(+), 15 deletions(-)