Message ID: 20250328211349.845857-1-vishal.moola@gmail.com
Series: Introduce vmap_file()
Hi Vishal,

On 2025/3/29 05:13, Vishal Moola (Oracle) wrote:
> Currently, users have to call vmap() or vmap_pfn() to map pages to
> kernel virtual space. vmap_pfn() is for special pages (i.e. pfns
> without struct page). vmap() handles normal pages.
>
> With large folios, we may want to map ranges that only span
> part of a folio (i.e. mapping half of a 2Mb folio).
> vmap_file() will allow us to do so.

You mention vmap_file() can support vmapping a range of a folio, but when
I look at the code I can't figure out how to use it that way; maybe I
missed something? :)

Also, this API is still aimed at file vmap, so it may not be suitable for
the problem I mentioned in:

https://lore.kernel.org/lkml/20250312061513.1126496-1-link@vivo.com/

Thanks,
Huan Yang

> Create a function, vmap_file(), to map a specified range of a given
> file to kernel virtual space. vmap_file() is an in-kernel equivalent
> to mmap(), and can be useful for filesystems.
>
> ---
> v2:
>  - Reword cover letter to provide a clearer overview of the current
>    vmalloc APIs, and usefulness of vmap_file()
>  - EXPORT_SYMBOL_GPL() instead of EXPORT_SYMBOL()
>  - Provide support to partially map file folios
>  - Demote this to RFC while we look for users
> --
> I don't have a user for this function right now, but it will be
> useful as users start converting to using large folios. I'm just
> putting it out here for anyone that may find a use for it.
>
> This seems like the sensible way to implement it, but I'm open
> to tweaking the function's semantics.
>
> I've Cc-ed a couple people that mentioned they might be interested
> in using it.
>
> Vishal Moola (Oracle) (1):
>   mm/vmalloc: Introduce vmap_file()
>
>  include/linux/vmalloc.h |   2 +
>  mm/vmalloc.c            | 113 ++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 115 insertions(+)
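[Editor's sketch: the cover letter does not show vmap_file()'s prototype, so the signature below is purely an assumption for illustration of the described "in-kernel mmap()" semantics; only vunmap() and the SZ_* constants are real kernel APIs.]

```c
/*
 * Hypothetical usage sketch -- assumed signature, not the posted patch:
 *	void *vmap_file(struct file *file, loff_t start, loff_t end);
 */
#include <linux/fs.h>
#include <linux/sizes.h>
#include <linux/vmalloc.h>

/* Map only the first half of the 2Mb folio at file offset 2M. */
static void *map_half_folio(struct file *file)
{
	void *addr = vmap_file(file, SZ_2M, SZ_2M + SZ_1M);

	if (!addr)
		return NULL;
	/* ... read/write through the mapping ... */
	return addr;	/* caller eventually unmaps with vunmap(addr) */
}
```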
On Mon, Mar 31, 2025 at 10:05:53AM +0800, Huan Yang wrote:
> Hi Vishal,
>
> On 2025/3/29 05:13, Vishal Moola (Oracle) wrote:
> > With large folios, we may want to map ranges that only span
> > part of a folio (i.e. mapping half of a 2Mb folio).
> > vmap_file() will allow us to do so.
>
> You mention vmap_file() can support vmapping a range of a folio, but
> when I look at the code I can't figure out how to use it that way;
> maybe I missed something? :)

I took a look at the udma-buf code. Rather than iterating through the
folios using pfns, you can calculate the corresponding file offsets
(maybe you already have them?) to map the desired folios.

> Also, this API is still aimed at file vmap, so it may not be suitable
> for the problem I mentioned in:
>
> https://lore.kernel.org/lkml/20250312061513.1126496-1-link@vivo.com/

I'm not sure which problem you're referring to, could you be more
specific?

[...]
On 2025/4/1 09:50, Vishal Moola (Oracle) wrote:
> On Mon, Mar 31, 2025 at 10:05:53AM +0800, Huan Yang wrote:
> > You mention vmap_file() can support vmapping a range of a folio, but
> > when I look at the code I can't figure out how to use it that way;
> > maybe I missed something? :)
>
> I took a look at the udma-buf code. Rather than iterating through the
> folios using pfns, you can calculate the corresponding file offsets
> (maybe you already have them?) to map the desired folios.

Currently udmabuf's folios are not simply file-based (even though all the
memory comes from a memfd). Users can hand udmabuf arbitrary ranges of a
memfd. For example:

Given a 4M memfd, a user may split it into [0, 2M), [1M, 2M) and
[2M, 4M), so the 1M-2M range repeats. These ranges are gathered by
udmabuf_create_list() and then used by udmabuf, which records them as a
folio array plus an offset array. I don't think a vmap_file() based on an
address_space range can help here.

> > Also, this API is still aimed at file vmap, so it may not be suitable
> > for the problem I mentioned in:
> >
> > https://lore.kernel.org/lkml/20250312061513.1126496-1-link@vivo.com/
>
> I'm not sure which problem you're referring to, could you be more
> specific?

1. udmabuf's usage is not the same as a plain file vmap.

2. udmabuf can't use page structs if HVO hugetlb is enabled and in use.
   It still needs a pfn-based vmap or an offset-based range vmap of the
   folios (or to simply reject vmap of HVO folios). :)

Thanks,
Huan Yang

[...]
On Tue, Apr 01, 2025 at 10:21:46AM +0800, Huan Yang wrote:
> Given a 4M memfd, a user may split it into [0, 2M), [1M, 2M) and
> [2M, 4M), so the 1M-2M range repeats. These ranges are gathered by
> udmabuf_create_list() and then used by udmabuf, which records them as
> a folio array plus an offset array.

I was thinking you could call vmap_file() on every sub-range and use
those addresses. It should work; we'd have to look at making udmabuf's
APIs support it.

> I don't think a vmap_file() based on an address_space range can help
> here.

I'm not familiar with the memfd/gup code yet, but I'm fairly confident
those memfds will have an associated ->f_mapping that would suffice.
They are file descriptors after all.

> 2. udmabuf can't use page structs if HVO hugetlb is enabled and in
>    use.

vmap_file() doesn't depend on tail page structs.

[...]
On 2025/4/1 11:19, Vishal Moola (Oracle) wrote:
> On Tue, Apr 01, 2025 at 10:21:46AM +0800, Huan Yang wrote:
> > Given a 4M memfd, a user may split it into [0, 2M), [1M, 2M) and
> > [2M, 4M), so the 1M-2M range repeats. udmabuf records these ranges
> > as a folio array plus an offset array.
>
> I was thinking you could call vmap_file() on every sub-range and use
> those addresses. It should work; we'd have to look at making udmabuf's
> APIs support it.

Hmmm, how would we get a contiguous virtual address? Or is there a way
to merge the addresses returned by each split vmap? IMO a user who
invokes vmap wants each piece of scattered memory mapped into one
contiguous virtual address range, and with your suggestion I don't think
we can do that. :)

> > I don't think a vmap_file() based on an address_space range can help
> > here.
>
> I'm not familiar with the memfd/gup code yet, but I'm fairly confident
> those memfds will have an associated ->f_mapping that would suffice.
> They are file descriptors after all.

Agree with this.

[...]
On Tue, Apr 01, 2025 at 02:08:53PM +0800, Huan Yang wrote:
> > I was thinking you could call vmap_file() on every sub-range and use
> > those addresses. It should work; we'd have to look at making
> > udmabuf's APIs support it.
>
> Hmmm, how would we get a contiguous virtual address? Or is there a way
> to merge the addresses returned by each split vmap?

The patch in question maps the whole file to contiguous memory as far as
I can see, but I may have missed something. A partial-populate technique
requires getting an area first and then partly populating it.

As I see it, we already have something similar:

<snip>
/**
 * vm_area_map_pages - map pages inside given sparse vm_area
 * @area: vm_area
 * @start: start address inside vm_area
 * @end: end address inside vm_area
 * @pages: pages to map (always PAGE_SIZE pages)
 */
int vm_area_map_pages(struct vm_struct *area, unsigned long start,
		      unsigned long end, struct page **pages)
{
...
<snip>

It is used by BPF.

--
Uladzislau Rezki
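[Editor's sketch of the reserve-then-populate pattern Uladzislau describes, loosely following how the BPF arena uses these APIs: get_vm_area() and vm_area_map_pages() are real kernel functions (the latter requires a VM_SPARSE area); the helper name, the choice to populate only half the area, and the elided error handling are illustrative assumptions.]

```c
#include <linux/vmalloc.h>

/* Reserve one contiguous virtual area, then map pages into part of it. */
static void *map_subranges(struct page **pages, unsigned long npages)
{
	struct vm_struct *area;
	unsigned long start, end;

	area = get_vm_area(npages << PAGE_SHIFT, VM_SPARSE);
	if (!area)
		return NULL;

	/* Populate only a sub-range of the reservation -- here, the
	 * first half; further calls could fill in other sub-ranges. */
	start = (unsigned long)area->addr;
	end = start + ((npages / 2) << PAGE_SHIFT);
	if (vm_area_map_pages(area, start, end, pages))
		return NULL;

	return area->addr;
}
```

This is the piece Huan's use case is missing from per-sub-range vmap calls: the single reservation is what makes the final virtual range contiguous.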
On 2025/4/1 17:47, Uladzislau Rezki wrote:
> On Tue, Apr 01, 2025 at 02:08:53PM +0800, Huan Yang wrote:
> > > > Given a 4M memfd, a user may split it into [0, 2M), [1M, 2M) and
> > > > [2M, 4M), so the 1M-2M range repeats. udmabuf records these
> > > > ranges as a folio array plus an offset array.

Here, :)

> > > I was thinking you could call vmap_file() on every sub-range and
> > > use those addresses. It should work; we'd have to look at making
> > > udmabuf's APIs support it.
> >
> > Hmmm, how would we get a contiguous virtual address? Or is there a
> > way to merge the addresses returned by each split vmap?
>
> The patch in question maps the whole file to contiguous memory as far
> as I can see, but I may have missed something. A partial-populate
> technique requires getting an area first

Hmm, maybe you missed the earlier discussion; I pointed to it above. :)

> and then partly populating it.
>
> As I see it, we already have something similar:
> [...]
> It is used by BPF.
>
> --
> Uladzislau Rezki
On Tue, Apr 01, 2025 at 07:09:57PM +0800, Huan Yang wrote:
> On 2025/4/1 17:47, Uladzislau Rezki wrote:
> > The patch in question maps the whole file to contiguous memory as
> > far as I can see, but I may have missed something. A partial-populate
> > technique requires getting an area first
>
> Hmm, maybe you missed the earlier discussion; I pointed to it above. :)

I pointed to how BPF does it; hopefully it gives you both some extra
input.

--
Uladzislau Rezki
On Tue, Apr 01, 2025 at 02:08:53PM +0800, Huan Yang wrote:
> > I was thinking you could call vmap_file() on every sub-range and use
> > those addresses. It should work; we'd have to look at making
> > udmabuf's APIs support it.
>
> Hmmm, how would we get a contiguous virtual address? Or is there a way
> to merge the addresses returned by each split vmap?

I'm not sure, I'd have to take a look at that. Maybe, going into a large
folio world, that would be a useful expansion of the APIs?

> IMO a user who invokes vmap wants each piece of scattered memory
> mapped into one contiguous virtual address range, and with your
> suggestion I don't think we can do that. :)

We could discuss vmap_file() supporting a series of offsets to map
portions of a file; I think that's a reasonable ask for the general API.
We could potentially do multiple files as well, but things start getting
really complex at that point, so I'd like to avoid that.

The udmabuf code looks to be doing some buggy stuff, so I'd prefer we
look at fixing/reworking that before hacking in a 'generic' API just so
it can keep doing it.
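[Editor's sketch: one purely hypothetical shape the "series of offsets" extension discussed in this thread could take. Neither the struct nor the function below exists in the posted patch or in the kernel; every name is invented for illustration.]

```c
#include <linux/fs.h>

/* One file sub-range to map; ranges may overlap, as in the udmabuf case. */
struct vmap_file_range {
	loff_t start;	/* file offset, inclusive */
	loff_t end;	/* file offset, exclusive */
};

/*
 * Map each range of @file, back to back and in order, into a single
 * contiguous kernel virtual area; return its base address.
 */
void *vmap_file_ranges(struct file *file,
		       const struct vmap_file_range *ranges,
		       unsigned int nr_ranges);
```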