Message ID | 20231212073803.3233055-4-vivek.kasireddy@intel.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | mm/gup: Introduce memfd_pin_folios() for pinning memfd folios (v7) | expand |
On 12.12.23 08:38, Vivek Kasireddy wrote: > For drivers that would like to longterm-pin the folios associated > with a memfd, the memfd_pin_folios() API provides an option to > not only pin the folios via FOLL_PIN but also to check and migrate > them if they reside in movable zone or CMA block. This API > currently works with memfds but it should work with any files > that belong to either shmemfs or hugetlbfs. Files belonging to > other filesystems are rejected for now. > > The folios need to be located first before pinning them via FOLL_PIN. > If they are found in the page cache, they can be immediately pinned. > Otherwise, they need to be allocated using the filesystem specific > APIs and then pinned. > > v2: > - Drop gup_flags and improve comments and commit message (David) > - Allocate a page if we cannot find in page cache for the hugetlbfs > case as well (David) > - Don't unpin pages if there is a migration related failure (David) > - Drop the unnecessary nr_pages <= 0 check (Jason) > - Have the caller of the API pass in file * instead of fd (Jason) > > v3: (David) > - Enclose the huge page allocation code with #ifdef CONFIG_HUGETLB_PAGE > (Build error reported by kernel test robot <lkp@intel.com>) > - Don't forget memalloc_pin_restore() on non-migration related errors > - Improve the readability of the cleanup code associated with > non-migration related errors > - Augment the comments by describing FOLL_LONGTERM like behavior > - Include the R-b tag from Jason > > v4: > - Remove the local variable "page" and instead use 3 return statements > in alloc_file_page() (David) > - Add the R-b tag from David > > v5: (David) > - For hugetlb case, ensure that we only obtain head pages from the > mapping by using __filemap_get_folio() instead of find_get_page_flags() > - Handle -EEXIST when two or more potential users try to simultaneously > add a huge page to the mapping by forcing them to retry on failure > > v6: (Christoph) > - Rename this API to memfd_pin_user_pages() to make it clear that it > is intended for memfds > - Move the memfd page allocation helper from gup.c to memfd.c > - Fix indentation errors in memfd_pin_user_pages() > - For contiguous ranges of folios, use a helper such as > filemap_get_folios_contig() to lookup the page cache in batches > > v7: > - Rename this API to memfd_pin_folios() and make it return folios > and offsets instead of pages (David) > - Don't continue processing the folios in the batch returned by > filemap_get_folios_contig() if they do not have correct next_idx > - Add the R-b tag from Christoph > Sorry, I'm still not happy about the current state, because (1) the folio vs. pages handling is still mixed (2) we're returning+pinning a large folio multiple times. See below if there is an easy way to clean this up. >> @@ -5,6 +5,7 @@ > #include <linux/spinlock.h> > > #include <linux/mm.h> > +#include <linux/memfd.h> > #include <linux/memremap.h> > #include <linux/pagemap.h> > #include <linux/rmap.h> > @@ -17,6 +18,7 @@ > #include <linux/hugetlb.h> > #include <linux/migrate.h> > #include <linux/mm_inline.h> > +#include <linux/pagevec.h> > #include <linux/sched/mm.h> > #include <linux/shmem_fs.h> > > @@ -3410,3 +3412,156 @@ long pin_user_pages_unlocked(unsigned long start, unsigned long nr_pages, > &locked, gup_flags); > } > EXPORT_SYMBOL(pin_user_pages_unlocked); > + > +/** > + * memfd_pin_folios() - pin folios associated with a memfd > + * @memfd: the memfd whose folios are to be pinned > + * @start: starting memfd offset > + * @nr_pages: number of pages from start to pin We're not pinning pages. An inclusive range [start, end] would be clearer. > + * @folios: array that receives pointers to the folios pinned. > + * Should be at-least nr_pages long. > + * @offsets: array that receives offsets of pages in their folios. > + * Should be at-least nr_pages long. See below, I'm wondering if this is really required once we return each folio only once. > + * > + * Attempt to pin folios associated with a memfd; given that a memfd is > + * either backed by shmem or hugetlb, the folios can either be found in > + * the page cache or need to be allocated if necessary. Once the folios > + * are located, they are all pinned via FOLL_PIN and the @offsets array > + * is populated with offsets of the pages in their respective folios. > + * Therefore, for each page the caller requested, there will be a > + * corresponding entry in both @folios and @offsets. And, eventually, > + * these pinned folios need to be released either using unpin_user_pages() > + * or unpin_user_page(). Oh, we don't have a folio function yet? Should be easy to add, and we'd really add them. > + * > + * It must be noted that the folios may be pinned for an indefinite amount > + * of time. And, in most cases, the duration of time they may stay pinned > + * would be controlled by the userspace. This behavior is effectively the > + * same as using FOLL_LONGTERM with other GUP APIs. > + * > + * Returns number of folios pinned. This would be equal to the number of > + * pages requested. If no folios were pinned, it returns -errno. > + */ > +long memfd_pin_folios(struct file *memfd, unsigned long start, > + unsigned long nr_pages, struct folio **folios, > + pgoff_t *offsets) > +{ > + unsigned long end = start + (nr_pages << PAGE_SHIFT) - 1; > + unsigned int max_pgs, pgoff, pgshift = PAGE_SHIFT; > + pgoff_t start_idx, end_idx, next_idx; > + unsigned int flags, nr_folios, i, j; > + struct folio *folio = NULL; > + struct folio_batch fbatch; > + struct page **pages; > + struct hstate *h; > + long ret; > + > + if (!nr_pages) > + return -EINVAL; > + > + if (!memfd) > + return -EINVAL; > + > + if (!shmem_file(memfd) && !is_file_hugepages(memfd)) > + return -EINVAL; > + > + pages = kmalloc_array(nr_pages, sizeof(*pages), GFP_KERNEL); > + if (!pages) > + return -ENOMEM; > + > + if (is_file_hugepages(memfd)) { > + h = hstate_file(memfd); > + pgshift = huge_page_shift(h); > + } > + > + flags = memalloc_pin_save(); > + do { > + i = 0; > + start_idx = start >> pgshift; > + end_idx = end >> pgshift; > + if (is_file_hugepages(memfd)) { > + start_idx <<= huge_page_order(h); > + end_idx <<= huge_page_order(h); > + } > + > + folio_batch_init(&fbatch); > + while (start_idx <= end_idx) { > + /* > + * In most cases, we should be able to find the folios > + * in the page cache. If we cannot find them for some > + * reason, we try to allocate them and add them to the > + * page cache. > + */ > + nr_folios = filemap_get_folios_contig(memfd->f_mapping, > + &start_idx, > + end_idx, > + &fbatch); > + if (folio) { > + folio_put(folio); > + folio = NULL; > + } > + > + next_idx = 0; > + for (j = 0; j < nr_folios; j++) { > + if (next_idx && > + next_idx != folio_index(fbatch.folios[j])) > + continue; > + > + folio = try_grab_folio(&fbatch.folios[j]->page, > + 1, FOLL_PIN); > + if (!folio) { > + folio_batch_release(&fbatch); > + kfree(pages); > + goto err; > + } > + > + max_pgs = folio_nr_pages(folio); > + if (i == 0) { > + pgoff = offset_in_folio(folio, start); > + pgoff >>= PAGE_SHIFT; > + } > + > + do { > + folios[i] = folio; > + offsets[i] = pgoff << PAGE_SHIFT; > + pages[i] = folio_page(folio, 0); > + folio_add_pin(folio); > + > + pgoff++; > + i++; > + } while (pgoff < max_pgs && i < nr_pages); > + > + pgoff = 0; > + next_idx = folio_next_index(folio); > + gup_put_folio(folio, 1, FOLL_PIN); > + } > + > + folio = NULL; > + folio_batch_release(&fbatch); > + if (!nr_folios) { > + folio = memfd_alloc_folio(memfd, start_idx); > + if (IS_ERR(folio)) { > + ret = PTR_ERR(folio); > + if (ret != -EEXIST) { > + kfree(pages); > + goto err; > + } > + } > + } > + } > + > + ret = check_and_migrate_movable_pages(nr_pages, pages); Having a folio variant would avoid having to mess with pages here at all. Further, we're now returning+pinning the same folio multiple times, instead of just once like the folio batching variant would. I'm wondering if the following wouldn't make more sense, assuming we add check_and_migrate_movable_folios(), which should be pretty easy to add. Obviously untested, just to express what I have in mind: /** * memfd_pin_folios() - pin folios associated with a memfd * @memfd: the memfd whose folios are to be pinned * @start: the starting memfd offset * @end: the final memfd offset (inclusive) * @folios: array that receives pointers to the folios pinned * @max_folios: the number of entries in the array for folios * @offsets: the offset into the first folio * * Attempt to pin folios associated with a memfd; given that a memfd is * either backed by shmem or hugetlb, the folios can either be found in * the page cache or need to be allocated if necessary. Once the folios * are located, they are all pinned via FOLL_PIN and @offset is populated * with the offset into the first folio. * * Pinned folios must be released using unpin_folio() or unpin_folios(). * * It must be noted that the folios may be pinned for an indefinite amount * of time. And, in most cases, the duration of time they may stay pinned * would be controlled by the userspace. This behavior is effectively the * same as using FOLL_LONGTERM with other GUP APIs. * * Returns number of folios pinned, which might be less than @max_folios * only if the whole range was pinned. If no folios were pinned, it returns * -errno. */ long memfd_pin_folios(struct file *memfd, unsigned long start, unsigned long end, struct folio **folios, unsigned int max_folios, unsigned long *offset) { unsigned int pgshift = PAGE_SHIFT; unsigned int flags, nr_folios, cur_folios, i; pgoff_t start_idx, end_idx; struct folio_batch fbatch; struct folio *folio; struct hstate *h; long ret; if (start > end || !max_folios) return -EINVAL; if (!memfd) return -EINVAL; if (!shmem_file(memfd) && !is_file_hugepages(memfd)) return -EINVAL; if (is_file_hugepages(memfd)) { h = hstate_file(memfd); pgshift = huge_page_shift(h); } flags = memalloc_pin_save(); folio_batch_init(&fbatch); do { nr_folios = 0; start_idx = start >> pgshift; end_idx = end >> pgshift; if (is_file_hugepages(memfd)) { start_idx <<= huge_page_order(h); end_idx <<= huge_page_order(h); } while (start_idx <= end_idx) { /* * In most cases, we should be able to find the folios * in the page cache. If we cannot find them for some * reason, we try to allocate them and add them to the * page cache. */ folio_batch_release(&fbatch); cur_folios = filemap_get_folios_contig(memfd->f_mapping, &start_idx, end_idx, &fbatch); if (!cur_folios) { folio = memfd_alloc_folio(memfd, start_idx); if (IS_ERR(folio)) { ret = PTR_ERR(folio); if (ret != -EEXIST) goto err; } folio_put(folio); continue; } /* Let's pin each folio, which shouldn't really fail. */ for (i = 0; i < cur_folios; i++) { folio = try_grab_folio(&fbatch.folios[i]->page, 1, FOLL_PIN); if (!folio) goto err; if (!nr_folios) *offset = offset_in_folio(folio, start); folios[nr_folios++] = folio; if (max_folios == nr_folios) break; } if (max_folios == nr_folios) break; } folio_batch_release(&fbatch); ret = check_and_migrate_movable_folios(nr_folios, folios); } while (ret == -EAGAIN); memalloc_pin_restore(flags); return ret ? ret : nr_folios; err: folio_batch_release(&fbatch); memalloc_pin_restore(flags); while (i-- > 0) if (folios[i]) gup_put_folio(folios[i], 1, FOLL_PIN); return ret; } EXPORT_SYMBOL_GPL(memfd_pin_folios); I'm still wondering about the offset handling, though. Could it happen that why we are repeatedly calling filemap_get_folios_contig(), that we would need offset!=0 on any of the other folios besides the first one? My current understanding (and looking at filemap_get_folios_contig()) is: no. I'm primarily concerned about concurrent fallocate(PUNCH_HOLE) and THP collapse/splitting.
Hi Vivek, kernel test robot noticed the following build errors: [auto build test ERROR on akpm-mm/mm-everything] url: https://github.com/intel-lab-lkp/linux/commits/Vivek-Kasireddy/udmabuf-Use-vmf_insert_pfn-and-VM_PFNMAP-for-handling-mmap/20231212-160312 base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything patch link: https://lore.kernel.org/r/20231212073803.3233055-4-vivek.kasireddy%40intel.com patch subject: [PATCH v7 3/6] mm/gup: Introduce memfd_pin_folios() for pinning memfd folios (v7) config: x86_64-allnoconfig (https://download.01.org/0day-ci/archive/20231212/202312122109.SxDFnBaq-lkp@intel.com/config) compiler: gcc-12 (Debian 12.2.0-14) 12.2.0 reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231212/202312122109.SxDFnBaq-lkp@intel.com/reproduce) If you fix the issue in a separate patch/commit (i.e. not just a new version of the same patch/commit), kindly add following tags | Reported-by: kernel test robot <lkp@intel.com> | Closes: https://lore.kernel.org/oe-kbuild-all/202312122109.SxDFnBaq-lkp@intel.com/ All errors (new ones prefixed by >>): mm/gup.c: In function 'memfd_pin_folios': >> mm/gup.c:3543:39: error: assignment to 'struct folio *' from incompatible pointer type 'struct page *' [-Werror=incompatible-pointer-types] 3543 | folio = memfd_alloc_folio(memfd, start_idx); | ^ cc1: some warnings being treated as errors vim +3543 mm/gup.c 3417 3418 /** 3419 * memfd_pin_folios() - pin folios associated with a memfd 3420 * @memfd: the memfd whose folios are to be pinned 3421 * @start: starting memfd offset 3422 * @nr_pages: number of pages from start to pin 3423 * @folios: array that receives pointers to the folios pinned. 3424 * Should be at-least nr_pages long. 3425 * @offsets: array that receives offsets of pages in their folios. 3426 * Should be at-least nr_pages long. 3427 * 3428 * Attempt to pin folios associated with a memfd; given that a memfd is 3429 * either backed by shmem or hugetlb, the folios can either be found in 3430 * the page cache or need to be allocated if necessary. Once the folios 3431 * are located, they are all pinned via FOLL_PIN and the @offsets array 3432 * is populated with offsets of the pages in their respective folios. 3433 * Therefore, for each page the caller requested, there will be a 3434 * corresponding entry in both @folios and @offsets. And, eventually, 3435 * these pinned folios need to be released either using unpin_user_pages() 3436 * or unpin_user_page(). 3437 * 3438 * It must be noted that the folios may be pinned for an indefinite amount 3439 * of time. And, in most cases, the duration of time they may stay pinned 3440 * would be controlled by the userspace. This behavior is effectively the 3441 * same as using FOLL_LONGTERM with other GUP APIs. 3442 * 3443 * Returns number of folios pinned. This would be equal to the number of 3444 * pages requested. If no folios were pinned, it returns -errno. 3445 */ 3446 long memfd_pin_folios(struct file *memfd, unsigned long start, 3447 unsigned long nr_pages, struct folio **folios, 3448 pgoff_t *offsets) 3449 { 3450 unsigned long end = start + (nr_pages << PAGE_SHIFT) - 1; 3451 unsigned int max_pgs, pgoff, pgshift = PAGE_SHIFT; 3452 pgoff_t start_idx, end_idx, next_idx; 3453 unsigned int flags, nr_folios, i, j; 3454 struct folio *folio = NULL; 3455 struct folio_batch fbatch; 3456 struct page **pages; 3457 struct hstate *h; 3458 long ret; 3459 3460 if (!nr_pages) 3461 return -EINVAL; 3462 3463 if (!memfd) 3464 return -EINVAL; 3465 3466 if (!shmem_file(memfd) && !is_file_hugepages(memfd)) 3467 return -EINVAL; 3468 3469 pages = kmalloc_array(nr_pages, sizeof(*pages), GFP_KERNEL); 3470 if (!pages) 3471 return -ENOMEM; 3472 3473 if (is_file_hugepages(memfd)) { 3474 h = hstate_file(memfd); 3475 pgshift = huge_page_shift(h); 3476 } 3477 3478 flags = memalloc_pin_save(); 3479 do { 3480 i = 0; 3481 start_idx = start >> pgshift; 3482 end_idx = end >> pgshift; 3483 if (is_file_hugepages(memfd)) { 3484 start_idx <<= huge_page_order(h); 3485 end_idx <<= huge_page_order(h); 3486 } 3487 3488 folio_batch_init(&fbatch); 3489 while (start_idx <= end_idx) { 3490 /* 3491 * In most cases, we should be able to find the folios 3492 * in the page cache. If we cannot find them for some 3493 * reason, we try to allocate them and add them to the 3494 * page cache. 3495 */ 3496 nr_folios = filemap_get_folios_contig(memfd->f_mapping, 3497 &start_idx, 3498 end_idx, 3499 &fbatch); 3500 if (folio) { 3501 folio_put(folio); 3502 folio = NULL; 3503 } 3504 3505 next_idx = 0; 3506 for (j = 0; j < nr_folios; j++) { 3507 if (next_idx && 3508 next_idx != folio_index(fbatch.folios[j])) 3509 continue; 3510 3511 folio = try_grab_folio(&fbatch.folios[j]->page, 3512 1, FOLL_PIN); 3513 if (!folio) { 3514 folio_batch_release(&fbatch); 3515 kfree(pages); 3516 goto err; 3517 } 3518 3519 max_pgs = folio_nr_pages(folio); 3520 if (i == 0) { 3521 pgoff = offset_in_folio(folio, start); 3522 pgoff >>= PAGE_SHIFT; 3523 } 3524 3525 do { 3526 folios[i] = folio; 3527 offsets[i] = pgoff << PAGE_SHIFT; 3528 pages[i] = folio_page(folio, 0); 3529 folio_add_pin(folio); 3530 3531 pgoff++; 3532 i++; 3533 } while (pgoff < max_pgs && i < nr_pages); 3534 3535 pgoff = 0; 3536 next_idx = folio_next_index(folio); 3537 gup_put_folio(folio, 1, FOLL_PIN); 3538 } 3539 3540 folio = NULL; 3541 folio_batch_release(&fbatch); 3542 if (!nr_folios) { > 3543 folio = memfd_alloc_folio(memfd, start_idx);
Hi Vivek, kernel test robot noticed the following build errors: [auto build test ERROR on akpm-mm/mm-everything] url: https://github.com/intel-lab-lkp/linux/commits/Vivek-Kasireddy/udmabuf-Use-vmf_insert_pfn-and-VM_PFNMAP-for-handling-mmap/20231212-160312 base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything patch link: https://lore.kernel.org/r/20231212073803.3233055-4-vivek.kasireddy%40intel.com patch subject: [PATCH v7 3/6] mm/gup: Introduce memfd_pin_folios() for pinning memfd folios (v7) config: i386-allnoconfig (https://download.01.org/0day-ci/archive/20231212/202312122339.viQUjwIW-lkp@intel.com/config) compiler: clang version 16.0.4 (https://github.com/llvm/llvm-project.git ae42196bc493ffe877a7e3dff8be32035dea4d07) reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231212/202312122339.viQUjwIW-lkp@intel.com/reproduce) If you fix the issue in a separate patch/commit (i.e. not just a new version of the same patch/commit), kindly add following tags | Reported-by: kernel test robot <lkp@intel.com> | Closes: https://lore.kernel.org/oe-kbuild-all/202312122339.viQUjwIW-lkp@intel.com/ All errors (new ones prefixed by >>): >> mm/gup.c:3543:11: error: incompatible pointer types assigning to 'struct folio *' from 'struct page *' [-Werror,-Wincompatible-pointer-types] folio = memfd_alloc_folio(memfd, start_idx); ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 1 error generated. vim +3543 mm/gup.c 3417 3418 /** 3419 * memfd_pin_folios() - pin folios associated with a memfd 3420 * @memfd: the memfd whose folios are to be pinned 3421 * @start: starting memfd offset 3422 * @nr_pages: number of pages from start to pin 3423 * @folios: array that receives pointers to the folios pinned. 3424 * Should be at-least nr_pages long. 3425 * @offsets: array that receives offsets of pages in their folios. 3426 * Should be at-least nr_pages long. 3427 * 3428 * Attempt to pin folios associated with a memfd; given that a memfd is 3429 * either backed by shmem or hugetlb, the folios can either be found in 3430 * the page cache or need to be allocated if necessary. Once the folios 3431 * are located, they are all pinned via FOLL_PIN and the @offsets array 3432 * is populated with offsets of the pages in their respective folios. 3433 * Therefore, for each page the caller requested, there will be a 3434 * corresponding entry in both @folios and @offsets. And, eventually, 3435 * these pinned folios need to be released either using unpin_user_pages() 3436 * or unpin_user_page(). 3437 * 3438 * It must be noted that the folios may be pinned for an indefinite amount 3439 * of time. And, in most cases, the duration of time they may stay pinned 3440 * would be controlled by the userspace. This behavior is effectively the 3441 * same as using FOLL_LONGTERM with other GUP APIs. 3442 * 3443 * Returns number of folios pinned. This would be equal to the number of 3444 * pages requested. If no folios were pinned, it returns -errno. 3445 */ 3446 long memfd_pin_folios(struct file *memfd, unsigned long start, 3447 unsigned long nr_pages, struct folio **folios, 3448 pgoff_t *offsets) 3449 { 3450 unsigned long end = start + (nr_pages << PAGE_SHIFT) - 1; 3451 unsigned int max_pgs, pgoff, pgshift = PAGE_SHIFT; 3452 pgoff_t start_idx, end_idx, next_idx; 3453 unsigned int flags, nr_folios, i, j; 3454 struct folio *folio = NULL; 3455 struct folio_batch fbatch; 3456 struct page **pages; 3457 struct hstate *h; 3458 long ret; 3459 3460 if (!nr_pages) 3461 return -EINVAL; 3462 3463 if (!memfd) 3464 return -EINVAL; 3465 3466 if (!shmem_file(memfd) && !is_file_hugepages(memfd)) 3467 return -EINVAL; 3468 3469 pages = kmalloc_array(nr_pages, sizeof(*pages), GFP_KERNEL); 3470 if (!pages) 3471 return -ENOMEM; 3472 3473 if (is_file_hugepages(memfd)) { 3474 h = hstate_file(memfd); 3475 pgshift = huge_page_shift(h); 3476 } 3477 3478 flags = memalloc_pin_save(); 3479 do { 3480 i = 0; 3481 start_idx = start >> pgshift; 3482 end_idx = end >> pgshift; 3483 if (is_file_hugepages(memfd)) { 3484 start_idx <<= huge_page_order(h); 3485 end_idx <<= huge_page_order(h); 3486 } 3487 3488 folio_batch_init(&fbatch); 3489 while (start_idx <= end_idx) { 3490 /* 3491 * In most cases, we should be able to find the folios 3492 * in the page cache. If we cannot find them for some 3493 * reason, we try to allocate them and add them to the 3494 * page cache. 3495 */ 3496 nr_folios = filemap_get_folios_contig(memfd->f_mapping, 3497 &start_idx, 3498 end_idx, 3499 &fbatch); 3500 if (folio) { 3501 folio_put(folio); 3502 folio = NULL; 3503 } 3504 3505 next_idx = 0; 3506 for (j = 0; j < nr_folios; j++) { 3507 if (next_idx && 3508 next_idx != folio_index(fbatch.folios[j])) 3509 continue; 3510 3511 folio = try_grab_folio(&fbatch.folios[j]->page, 3512 1, FOLL_PIN); 3513 if (!folio) { 3514 folio_batch_release(&fbatch); 3515 kfree(pages); 3516 goto err; 3517 } 3518 3519 max_pgs = folio_nr_pages(folio); 3520 if (i == 0) { 3521 pgoff = offset_in_folio(folio, start); 3522 pgoff >>= PAGE_SHIFT; 3523 } 3524 3525 do { 3526 folios[i] = folio; 3527 offsets[i] = pgoff << PAGE_SHIFT; 3528 pages[i] = folio_page(folio, 0); 3529 folio_add_pin(folio); 3530 3531 pgoff++; 3532 i++; 3533 } while (pgoff < max_pgs && i < nr_pages); 3534 3535 pgoff = 0; 3536 next_idx = folio_next_index(folio); 3537 gup_put_folio(folio, 1, FOLL_PIN); 3538 } 3539 3540 folio = NULL; 3541 folio_batch_release(&fbatch); 3542 if (!nr_folios) { > 3543 folio = memfd_alloc_folio(memfd, start_idx);
Hi David, > > On 12.12.23 08:38, Vivek Kasireddy wrote: > > For drivers that would like to longterm-pin the folios associated > > with a memfd, the memfd_pin_folios() API provides an option to > > not only pin the folios via FOLL_PIN but also to check and migrate > > them if they reside in movable zone or CMA block. This API > > currently works with memfds but it should work with any files > > that belong to either shmemfs or hugetlbfs. Files belonging to > > other filesystems are rejected for now. > > > > The folios need to be located first before pinning them via FOLL_PIN. > > If they are found in the page cache, they can be immediately pinned. > > Otherwise, they need to be allocated using the filesystem specific > > APIs and then pinned. > > > > v2: > > - Drop gup_flags and improve comments and commit message (David) > > - Allocate a page if we cannot find in page cache for the hugetlbfs > > case as well (David) > > - Don't unpin pages if there is a migration related failure (David) > > - Drop the unnecessary nr_pages <= 0 check (Jason) > > - Have the caller of the API pass in file * instead of fd (Jason) > > > > v3: (David) > > - Enclose the huge page allocation code with #ifdef > CONFIG_HUGETLB_PAGE > > (Build error reported by kernel test robot <lkp@intel.com>) > > - Don't forget memalloc_pin_restore() on non-migration related errors > > - Improve the readability of the cleanup code associated with > > non-migration related errors > > - Augment the comments by describing FOLL_LONGTERM like behavior > > - Include the R-b tag from Jason > > > > v4: > > - Remove the local variable "page" and instead use 3 return statements > > in alloc_file_page() (David) > > - Add the R-b tag from David > > > > v5: (David) > > - For hugetlb case, ensure that we only obtain head pages from the > > mapping by using __filemap_get_folio() instead of find_get_page_flags() > > - Handle -EEXIST when two or more potential users try to simultaneously > > add a huge page to the mapping by forcing them to retry on failure > > > > v6: (Christoph) > > - Rename this API to memfd_pin_user_pages() to make it clear that it > > is intended for memfds > > - Move the memfd page allocation helper from gup.c to memfd.c > > - Fix indentation errors in memfd_pin_user_pages() > > - For contiguous ranges of folios, use a helper such as > > filemap_get_folios_contig() to lookup the page cache in batches > > > > v7: > > - Rename this API to memfd_pin_folios() and make it return folios > > and offsets instead of pages (David) > > - Don't continue processing the folios in the batch returned by > > filemap_get_folios_contig() if they do not have correct next_idx > > - Add the R-b tag from Christoph > > > > Sorry, I'm still not happy about the current state, because (1) the > folio vs. pages handling is still mixed (2) we're returning+pinning a > large folio multiple times. I can address (1) in a follow-up series and as far as (2) is concerned, my understanding is that we need to increase the folio's refcount as and when the folio's tail pages are used. Is this not the case? It appears this is what unpin_user_pages() expects as well. Do you see any concern with this? > > See below if there is an easy way to clean this up. > > >> @@ -5,6 +5,7 @@ > > #include <linux/spinlock.h> > > > > #include <linux/mm.h> > > +#include <linux/memfd.h> > > #include <linux/memremap.h> > > #include <linux/pagemap.h> > > #include <linux/rmap.h> > > @@ -17,6 +18,7 @@ > > #include <linux/hugetlb.h> > > #include <linux/migrate.h> > > #include <linux/mm_inline.h> > > +#include <linux/pagevec.h> > > #include <linux/sched/mm.h> > > #include <linux/shmem_fs.h> > > > > @@ -3410,3 +3412,156 @@ long pin_user_pages_unlocked(unsigned long > start, unsigned long nr_pages, > > &locked, gup_flags); > > } > > EXPORT_SYMBOL(pin_user_pages_unlocked); > > + > > +/** > > + * memfd_pin_folios() - pin folios associated with a memfd > > + * @memfd: the memfd whose folios are to be pinned > > + * @start: starting memfd offset > > + * @nr_pages: number of pages from start to pin > > We're not pinning pages. An inclusive range [start, end] would be clearer. Ok, I'll make this change in the next version. > > > + * @folios: array that receives pointers to the folios pinned. > > + * Should be at-least nr_pages long. > > + * @offsets: array that receives offsets of pages in their folios. > > + * Should be at-least nr_pages long. > > See below, I'm wondering if this is really required once we return each folio > only once. The offsets can be calculated by the caller (udmabuf) as well but doing so in this interface would prevent special handling in the caller for the hugetlb case. Please look at patch 5 in this series (udmabuf: Pin the pages using memfd_pin_folios() API (v5)) for more details as to what I mean. > > > + * > > + * Attempt to pin folios associated with a memfd; given that a memfd is > > + * either backed by shmem or hugetlb, the folios can either be found in > > + * the page cache or need to be allocated if necessary. Once the folios > > + * are located, they are all pinned via FOLL_PIN and the @offsets array > > + * is populated with offsets of the pages in their respective folios. > > + * Therefore, for each page the caller requested, there will be a > > + * corresponding entry in both @folios and @offsets. And, eventually, > > + * these pinned folios need to be released either using > unpin_user_pages() > > + * or unpin_user_page(). > > Oh, we don't have a folio function yet? Should be easy to add, and we'd > really > add them. Sure, I'll add them in a follow-up series. > > > + * > > + * It must be noted that the folios may be pinned for an indefinite amount > > + * of time. And, in most cases, the duration of time they may stay pinned > > + * would be controlled by the userspace. This behavior is effectively the > > + * same as using FOLL_LONGTERM with other GUP APIs. > > + * > > + * Returns number of folios pinned. This would be equal to the number of > > + * pages requested. If no folios were pinned, it returns -errno. > > + */ > > +long memfd_pin_folios(struct file *memfd, unsigned long start, > > + unsigned long nr_pages, struct folio **folios, > > + pgoff_t *offsets) > > +{ > > + unsigned long end = start + (nr_pages << PAGE_SHIFT) - 1; > > + unsigned int max_pgs, pgoff, pgshift = PAGE_SHIFT; > > + pgoff_t start_idx, end_idx, next_idx; > > + unsigned int flags, nr_folios, i, j; > > + struct folio *folio = NULL; > > + struct folio_batch fbatch; > > + struct page **pages; > > + struct hstate *h; > > + long ret; > > + > > + if (!nr_pages) > > + return -EINVAL; > > + > > + if (!memfd) > > + return -EINVAL; > > + > > + if (!shmem_file(memfd) && !is_file_hugepages(memfd)) > > + return -EINVAL; > > + > > + pages = kmalloc_array(nr_pages, sizeof(*pages), GFP_KERNEL); > > + if (!pages) > > + return -ENOMEM; > > + > > + if (is_file_hugepages(memfd)) { > > + h = hstate_file(memfd); > > + pgshift = huge_page_shift(h); > > + } > > + > > + flags = memalloc_pin_save(); > > + do { > > + i = 0; > > + start_idx = start >> pgshift; > > + end_idx = end >> pgshift; > > + if (is_file_hugepages(memfd)) { > > + start_idx <<= huge_page_order(h); > > + end_idx <<= huge_page_order(h); > > + } > > + > > + folio_batch_init(&fbatch); > > + while (start_idx <= end_idx) { > > + /* > > + * In most cases, we should be able to find the folios > > + * in the page cache. If we cannot find them for some > > + * reason, we try to allocate them and add them to > the > > + * page cache. > > + */ > > + nr_folios = filemap_get_folios_contig(memfd- > >f_mapping, > > + &start_idx, > > + end_idx, > > + &fbatch); > > + if (folio) { > > + folio_put(folio); > > + folio = NULL; > > + } > > + > > + next_idx = 0; > > + for (j = 0; j < nr_folios; j++) { > > + if (next_idx && > > + next_idx != folio_index(fbatch.folios[j])) > > + continue; > > + > > + folio = try_grab_folio(&fbatch.folios[j]->page, > > + 1, FOLL_PIN); > > + if (!folio) { > > + folio_batch_release(&fbatch); > > + kfree(pages); > > + goto err; > > + } > > + > > + max_pgs = folio_nr_pages(folio); > > + if (i == 0) { > > + pgoff = offset_in_folio(folio, start); > > + pgoff >>= PAGE_SHIFT; > > + } > > + > > + do { > > + folios[i] = folio; > > + offsets[i] = pgoff << PAGE_SHIFT; > > + pages[i] = folio_page(folio, 0); > > + folio_add_pin(folio); > > + > > + pgoff++; > > + i++; > > + } while (pgoff < max_pgs && i < nr_pages); > > + > > + pgoff = 0; > > + next_idx = folio_next_index(folio); > > + gup_put_folio(folio, 1, FOLL_PIN); > > + } > > + > > + folio = NULL; > > + folio_batch_release(&fbatch); > > + if (!nr_folios) { > > + folio = memfd_alloc_folio(memfd, start_idx); > > + if (IS_ERR(folio)) { > > + ret = PTR_ERR(folio); > > + if (ret != -EEXIST) { > > + kfree(pages); > > + goto err; > > + } > > + } > > + } > > + } > > + > > + ret = check_and_migrate_movable_pages(nr_pages, pages); > > Having a folio variant would avoid having to mess with pages here at all. > Further, we're now returning+pinning the same folio multiple times, instead > of > just once like the folio batching variant would. It should be possible to pin the folio only once but I don't see any problem with pinning it multiple times -- once per each subpage used -- as long as it is unpinned correctly the same number of times. Is this not ok? > > I'm wondering if the following wouldn't make more sense, assuming we add > check_and_migrate_movable_folios(), which should be pretty easy to add. > > Obviously untested, just to express what I have in mind: Thank you for taking the time to do this! > > > > /** > * memfd_pin_folios() - pin folios associated with a memfd > * @memfd: the memfd whose folios are to be pinned > * @start: the starting memfd offset > * @end: the final memfd offset (inclusive) > * @folios: array that receives pointers to the folios pinned > * @max_folios: the number of entries in the array for folios > * @offsets: the offset into the first folio Given that my goal is to do the following in udmabuf driver: ret = sg_alloc_table(sg, ubuf->pagecount, GFP_KERNEL); for_each_sg(sg->sgl, sgl, ubuf->pagecount, i) sg_set_folio(sgl, ubuf->folios[i], PAGE_SIZE, ubuf->offsets[i]); ret = dma_map_sgtable(dev, sg, direction, 0); That is, populate a scatterlist with ubuf->pagecount number of entries, where each segment if of size PAGE_SIZE, in order to be consistent and support a wide variety of DMA importers that may not probably handle segments that are larger than PAGE_SIZE. Therefore, in the hugetlb case, there would be multiple entries pointing to the same folio with different offsets. The question really is whether these entries associated with @folios and @offsets would need to be populated by the caller (udmabuf) or the API (memfd_pin_folios). I have tried both of these approaches in the earlier versions and they all work fine but I think populating the entries in memfd_pin_folios() seems to be cleaner as the caller does not need to do any special handling (hugetlb vs shmem). > * > * Attempt to pin folios associated with a memfd; given that a memfd is > * either backed by shmem or hugetlb, the folios can either be found in > * the page cache or need to be allocated if necessary. Once the folios > * are located, they are all pinned via FOLL_PIN and @offset is populated > * with the offset into the first folio. > * > * Pinned folios must be released using unpin_folio() or unpin_folios(). > * > * It must be noted that the folios may be pinned for an indefinite amount > * of time. And, in most cases, the duration of time they may stay pinned > * would be controlled by the userspace. This behavior is effectively the > * same as using FOLL_LONGTERM with other GUP APIs. > * > * Returns number of folios pinned, which might be less than @max_folios > * only if the whole range was pinned. If no folios were pinned, it returns > * -errno. > */ > long memfd_pin_folios(struct file *memfd, unsigned long start, > unsigned long end, struct folio **folios, > unsigned int max_folios, unsigned long *offset) > { > unsigned int pgshift = PAGE_SHIFT; > unsigned int flags, nr_folios, cur_folios, i; > pgoff_t start_idx, end_idx; > struct folio_batch fbatch; > struct folio *folio; > struct hstate *h; > long ret; > > if (start > end || !max_folios) > return -EINVAL; > > if (!memfd) > return -EINVAL; > > if (!shmem_file(memfd) && !is_file_hugepages(memfd)) > return -EINVAL; > > if (is_file_hugepages(memfd)) { > h = hstate_file(memfd); > pgshift = huge_page_shift(h); > } > > flags = memalloc_pin_save(); > folio_batch_init(&fbatch); > do { > nr_folios = 0; > start_idx = start >> pgshift; > end_idx = end >> pgshift; > if (is_file_hugepages(memfd)) { > start_idx <<= huge_page_order(h); > end_idx <<= huge_page_order(h); > } > > while (start_idx <= end_idx) { > /* > * In most cases, we should be able to find the folios > * in the page cache. If we cannot find them for some > * reason, we try to allocate them and add them to > the > * page cache. > */ > folio_batch_release(&fbatch); > cur_folios = filemap_get_folios_contig(memfd- > >f_mapping, > &start_idx, > end_idx, > &fbatch); > if (!cur_folios) { > folio = memfd_alloc_folio(memfd, start_idx); > if (IS_ERR(folio)) { > ret = PTR_ERR(folio); > if (ret != -EEXIST) > goto err; > } > folio_put(folio); > continue; > } > > /* Let's pin each folio, which shouldn't really fail. */ > for (i = 0; i < cur_folios; i++) { > folio = try_grab_folio(&fbatch.folios[i]->page, > 1, FOLL_PIN); > if (!folio) > goto err; > > if (!nr_folios) > *offset = offset_in_folio(folio, start); > folios[nr_folios++] = folio; > > if (max_folios == nr_folios) > break; > } > if (max_folios == nr_folios) > break; > } > folio_batch_release(&fbatch); > > ret = check_and_migrate_movable_folios(nr_folios, folios); > } while (ret == -EAGAIN); > > memalloc_pin_restore(flags); > return ret ? ret : nr_folios; > err: > folio_batch_release(&fbatch); > memalloc_pin_restore(flags); > while (i-- > 0) > if (folios[i]) > gup_put_folio(folios[i], 1, FOLL_PIN); > > return ret; > } > EXPORT_SYMBOL_GPL(memfd_pin_folios); > > > > I'm still wondering about the offset handling, though. Could it happen that > why we are > repeatedly calling filemap_get_folios_contig(), that we would need offset!=0 > on any of > the other folios besides the first one? My current understanding (and looking > at > filemap_get_folios_contig()) is: no. I am not entirely sure but while testing this series with Qemu master + kernel snapshot of drm-tip which is 6.7 RC1, I noticed strange behavior of filemap_get_folios_contig() and the batches it returns particularly for the hugetlb folios. Assuming we have order-9 folios in the memfd (my test-case), and if the range [start, end] cuts across more than one folio: lets say start is at subpage 490 (in folio-0) and end is at subpage 520 (in folio-1), then start_idx would be 0 and end_idx would be 512. In this case, I would have expected filemap_get_folios_contig() to return two entries in the batch that included folio-0 and folio-1. However, it returned a batch with 15 entries (max batch size) with all the entries pointing to folio-0. This is why I added the check: if (next_idx && next_idx != folio_index(fbatch.folios[j])) continue; Anyway, based on the code you wrote, I have realized that we both have a different view on how many entries need to be there in the @folios array for a given range [start, end] in the hugetlb case. I have assumed that it is highly desirable to have a segment length of PAGE_SIZE for consistency and interoperability reasons but I guess it might be ok to do: sg_set_folio(sgl, ubuf->folios[i], nr_tails * PAGE_SIZE, ubuf->offsets[i]); I'll run some experiments to see if this would work in most cases or not. > > I'm primarily concerned about concurrent fallocate(PUNCH_HOLE) and THP > collapse/splitting. Could you please elaborate on what the issue would be in this case? Thanks, Vivek > > -- > Cheers, > > David / dhildenb
On Wed, Dec 13, 2023 at 08:44:51AM +0000, Kasireddy, Vivek wrote: > That is, populate a scatterlist with ubuf->pagecount number of entries, > where each segment if of size PAGE_SIZE, in order to be consistent and > support a wide variety of DMA importers that may not probably handle > segments that are larger than PAGE_SIZE. No! This is totally wrong, sg lists must aggregate up to the limits specified in the struct device. We have importer helpers that do this aggregation. If some driver is working with a sglist and can't handle this it is simply broken. Do not mess up core code to accomodate such things. Jason
Hi, >> Sorry, I'm still not happy about the current state, because (1) the >> folio vs. pages handling is still mixed (2) we're returning+pinning a >> large folio multiple times. > I can address (1) in a follow-up series and as far as (2) is concerned, my > understanding is that we need to increase the folio's refcount as and > when the folio's tail pages are used. Is this not the case? It appears > this is what unpin_user_pages() expects as well. Do you see any > concern with this? If you'd just pin the folio once, you'd also only have to unpin it once. Bu that raises a question: Is it a requirement for the user of this interface, being able to unmap+unpin each individual page? If you really want to handle each subpage possibly individually, then subpage or folio+offset makes more sense, agreed. > >> >> See below if there is an easy way to clean this up. >> >>>> @@ -5,6 +5,7 @@ >>> #include <linux/spinlock.h> >>> >>> #include <linux/mm.h> >>> +#include <linux/memfd.h> >>> #include <linux/memremap.h> >>> #include <linux/pagemap.h> >>> #include <linux/rmap.h> >>> @@ -17,6 +18,7 @@ >>> #include <linux/hugetlb.h> >>> #include <linux/migrate.h> >>> #include <linux/mm_inline.h> >>> +#include <linux/pagevec.h> >>> #include <linux/sched/mm.h> >>> #include <linux/shmem_fs.h> >>> >>> @@ -3410,3 +3412,156 @@ long pin_user_pages_unlocked(unsigned long >> start, unsigned long nr_pages, >>> &locked, gup_flags); >>> } >>> EXPORT_SYMBOL(pin_user_pages_unlocked); >>> + >>> +/** >>> + * memfd_pin_folios() - pin folios associated with a memfd >>> + * @memfd: the memfd whose folios are to be pinned >>> + * @start: starting memfd offset >>> + * @nr_pages: number of pages from start to pin >> >> We're not pinning pages. An inclusive range [start, end] would be clearer. > Ok, I'll make this change in the next version. > >> >>> + * @folios: array that receives pointers to the folios pinned. >>> + * Should be at-least nr_pages long. >>> + * @offsets: array that receives offsets of pages in their folios. >>> + * Should be at-least nr_pages long. >> >> See below, I'm wondering if this is really required once we return each folio >> only once. > The offsets can be calculated by the caller (udmabuf) as well but doing so > in this interface would prevent special handling in the caller for the hugetlb > case. Please look at patch 5 in this series (udmabuf: Pin the pages using > memfd_pin_folios() API (v5)) for more details as to what I mean. > I'll have a look later to be reminded about the target use case :) >> >>> + * >>> + * It must be noted that the folios may be pinned for an indefinite amount >>> + * of time. And, in most cases, the duration of time they may stay pinned >>> + * would be controlled by the userspace. This behavior is effectively the >>> + * same as using FOLL_LONGTERM with other GUP APIs. >>> + * >>> + * Returns number of folios pinned. This would be equal to the number of >>> + * pages requested. If no folios were pinned, it returns -errno. >>> + */ >>> +long memfd_pin_folios(struct file *memfd, unsigned long start, >>> + unsigned long nr_pages, struct folio **folios, >>> + pgoff_t *offsets) >>> +{ >>> + unsigned long end = start + (nr_pages << PAGE_SHIFT) - 1; >>> + unsigned int max_pgs, pgoff, pgshift = PAGE_SHIFT; >>> + pgoff_t start_idx, end_idx, next_idx; >>> + unsigned int flags, nr_folios, i, j; >>> + struct folio *folio = NULL; >>> + struct folio_batch fbatch; >>> + struct page **pages; >>> + struct hstate *h; >>> + long ret; >>> + >>> + if (!nr_pages) >>> + return -EINVAL; >>> + >>> + if (!memfd) >>> + return -EINVAL; >>> + >>> + if (!shmem_file(memfd) && !is_file_hugepages(memfd)) >>> + return -EINVAL; >>> + >>> + pages = kmalloc_array(nr_pages, sizeof(*pages), GFP_KERNEL); >>> + if (!pages) >>> + return -ENOMEM; >>> + >>> + if (is_file_hugepages(memfd)) { >>> + h = hstate_file(memfd); >>> + pgshift = huge_page_shift(h); >>> + } >>> + >>> + flags = memalloc_pin_save(); >>> + do { >>> + i = 0; >>> + start_idx = start >> pgshift; >>> + end_idx = end >> pgshift; >>> + if (is_file_hugepages(memfd)) { >>> + start_idx <<= huge_page_order(h); >>> + end_idx <<= huge_page_order(h); >>> + } >>> + >>> + folio_batch_init(&fbatch); >>> + while (start_idx <= end_idx) { >>> + /* >>> + * In most cases, we should be able to find the folios >>> + * in the page cache. If we cannot find them for some >>> + * reason, we try to allocate them and add them to >> the >>> + * page cache. >>> + */ >>> + nr_folios = filemap_get_folios_contig(memfd- >>> f_mapping, >>> + &start_idx, >>> + end_idx, >>> + &fbatch); >>> + if (folio) { >>> + folio_put(folio); >>> + folio = NULL; >>> + } >>> + >>> + next_idx = 0; >>> + for (j = 0; j < nr_folios; j++) { >>> + if (next_idx && >>> + next_idx != folio_index(fbatch.folios[j])) >>> + continue; >>> + >>> + folio = try_grab_folio(&fbatch.folios[j]->page, >>> + 1, FOLL_PIN); >>> + if (!folio) { >>> + folio_batch_release(&fbatch); >>> + kfree(pages); >>> + goto err; >>> + } >>> + >>> + max_pgs = folio_nr_pages(folio); >>> + if (i == 0) { >>> + pgoff = offset_in_folio(folio, start); >>> + pgoff >>= PAGE_SHIFT; >>> + } >>> + >>> + do { >>> + folios[i] = folio; >>> + offsets[i] = pgoff << PAGE_SHIFT; >>> + pages[i] = folio_page(folio, 0); >>> + folio_add_pin(folio); >>> + >>> + pgoff++; >>> + i++; >>> + } while (pgoff < max_pgs && i < nr_pages); >>> + >>> + pgoff = 0; >>> + next_idx = folio_next_index(folio); >>> + gup_put_folio(folio, 1, FOLL_PIN); >>> + } >>> + >>> + folio = NULL; >>> + folio_batch_release(&fbatch); >>> + if (!nr_folios) { >>> + folio = memfd_alloc_folio(memfd, start_idx); >>> + if (IS_ERR(folio)) { >>> + ret = PTR_ERR(folio); >>> + if (ret != -EEXIST) { >>> + kfree(pages); >>> + goto err; >>> + } >>> + } >>> + } >>> + } >>> + >>> + ret = check_and_migrate_movable_pages(nr_pages, pages); >> >> Having a folio variant would avoid having to mess with pages here at all. >> Further, we're now returning+pinning the same folio multiple times, instead >> of >> just once like the folio batching variant would. > It should be possible to pin the folio only once but I don't see any problem with > pinning it multiple times -- once per each subpage used -- as long as it is unpinned > correctly the same number of times. Is this not ok? You can, but that partially avoids the benefit of using folios? Instead of "large folio + offset" you have "folio+offset1, folio+offset2 ..." essentially for each subpage. But again, maybe that really is required for the target use case. It's not necessarily wrong to do that, but staring just at the interface it's the opposite of what other folio-handling functions like batching do. > >> >> I'm wondering if the following wouldn't make more sense, assuming we add >> check_and_migrate_movable_folios(), which should be pretty easy to add. >> >> Obviously untested, just to express what I have in mind: > Thank you for taking the time to do this! > >> >> >> >> /** >> * memfd_pin_folios() - pin folios associated with a memfd >> * @memfd: the memfd whose folios are to be pinned >> * @start: the starting memfd offset >> * @end: the final memfd offset (inclusive) >> * @folios: array that receives pointers to the folios pinned >> * @max_folios: the number of entries in the array for folios >> * @offsets: the offset into the first folio > Given that my goal is to do the following in udmabuf driver: > ret = sg_alloc_table(sg, ubuf->pagecount, GFP_KERNEL); > for_each_sg(sg->sgl, sgl, ubuf->pagecount, i) > sg_set_folio(sgl, ubuf->folios[i], PAGE_SIZE, ubuf->offsets[i]); > > ret = dma_map_sgtable(dev, sg, direction, 0); > > That is, populate a scatterlist with ubuf->pagecount number of entries, > where each segment if of size PAGE_SIZE, in order to be consistent and > support a wide variety of DMA importers that may not probably handle > segments that are larger than PAGE_SIZE. > > Therefore, in the hugetlb case, there would be multiple entries pointing to > the same folio with different offsets. The question really is whether these > entries associated with @folios and @offsets would need to be populated > by the caller (udmabuf) or the API (memfd_pin_folios). I have tried both of > these approaches in the earlier versions and they all work fine but I think > populating the entries in memfd_pin_folios() seems to be cleaner as the > caller does not need to do any special handling (hugetlb vs shmem). > >> * >> * Attempt to pin folios associated with a memfd; given that a memfd is >> * either backed by shmem or hugetlb, the folios can either be found in >> * the page cache or need to be allocated if necessary. Once the folios >> * are located, they are all pinned via FOLL_PIN and @offset is populated >> * with the offset into the first folio. >> * >> * Pinned folios must be released using unpin_folio() or unpin_folios(). >> * >> * It must be noted that the folios may be pinned for an indefinite amount >> * of time. And, in most cases, the duration of time they may stay pinned >> * would be controlled by the userspace. This behavior is effectively the >> * same as using FOLL_LONGTERM with other GUP APIs. >> * >> * Returns number of folios pinned, which might be less than @max_folios >> * only if the whole range was pinned. If no folios were pinned, it returns >> * -errno. >> */ >> long memfd_pin_folios(struct file *memfd, unsigned long start, >> unsigned long end, struct folio **folios, >> unsigned int max_folios, unsigned long *offset) >> { >> unsigned int pgshift = PAGE_SHIFT; >> unsigned int flags, nr_folios, cur_folios, i; >> pgoff_t start_idx, end_idx; >> struct folio_batch fbatch; >> struct folio *folio; >> struct hstate *h; >> long ret; >> >> if (start > end || !max_folios) >> return -EINVAL; >> >> if (!memfd) >> return -EINVAL; >> >> if (!shmem_file(memfd) && !is_file_hugepages(memfd)) >> return -EINVAL; >> >> if (is_file_hugepages(memfd)) { >> h = hstate_file(memfd); >> pgshift = huge_page_shift(h); >> } >> >> flags = memalloc_pin_save(); >> folio_batch_init(&fbatch); >> do { >> nr_folios = 0; >> start_idx = start >> pgshift; >> end_idx = end >> pgshift; >> if (is_file_hugepages(memfd)) { >> start_idx <<= huge_page_order(h); >> end_idx <<= huge_page_order(h); >> } >> >> while (start_idx <= end_idx) { >> /* >> * In most cases, we should be able to find the folios >> * in the page cache. If we cannot find them for some >> * reason, we try to allocate them and add them to >> the >> * page cache. >> */ >> folio_batch_release(&fbatch); >> cur_folios = filemap_get_folios_contig(memfd- >>> f_mapping, >> &start_idx, >> end_idx, >> &fbatch); >> if (!cur_folios) { >> folio = memfd_alloc_folio(memfd, start_idx); >> if (IS_ERR(folio)) { >> ret = PTR_ERR(folio); >> if (ret != -EEXIST) >> goto err; >> } >> folio_put(folio); >> continue; >> } >> >> /* Let's pin each folio, which shouldn't really fail. */ >> for (i = 0; i < cur_folios; i++) { >> folio = try_grab_folio(&fbatch.folios[i]->page, >> 1, FOLL_PIN); >> if (!folio) >> goto err; >> >> if (!nr_folios) >> *offset = offset_in_folio(folio, start); >> folios[nr_folios++] = folio; >> >> if (max_folios == nr_folios) >> break; >> } >> if (max_folios == nr_folios) >> break; >> } >> folio_batch_release(&fbatch); >> >> ret = check_and_migrate_movable_folios(nr_folios, folios); >> } while (ret == -EAGAIN); >> >> memalloc_pin_restore(flags); >> return ret ? ret : nr_folios; >> err: >> folio_batch_release(&fbatch); >> memalloc_pin_restore(flags); >> while (i-- > 0) >> if (folios[i]) >> gup_put_folio(folios[i], 1, FOLL_PIN); >> >> return ret; >> } >> EXPORT_SYMBOL_GPL(memfd_pin_folios); >> >> >> >> I'm still wondering about the offset handling, though. Could it happen that >> why we are >> repeatedly calling filemap_get_folios_contig(), that we would need offset!=0 >> on any of >> the other folios besides the first one? My current understanding (and looking >> at >> filemap_get_folios_contig()) is: no. > I am not entirely sure but while testing this series with Qemu master + kernel > snapshot of drm-tip which is 6.7 RC1, I noticed strange behavior of > filemap_get_folios_contig() and the batches it returns particularly for the > hugetlb folios. Assuming we have order-9 folios in the memfd (my test-case), > and if the range [start, end] cuts across more than one folio: lets say start is > at subpage 490 (in folio-0) and end is at subpage 520 (in folio-1), then start_idx > would be 0 and end_idx would be 512. In this case, I would have expected That is weird. Shouldn't you get start_idx = 0 and end_idx = 1 with hugetlb, where the idx differs ? Maybe that's the problem. > filemap_get_folios_contig() to return two entries in the batch that included > folio-0 and folio-1. However, it returned a batch with 15 entries (max batch size) > with all the entries pointing to folio-0. This is why I added the check: > if (next_idx && > next_idx != folio_index(fbatch.folios[j])) > continue; > > Anyway, based on the code you wrote, I have realized that we both have a > different view on how many entries need to be there in the @folios array > for a given range [start, end] in the hugetlb case. Oh, yes, ideally the interface should behave the same for hugetlb and shmem. > > I have assumed that it is highly desirable to have a segment length of > PAGE_SIZE for consistency and interoperability reasons but I guess it might > be ok to do: > sg_set_folio(sgl, ubuf->folios[i], nr_tails * PAGE_SIZE, ubuf->offsets[i]); > > I'll run some experiments to see if this would work in most cases or not. > >> >> I'm primarily concerned about concurrent fallocate(PUNCH_HOLE) and THP >> collapse/splitting. > Could you please elaborate on what the issue would be in this case? I'm not sure if this can happen, but assume the following (shouldn't happen as long as shmem does not support 1m folios): Assume the file looks like this: [ 1m ][ 512k ] ^0 ^256 ^384 Assume we call filemap_get_folios_contig() and get back the first folio and get start_idx=256 Then, someone fallocate(PUNCH_HOLE) the whole range and re-populates the whole range with a 2m folio. [ 2m ] ^0 ^256 ^384 if we call filemap_get_folios_contig() with 256, we get another "large folio with offset". Of course, we can detect that, and simply fail/retry. Just wondering if that could happen.
On Wed, Dec 13, 2023 at 08:31:55AM -0400, Jason Gunthorpe wrote: > > That is, populate a scatterlist with ubuf->pagecount number of entries, > > where each segment if of size PAGE_SIZE, in order to be consistent and > > support a wide variety of DMA importers that may not probably handle > > segments that are larger than PAGE_SIZE. > > No! This is totally wrong, sg lists must aggregate up to the limits > specified in the struct device. We have importer helpers that do this > aggregation. > > If some driver is working with a sglist and can't handle this it is > simply broken. Do not mess up core code to accomodate such things. Well.. There's no single driver that is broken, it's more the whole dmabuf philosophy that wants things to be mappable by multiple devices without knowing their limits beforehand. So you'll get this cargo culting.
On Wed, Dec 13, 2023 at 04:36:34PM +0100, Christoph Hellwig wrote: > On Wed, Dec 13, 2023 at 08:31:55AM -0400, Jason Gunthorpe wrote: > > > That is, populate a scatterlist with ubuf->pagecount number of entries, > > > where each segment if of size PAGE_SIZE, in order to be consistent and > > > support a wide variety of DMA importers that may not probably handle > > > segments that are larger than PAGE_SIZE. > > > > No! This is totally wrong, sg lists must aggregate up to the limits > > specified in the struct device. We have importer helpers that do this > > aggregation. > > > > If some driver is working with a sglist and can't handle this it is > > simply broken. Do not mess up core code to accomodate such things. > > Well.. There's no single driver that is broken, it's more the whole > dmabuf philosophy that wants things to be mappable by multiple devices > without knowing their limits beforehand. So you'll get this cargo > culting. It is not so bad, the API has the importer pass a struct device to the exporter that can be used in the usual way to shape the sg list. But really, I think in most cases importers don't strictly need the sg list to be a certain configuration, it is just a combination of lazy driver writers and a lack of common helpers to iterate over the sg list in the way they need. RDMA has done this right, but for it to work efficiently the exporter *must* aggregate all contiguous memory into a single sg element otherwise you loose the HW's large page support. Jason
diff --git a/include/linux/memfd.h b/include/linux/memfd.h index e7abf6fa4c52..b38fb35f4271 100644 --- a/include/linux/memfd.h +++ b/include/linux/memfd.h @@ -6,11 +6,16 @@ #ifdef CONFIG_MEMFD_CREATE extern long memfd_fcntl(struct file *file, unsigned int cmd, unsigned int arg); +extern struct folio *memfd_alloc_folio(struct file *memfd, pgoff_t idx); #else static inline long memfd_fcntl(struct file *f, unsigned int c, unsigned int a) { return -EINVAL; } +static inline struct page *memfd_alloc_folio(struct file *memfd, pgoff_t idx) +{ + return ERR_PTR(-EINVAL); +} #endif #endif /* __LINUX_MEMFD_H */ diff --git a/include/linux/mm.h b/include/linux/mm.h index 418d26608ece..537f40989837 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2472,6 +2472,9 @@ long get_user_pages_unlocked(unsigned long start, unsigned long nr_pages, struct page **pages, unsigned int gup_flags); long pin_user_pages_unlocked(unsigned long start, unsigned long nr_pages, struct page **pages, unsigned int gup_flags); +long memfd_pin_folios(struct file *file, unsigned long start, + unsigned long nr_pages, struct folio **folios, + pgoff_t *offsets); int get_user_pages_fast(unsigned long start, int nr_pages, unsigned int gup_flags, struct page **pages); diff --git a/mm/gup.c b/mm/gup.c index 231711efa390..e798cdbc6a11 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -5,6 +5,7 @@ #include <linux/spinlock.h> #include <linux/mm.h> +#include <linux/memfd.h> #include <linux/memremap.h> #include <linux/pagemap.h> #include <linux/rmap.h> @@ -17,6 +18,7 @@ #include <linux/hugetlb.h> #include <linux/migrate.h> #include <linux/mm_inline.h> +#include <linux/pagevec.h> #include <linux/sched/mm.h> #include <linux/shmem_fs.h> @@ -3410,3 +3412,156 @@ long pin_user_pages_unlocked(unsigned long start, unsigned long nr_pages, &locked, gup_flags); } EXPORT_SYMBOL(pin_user_pages_unlocked); + +/** + * memfd_pin_folios() - pin folios associated with a memfd + * @memfd: the memfd whose folios are to be pinned + * @start: starting memfd offset + * @nr_pages: number of pages from start to pin + * @folios: array that receives pointers to the folios pinned. + * Should be at-least nr_pages long. + * @offsets: array that receives offsets of pages in their folios. + * Should be at-least nr_pages long. + * + * Attempt to pin folios associated with a memfd; given that a memfd is + * either backed by shmem or hugetlb, the folios can either be found in + * the page cache or need to be allocated if necessary. Once the folios + * are located, they are all pinned via FOLL_PIN and the @offsets array + * is populated with offsets of the pages in their respective folios. + * Therefore, for each page the caller requested, there will be a + * corresponding entry in both @folios and @offsets. And, eventually, + * these pinned folios need to be released either using unpin_user_pages() + * or unpin_user_page(). + * + * It must be noted that the folios may be pinned for an indefinite amount + * of time. And, in most cases, the duration of time they may stay pinned + * would be controlled by the userspace. This behavior is effectively the + * same as using FOLL_LONGTERM with other GUP APIs. + * + * Returns number of folios pinned. This would be equal to the number of + * pages requested. If no folios were pinned, it returns -errno. + */ +long memfd_pin_folios(struct file *memfd, unsigned long start, + unsigned long nr_pages, struct folio **folios, + pgoff_t *offsets) +{ + unsigned long end = start + (nr_pages << PAGE_SHIFT) - 1; + unsigned int max_pgs, pgoff, pgshift = PAGE_SHIFT; + pgoff_t start_idx, end_idx, next_idx; + unsigned int flags, nr_folios, i, j; + struct folio *folio = NULL; + struct folio_batch fbatch; + struct page **pages; + struct hstate *h; + long ret; + + if (!nr_pages) + return -EINVAL; + + if (!memfd) + return -EINVAL; + + if (!shmem_file(memfd) && !is_file_hugepages(memfd)) + return -EINVAL; + + pages = kmalloc_array(nr_pages, sizeof(*pages), GFP_KERNEL); + if (!pages) + return -ENOMEM; + + if (is_file_hugepages(memfd)) { + h = hstate_file(memfd); + pgshift = huge_page_shift(h); + } + + flags = memalloc_pin_save(); + do { + i = 0; + start_idx = start >> pgshift; + end_idx = end >> pgshift; + if (is_file_hugepages(memfd)) { + start_idx <<= huge_page_order(h); + end_idx <<= huge_page_order(h); + } + + folio_batch_init(&fbatch); + while (start_idx <= end_idx) { + /* + * In most cases, we should be able to find the folios + * in the page cache. If we cannot find them for some + * reason, we try to allocate them and add them to the + * page cache. + */ + nr_folios = filemap_get_folios_contig(memfd->f_mapping, + &start_idx, + end_idx, + &fbatch); + if (folio) { + folio_put(folio); + folio = NULL; + } + + next_idx = 0; + for (j = 0; j < nr_folios; j++) { + if (next_idx && + next_idx != folio_index(fbatch.folios[j])) + continue; + + folio = try_grab_folio(&fbatch.folios[j]->page, + 1, FOLL_PIN); + if (!folio) { + folio_batch_release(&fbatch); + kfree(pages); + goto err; + } + + max_pgs = folio_nr_pages(folio); + if (i == 0) { + pgoff = offset_in_folio(folio, start); + pgoff >>= PAGE_SHIFT; + } + + do { + folios[i] = folio; + offsets[i] = pgoff << PAGE_SHIFT; + pages[i] = folio_page(folio, 0); + folio_add_pin(folio); + + pgoff++; + i++; + } while (pgoff < max_pgs && i < nr_pages); + + pgoff = 0; + next_idx = folio_next_index(folio); + gup_put_folio(folio, 1, FOLL_PIN); + } + + folio = NULL; + folio_batch_release(&fbatch); + if (!nr_folios) { + folio = memfd_alloc_folio(memfd, start_idx); + if (IS_ERR(folio)) { + ret = PTR_ERR(folio); + if (ret != -EEXIST) { + kfree(pages); + goto err; + } + } + } + } + + ret = check_and_migrate_movable_pages(nr_pages, pages); + } while (ret == -EAGAIN); + + kfree(pages); + memalloc_pin_restore(flags); + return ret ? ret : nr_pages; +err: + memalloc_pin_restore(flags); + while (i-- > 0) + if (folios[i]) + gup_put_folio(folios[i], 1, FOLL_PIN); + + return ret; +} +EXPORT_SYMBOL_GPL(memfd_pin_folios); + diff --git a/mm/memfd.c b/mm/memfd.c index d3a1ba4208c9..36a75e8249f8 100644 --- a/mm/memfd.c +++ b/mm/memfd.c @@ -63,6 +63,40 @@ static void memfd_tag_pins(struct xa_state *xas) xas_unlock_irq(xas); } +/* + * This is a helper function used by memfd_pin_user_pages() in GUP (gup.c). + * It is mainly called to allocate a page in a memfd when the caller + * (memfd_pin_user_pages()) cannot find a page in the page cache at a given + * index in the mapping. + */ +struct folio *memfd_alloc_folio(struct file *memfd, pgoff_t idx) +{ +#ifdef CONFIG_HUGETLB_PAGE + struct folio *folio; + int err; + + if (is_file_hugepages(memfd)) { + folio = alloc_hugetlb_folio_nodemask(hstate_file(memfd), + NUMA_NO_NODE, + NULL, + GFP_USER); + if (folio && folio_try_get(folio)) { + err = hugetlb_add_to_page_cache(folio, + memfd->f_mapping, + idx); + if (err) { + folio_put(folio); + free_huge_folio(folio); + return ERR_PTR(err); + } + return folio; + } + return ERR_PTR(-ENOMEM); + } +#endif + return shmem_read_folio(memfd->f_mapping, idx); +} + /* * Setting SEAL_WRITE requires us to verify there's no pending writer. However, * via get_user_pages(), drivers might have some pending I/O without any active