diff mbox series

[2/4] mm: mempool: introduce page bulk allocator

Message ID 20221005180341.1738796-3-shy828301@gmail.com (mailing list archive)
State New
Headers show
Series Introduce mempool pages bulk allocator the use it in dm-crypt | expand

Commit Message

Yang Shi Oct. 5, 2022, 6:03 p.m. UTC
Since v5.13 the page bulk allocator was introduced to allocate order-0
pages in bulk.  There are a few mempool allocator callers which does
order-0 page allocation in a loop, for example, dm-crypt, f2fs compress,
etc.  A mempool page bulk allocator seems useful.  So introduce the
mempool page bulk allocator.

It introduces the below APIs:
  - mempool_init_pages_bulk()
  - mempool_create_pages_bulk()
They initialize the mempool for page bulk allocator.  The pool is filled
by alloc_page() in a loop.

  - mempool_alloc_pages_bulk_list()
  - mempool_alloc_pages_bulk_array()
They do bulk allocation from mempool.
They do the below conceptually:
  1. Call bulk page allocator
  2. If the allocation is fulfilled then return otherwise try to
     allocate the remaining pages from the mempool
  3. If it is fulfilled then return otherwise retry from #1 with sleepable
     gfp
  4. If it is still failed, sleep for a while to wait for the mempool is
     refilled, then retry from #1
The populated pages will stay on the list or array until the callers
consume them or free them.
Since mempool allocator is guaranteed to success in the sleepable context,
so the two APIs return true for success or false for fail.  It is the
caller's responsibility to handle failure case (partial allocation), just
like the page bulk allocator.

The mempool typically is an object agnostic allocator, but bulk allocation
is only supported by pages, so the mempool bulk allocator is for page
allocation only as well.

Signed-off-by: Yang Shi <shy828301@gmail.com>
---
 include/linux/mempool.h |  19 ++++
 mm/mempool.c            | 188 +++++++++++++++++++++++++++++++++++++---
 2 files changed, 197 insertions(+), 10 deletions(-)

Comments

kernel test robot Oct. 5, 2022, 7:35 p.m. UTC | #1
Hi Yang,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on device-mapper-dm/for-next]
[also build test WARNING on linus/master v6.0 next-20221005]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Yang-Shi/Introduce-mempool-pages-bulk-allocator-the-use-it-in-dm-crypt/20221006-020438
base:   https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git for-next
config: alpha-defconfig
compiler: alpha-linux-gcc (GCC) 12.1.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/intel-lab-lkp/linux/commit/439333bd9ab3e6ecd88ff78224ae727a485854d8
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Yang-Shi/Introduce-mempool-pages-bulk-allocator-the-use-it-in-dm-crypt/20221006-020438
        git checkout 439333bd9ab3e6ecd88ff78224ae727a485854d8
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=alpha SHELL=/bin/bash drivers/scsi/

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

   In file included from include/linux/bio.h:8,
                    from drivers/scsi/scsi_lib.c:12:
>> include/linux/mempool.h:19:48: warning: 'struct page' declared inside parameter list will not be visible outside of this definition or declaration
      19 |                                         struct page **page_array);
         |                                                ^~~~
   include/linux/mempool.h:68:51: warning: 'struct page' declared inside parameter list will not be visible outside of this definition or declaration
      68 |                                            struct page **page_array);
         |                                                   ^~~~


vim +19 include/linux/mempool.h

    15	
    16	typedef unsigned int (mempool_alloc_pages_bulk_t)(gfp_t gfp_mask,
    17						unsigned int nr, void *pool_data,
    18						struct list_head *page_list,
  > 19						struct page **page_array);
    20
Brian Foster Oct. 6, 2022, 2:47 p.m. UTC | #2
On Wed, Oct 05, 2022 at 11:03:39AM -0700, Yang Shi wrote:
> Since v5.13 the page bulk allocator was introduced to allocate order-0
> pages in bulk.  There are a few mempool allocator callers which does
> order-0 page allocation in a loop, for example, dm-crypt, f2fs compress,
> etc.  A mempool page bulk allocator seems useful.  So introduce the
> mempool page bulk allocator.
> 
> It introduces the below APIs:
>   - mempool_init_pages_bulk()
>   - mempool_create_pages_bulk()
> They initialize the mempool for page bulk allocator.  The pool is filled
> by alloc_page() in a loop.
> 
>   - mempool_alloc_pages_bulk_list()
>   - mempool_alloc_pages_bulk_array()
> They do bulk allocation from mempool.
> They do the below conceptually:
>   1. Call bulk page allocator
>   2. If the allocation is fulfilled then return otherwise try to
>      allocate the remaining pages from the mempool
>   3. If it is fulfilled then return otherwise retry from #1 with sleepable
>      gfp
>   4. If it is still failed, sleep for a while to wait for the mempool is
>      refilled, then retry from #1
> The populated pages will stay on the list or array until the callers
> consume them or free them.
> Since mempool allocator is guaranteed to success in the sleepable context,
> so the two APIs return true for success or false for fail.  It is the
> caller's responsibility to handle failure case (partial allocation), just
> like the page bulk allocator.
> 
> The mempool typically is an object agnostic allocator, but bulk allocation
> is only supported by pages, so the mempool bulk allocator is for page
> allocation only as well.
> 
> Signed-off-by: Yang Shi <shy828301@gmail.com>
> ---

Hi Yang,

I'm not terribly familiar with either component so I'm probably missing
context/details, but just a couple high level thoughts when reading your
patches...

>  include/linux/mempool.h |  19 ++++
>  mm/mempool.c            | 188 +++++++++++++++++++++++++++++++++++++---
>  2 files changed, 197 insertions(+), 10 deletions(-)
> 
...
> diff --git a/mm/mempool.c b/mm/mempool.c
> index ba32151f3843..7711ca2e6d66 100644
> --- a/mm/mempool.c
> +++ b/mm/mempool.c
> @@ -177,6 +177,7 @@ void mempool_destroy(mempool_t *pool)
>  EXPORT_SYMBOL(mempool_destroy);
>  
>  static inline int __mempool_init(mempool_t *pool, int min_nr,
> +				 mempool_alloc_pages_bulk_t *alloc_pages_bulk_fn,
>  				 mempool_alloc_t *alloc_fn,
>  				 mempool_free_t *free_fn, void *pool_data,
>  				 gfp_t gfp_mask, int node_id)
> @@ -186,8 +187,11 @@ static inline int __mempool_init(mempool_t *pool, int min_nr,
>  	pool->pool_data = pool_data;
>  	pool->alloc	= alloc_fn;
>  	pool->free	= free_fn;
> +	pool->alloc_pages_bulk = alloc_pages_bulk_fn;
>  	init_waitqueue_head(&pool->wait);
>  
> +	WARN_ON_ONCE(alloc_pages_bulk_fn && alloc_fn);
> +
>  	pool->elements = kmalloc_array_node(min_nr, sizeof(void *),
>  					    gfp_mask, node_id);
>  	if (!pool->elements)
> @@ -199,7 +203,10 @@ static inline int __mempool_init(mempool_t *pool, int min_nr,
>  	while (pool->curr_nr < pool->min_nr) {
>  		void *element;
>  
> -		element = pool->alloc(gfp_mask, pool->pool_data);
> +		if (pool->alloc_pages_bulk)
> +			element = alloc_page(gfp_mask);

Any reason to not use the callback from the caller for the bulk variant
here? It looks like some users might expect consistency between the
alloc / free callbacks for the pool. I.e., I'm not familiar with
dm-crypt, but the code modified in the subsequent patches looks like it
keeps an allocated page count. Will that still work with this, assuming
these pages are freed through free_fn?

> +		else
> +			element = pool->alloc(gfp_mask, pool->pool_data);
>  		if (unlikely(!element)) {
>  			mempool_exit(pool);
>  			return -ENOMEM;
...
> @@ -457,6 +499,132 @@ void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
>  }
>  EXPORT_SYMBOL(mempool_alloc);
>  
> +/**
> + * mempool_alloc_pages_bulk - allocate a bulk of pagesfrom a specific
> + *                           memory pool
> + * @pool:       pointer to the memory pool which was allocated via
> + *              mempool_create().
> + * @gfp_mask:   the usual allocation bitmask.
> + * @nr:         the number of requested pages.
> + * @page_list:  the list the pages will be added to.
> + * @page_array: the array the pages will be added to.
> + *
> + * this function only sleeps if the alloc_pages_bulk_fn() function sleeps
> + * or the allocation can not be satisfied even though the mempool is depleted.
> + * Note that due to preallocation, this function *never* fails when called
> + * from process contexts. (it might fail if called from an IRQ context.)
> + * Note: using __GFP_ZERO is not supported.  And the caller should not pass
> + * in both valid page_list and page_array.
> + *
> + * Return: true when nr pages are allocated or false if not.  It is the
> + *         caller's responsibility to free the partial allocated pages.
> + */
> +static bool mempool_alloc_pages_bulk(mempool_t *pool, gfp_t gfp_mask,
> +				     unsigned int nr,
> +				     struct list_head *page_list,
> +				     struct page **page_array)
> +{
> +	unsigned long flags;
> +	wait_queue_entry_t wait;
> +	gfp_t gfp_temp;
> +	int i;
> +	unsigned int ret, nr_remaining;
> +	struct page *page;
> +

This looks like a lot of duplicate boilerplate from mempool_alloc().
Could this instead do something like: rename the former to
__mempool_alloc() and add a count parameter, implement bulk alloc
support in there for count > 1, then let traditional (i.e., non-bulk)
mempool_alloc() callers pass a count of 1?

Along the same lines, I also wonder if there's any value in generic bulk
alloc support for mempool. For example, I suppose technically this could
be implemented via one change to support a pool->alloc_bulk() callback
that any user could implement via a loop if they wanted
mempool_alloc_bulk() support backed by a preallocated pool. The page
based user could then just use that to call alloc_pages_bulk_*() as an
optimization without the mempool layer needing to know or care about
whether the underlying elements are pages or not. Hm?

Brian

> +	VM_WARN_ON_ONCE(gfp_mask & __GFP_ZERO);
> +	might_alloc(gfp_mask);
> +
> +	gfp_mask |= __GFP_NOMEMALLOC;   /* don't allocate emergency reserves */
> +	gfp_mask |= __GFP_NORETRY;      /* don't loop in __alloc_pages */
> +	gfp_mask |= __GFP_NOWARN;       /* failures are OK */
> +
> +	gfp_temp = gfp_mask & ~(__GFP_DIRECT_RECLAIM|__GFP_IO);
> +
> +repeat_alloc:
> +	i = 0;
> +	ret = pool->alloc_pages_bulk(gfp_temp, nr, pool->pool_data, page_list,
> +				     page_array);
> +
> +	if (ret == nr)
> +		return true;
> +
> +	nr_remaining = nr - ret;
> +
> +	spin_lock_irqsave(&pool->lock, flags);
> +	/* Allocate page from the pool and add to the list or array */
> +	while (pool->curr_nr && (nr_remaining > 0)) {
> +		page = remove_element(pool);
> +		spin_unlock_irqrestore(&pool->lock, flags);
> +		smp_wmb();
> +
> +		kmemleak_update_trace((void *)page);
> +
> +		if (page_list)
> +			list_add(&page->lru, page_list);
> +		else
> +			page_array[ret + i] = page;
> +
> +		i++;
> +		nr_remaining--;
> +
> +		spin_lock_irqsave(&pool->lock, flags);
> +	}
> +
> +	spin_unlock_irqrestore(&pool->lock, flags);
> +
> +	if (!nr_remaining)
> +		return true;
> +
> +	/*
> +	 * The bulk allocator counts in the populated pages for array,
> +	 * but don't do it for list.
> +	 */
> +	if (page_list)
> +		nr = nr_remaining;
> +
> +	/*
> +	 * We use gfp mask w/o direct reclaim or IO for the first round.  If
> +	 * alloc failed with that and @pool was empty, retry immediately.
> +	 */
> +	if (gfp_temp != gfp_mask) {
> +		gfp_temp = gfp_mask;
> +		goto repeat_alloc;
> +	}
> +
> +	/* We must not sleep if !__GFP_DIRECT_RECLAIM */
> +	if (!(gfp_mask & __GFP_DIRECT_RECLAIM))
> +		return false;
> +
> +	/* Let's wait for someone else to return an element to @pool */
> +	init_wait(&wait);
> +	prepare_to_wait(&pool->wait, &wait, TASK_UNINTERRUPTIBLE);
> +
> +	/*
> +	 * FIXME: this should be io_schedule().  The timeout is there as a
> +	 * workaround for some DM problems in 2.6.18.
> +	 */
> +	io_schedule_timeout(5*HZ);
> +
> +	finish_wait(&pool->wait, &wait);
> +	goto repeat_alloc;
> +}
> +
> +bool mempool_alloc_pages_bulk_list(mempool_t *pool, gfp_t gfp_mask,
> +				   unsigned int nr,
> +				   struct list_head *page_list)
> +{
> +	return mempool_alloc_pages_bulk(pool, gfp_mask, nr, page_list, NULL);
> +}
> +EXPORT_SYMBOL(mempool_alloc_pages_bulk_list);
> +
> +bool mempool_alloc_pages_bulk_array(mempool_t *pool, gfp_t gfp_mask,
> +				    unsigned int nr,
> +				    struct page **page_array)
> +{
> +	return mempool_alloc_pages_bulk(pool, gfp_mask, nr, NULL, page_array);
> +}
> +EXPORT_SYMBOL(mempool_alloc_pages_bulk_array);
> +
>  /**
>   * mempool_free - return an element to the pool.
>   * @element:   pool element pointer.
> -- 
> 2.26.3
>
Yang Shi Oct. 6, 2022, 6:43 p.m. UTC | #3
On Thu, Oct 6, 2022 at 7:47 AM Brian Foster <bfoster@redhat.com> wrote:
>
> On Wed, Oct 05, 2022 at 11:03:39AM -0700, Yang Shi wrote:
> > Since v5.13 the page bulk allocator was introduced to allocate order-0
> > pages in bulk.  There are a few mempool allocator callers which does
> > order-0 page allocation in a loop, for example, dm-crypt, f2fs compress,
> > etc.  A mempool page bulk allocator seems useful.  So introduce the
> > mempool page bulk allocator.
> >
> > It introduces the below APIs:
> >   - mempool_init_pages_bulk()
> >   - mempool_create_pages_bulk()
> > They initialize the mempool for page bulk allocator.  The pool is filled
> > by alloc_page() in a loop.
> >
> >   - mempool_alloc_pages_bulk_list()
> >   - mempool_alloc_pages_bulk_array()
> > They do bulk allocation from mempool.
> > They do the below conceptually:
> >   1. Call bulk page allocator
> >   2. If the allocation is fulfilled then return otherwise try to
> >      allocate the remaining pages from the mempool
> >   3. If it is fulfilled then return otherwise retry from #1 with sleepable
> >      gfp
> >   4. If it is still failed, sleep for a while to wait for the mempool is
> >      refilled, then retry from #1
> > The populated pages will stay on the list or array until the callers
> > consume them or free them.
> > Since mempool allocator is guaranteed to success in the sleepable context,
> > so the two APIs return true for success or false for fail.  It is the
> > caller's responsibility to handle failure case (partial allocation), just
> > like the page bulk allocator.
> >
> > The mempool typically is an object agnostic allocator, but bulk allocation
> > is only supported by pages, so the mempool bulk allocator is for page
> > allocation only as well.
> >
> > Signed-off-by: Yang Shi <shy828301@gmail.com>
> > ---
>
> Hi Yang,
>
> I'm not terribly familiar with either component so I'm probably missing
> context/details, but just a couple high level thoughts when reading your
> patches...
>
> >  include/linux/mempool.h |  19 ++++
> >  mm/mempool.c            | 188 +++++++++++++++++++++++++++++++++++++---
> >  2 files changed, 197 insertions(+), 10 deletions(-)
> >
> ...
> > diff --git a/mm/mempool.c b/mm/mempool.c
> > index ba32151f3843..7711ca2e6d66 100644
> > --- a/mm/mempool.c
> > +++ b/mm/mempool.c
> > @@ -177,6 +177,7 @@ void mempool_destroy(mempool_t *pool)
> >  EXPORT_SYMBOL(mempool_destroy);
> >
> >  static inline int __mempool_init(mempool_t *pool, int min_nr,
> > +                              mempool_alloc_pages_bulk_t *alloc_pages_bulk_fn,
> >                                mempool_alloc_t *alloc_fn,
> >                                mempool_free_t *free_fn, void *pool_data,
> >                                gfp_t gfp_mask, int node_id)
> > @@ -186,8 +187,11 @@ static inline int __mempool_init(mempool_t *pool, int min_nr,
> >       pool->pool_data = pool_data;
> >       pool->alloc     = alloc_fn;
> >       pool->free      = free_fn;
> > +     pool->alloc_pages_bulk = alloc_pages_bulk_fn;
> >       init_waitqueue_head(&pool->wait);
> >
> > +     WARN_ON_ONCE(alloc_pages_bulk_fn && alloc_fn);
> > +
> >       pool->elements = kmalloc_array_node(min_nr, sizeof(void *),
> >                                           gfp_mask, node_id);
> >       if (!pool->elements)
> > @@ -199,7 +203,10 @@ static inline int __mempool_init(mempool_t *pool, int min_nr,
> >       while (pool->curr_nr < pool->min_nr) {
> >               void *element;
> >
> > -             element = pool->alloc(gfp_mask, pool->pool_data);
> > +             if (pool->alloc_pages_bulk)
> > +                     element = alloc_page(gfp_mask);
>
> Any reason to not use the callback from the caller for the bulk variant
> here? It looks like some users might expect consistency between the
> alloc / free callbacks for the pool. I.e., I'm not familiar with
> dm-crypt, but the code modified in the subsequent patches looks like it
> keeps an allocated page count. Will that still work with this, assuming
> these pages are freed through free_fn?

No special reason, this implementation just end up with fewer code
otherwise we should need to define a list, and manipulate the list,
seems like a little bit overkilling for initialization code.

Yes, that allocated page count works, just the pages in the pool are
not counted in the count anymore, 256 pages should be not a big deal
IMHO.

>
> > +             else
> > +                     element = pool->alloc(gfp_mask, pool->pool_data);
> >               if (unlikely(!element)) {
> >                       mempool_exit(pool);
> >                       return -ENOMEM;
> ...
> > @@ -457,6 +499,132 @@ void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
> >  }
> >  EXPORT_SYMBOL(mempool_alloc);
> >
> > +/**
> > + * mempool_alloc_pages_bulk - allocate a bulk of pagesfrom a specific
> > + *                           memory pool
> > + * @pool:       pointer to the memory pool which was allocated via
> > + *              mempool_create().
> > + * @gfp_mask:   the usual allocation bitmask.
> > + * @nr:         the number of requested pages.
> > + * @page_list:  the list the pages will be added to.
> > + * @page_array: the array the pages will be added to.
> > + *
> > + * this function only sleeps if the alloc_pages_bulk_fn() function sleeps
> > + * or the allocation can not be satisfied even though the mempool is depleted.
> > + * Note that due to preallocation, this function *never* fails when called
> > + * from process contexts. (it might fail if called from an IRQ context.)
> > + * Note: using __GFP_ZERO is not supported.  And the caller should not pass
> > + * in both valid page_list and page_array.
> > + *
> > + * Return: true when nr pages are allocated or false if not.  It is the
> > + *         caller's responsibility to free the partial allocated pages.
> > + */
> > +static bool mempool_alloc_pages_bulk(mempool_t *pool, gfp_t gfp_mask,
> > +                                  unsigned int nr,
> > +                                  struct list_head *page_list,
> > +                                  struct page **page_array)
> > +{
> > +     unsigned long flags;
> > +     wait_queue_entry_t wait;
> > +     gfp_t gfp_temp;
> > +     int i;
> > +     unsigned int ret, nr_remaining;
> > +     struct page *page;
> > +
>
> This looks like a lot of duplicate boilerplate from mempool_alloc().
> Could this instead do something like: rename the former to
> __mempool_alloc() and add a count parameter, implement bulk alloc
> support in there for count > 1, then let traditional (i.e., non-bulk)
> mempool_alloc() callers pass a count of 1?

Thanks for the suggestion. Yeah, the duplicate code is not perfect. I
thought about this way too, but it may need to have a lot of "if
(count > 0)" of "if (is_bulk_alloc) " if a flag is used in the code to
handle the bulk allocation, for example, calculate remaining nr, loop
to remove element from the pool, manipulate list or array, etc. Seems
not that readable IMHO.

We may be able to extract some common code into shared helpers, for
example, the gfp sanitization and wait logic.

>
> Along the same lines, I also wonder if there's any value in generic bulk
> alloc support for mempool. For example, I suppose technically this could
> be implemented via one change to support a pool->alloc_bulk() callback
> that any user could implement via a loop if they wanted
> mempool_alloc_bulk() support backed by a preallocated pool. The page
> based user could then just use that to call alloc_pages_bulk_*() as an
> optimization without the mempool layer needing to know or care about
> whether the underlying elements are pages or not. Hm?

Thanks for the suggestion. Actually I thought about this too. But the
memory space overhead, particularly stack space seems like a
showstopper to me. We just can put the pointers into an array, but
this may consume a significant amount of stack memory. One pointer is
8 bytes, 256 objects imply 2K stack space. Of course the users could
move the array into a dynamic allocated data structure, but this may
need the users modify their driver. Bulk kmalloc via kmalloc_array()
may be fine, this is the only usercase other than pages I could think
of.

>
> Brian
>
> > +     VM_WARN_ON_ONCE(gfp_mask & __GFP_ZERO);
> > +     might_alloc(gfp_mask);
> > +
> > +     gfp_mask |= __GFP_NOMEMALLOC;   /* don't allocate emergency reserves */
> > +     gfp_mask |= __GFP_NORETRY;      /* don't loop in __alloc_pages */
> > +     gfp_mask |= __GFP_NOWARN;       /* failures are OK */
> > +
> > +     gfp_temp = gfp_mask & ~(__GFP_DIRECT_RECLAIM|__GFP_IO);
> > +
> > +repeat_alloc:
> > +     i = 0;
> > +     ret = pool->alloc_pages_bulk(gfp_temp, nr, pool->pool_data, page_list,
> > +                                  page_array);
> > +
> > +     if (ret == nr)
> > +             return true;
> > +
> > +     nr_remaining = nr - ret;
> > +
> > +     spin_lock_irqsave(&pool->lock, flags);
> > +     /* Allocate page from the pool and add to the list or array */
> > +     while (pool->curr_nr && (nr_remaining > 0)) {
> > +             page = remove_element(pool);
> > +             spin_unlock_irqrestore(&pool->lock, flags);
> > +             smp_wmb();
> > +
> > +             kmemleak_update_trace((void *)page);
> > +
> > +             if (page_list)
> > +                     list_add(&page->lru, page_list);
> > +             else
> > +                     page_array[ret + i] = page;
> > +
> > +             i++;
> > +             nr_remaining--;
> > +
> > +             spin_lock_irqsave(&pool->lock, flags);
> > +     }
> > +
> > +     spin_unlock_irqrestore(&pool->lock, flags);
> > +
> > +     if (!nr_remaining)
> > +             return true;
> > +
> > +     /*
> > +      * The bulk allocator counts in the populated pages for array,
> > +      * but don't do it for list.
> > +      */
> > +     if (page_list)
> > +             nr = nr_remaining;
> > +
> > +     /*
> > +      * We use gfp mask w/o direct reclaim or IO for the first round.  If
> > +      * alloc failed with that and @pool was empty, retry immediately.
> > +      */
> > +     if (gfp_temp != gfp_mask) {
> > +             gfp_temp = gfp_mask;
> > +             goto repeat_alloc;
> > +     }
> > +
> > +     /* We must not sleep if !__GFP_DIRECT_RECLAIM */
> > +     if (!(gfp_mask & __GFP_DIRECT_RECLAIM))
> > +             return false;
> > +
> > +     /* Let's wait for someone else to return an element to @pool */
> > +     init_wait(&wait);
> > +     prepare_to_wait(&pool->wait, &wait, TASK_UNINTERRUPTIBLE);
> > +
> > +     /*
> > +      * FIXME: this should be io_schedule().  The timeout is there as a
> > +      * workaround for some DM problems in 2.6.18.
> > +      */
> > +     io_schedule_timeout(5*HZ);
> > +
> > +     finish_wait(&pool->wait, &wait);
> > +     goto repeat_alloc;
> > +}
> > +
> > +bool mempool_alloc_pages_bulk_list(mempool_t *pool, gfp_t gfp_mask,
> > +                                unsigned int nr,
> > +                                struct list_head *page_list)
> > +{
> > +     return mempool_alloc_pages_bulk(pool, gfp_mask, nr, page_list, NULL);
> > +}
> > +EXPORT_SYMBOL(mempool_alloc_pages_bulk_list);
> > +
> > +bool mempool_alloc_pages_bulk_array(mempool_t *pool, gfp_t gfp_mask,
> > +                                 unsigned int nr,
> > +                                 struct page **page_array)
> > +{
> > +     return mempool_alloc_pages_bulk(pool, gfp_mask, nr, NULL, page_array);
> > +}
> > +EXPORT_SYMBOL(mempool_alloc_pages_bulk_array);
> > +
> >  /**
> >   * mempool_free - return an element to the pool.
> >   * @element:   pool element pointer.
> > --
> > 2.26.3
> >
>
Mel Gorman Oct. 13, 2022, 12:38 p.m. UTC | #4
On Wed, Oct 05, 2022 at 11:03:39AM -0700, Yang Shi wrote:
> Since v5.13 the page bulk allocator was introduced to allocate order-0
> pages in bulk.  There are a few mempool allocator callers which does
> order-0 page allocation in a loop, for example, dm-crypt, f2fs compress,
> etc.  A mempool page bulk allocator seems useful.  So introduce the
> mempool page bulk allocator.
> 
> It introduces the below APIs:
>   - mempool_init_pages_bulk()
>   - mempool_create_pages_bulk()
> They initialize the mempool for page bulk allocator.  The pool is filled
> by alloc_page() in a loop.
> 
>   - mempool_alloc_pages_bulk_list()
>   - mempool_alloc_pages_bulk_array()
> They do bulk allocation from mempool.
> They do the below conceptually:
>   1. Call bulk page allocator
>   2. If the allocation is fulfilled then return otherwise try to
>      allocate the remaining pages from the mempool
>   3. If it is fulfilled then return otherwise retry from #1 with sleepable
>      gfp
>   4. If it is still failed, sleep for a while to wait for the mempool is
>      refilled, then retry from #1
> The populated pages will stay on the list or array until the callers
> consume them or free them.
> Since mempool allocator is guaranteed to success in the sleepable context,
> so the two APIs return true for success or false for fail.  It is the
> caller's responsibility to handle failure case (partial allocation), just
> like the page bulk allocator.
> 
> The mempool typically is an object agnostic allocator, but bulk allocation
> is only supported by pages, so the mempool bulk allocator is for page
> allocation only as well.
> 
> Signed-off-by: Yang Shi <shy828301@gmail.com>

Overall, I think it's an ok approach and certainly a good use case for
the bulk allocator.

The main concern that I have is that the dm-crypt use case doesn't really
want to use lists as such and it's just a means for collecting pages to pass
to bio_add_page(). bio_add_page() is working with arrays but you cannot
use that array directly as any change to how that array is populated will
then explode. Unfortunately, what you have is adding pages to a list to
take them off the list and put them in an array and that is inefficient.

How about this

1. Add a callback to __alloc_pages_bulk() that takes a page as a
   parameter like bulk_add_page() or whatever.

2. For page_list == NULL && page_array == NULL, the callback is used

3. Add alloc_pages_bulk_cb() that passes in the name of a callback
   function

4. In the dm-crypt case, use the callback to pass the page to bio_add_page
   for the new page allocated.

It's not free because there will be an additional function call for every
page bulk allocated but I suspect that's cheaper than adding a pile of
pages to a list just to take them off again. It also avoids adding a user
for the bulk allocator list interface that does not even want a list.

It might mean that there is additional cleanup work for __alloc_pages_bulk
to abstract away whether a list, array or cb is used but nothing
impossible.
Yang Shi Oct. 13, 2022, 8:16 p.m. UTC | #5
On Thu, Oct 13, 2022 at 5:38 AM Mel Gorman <mgorman@techsingularity.net> wrote:
>
> On Wed, Oct 05, 2022 at 11:03:39AM -0700, Yang Shi wrote:
> > Since v5.13 the page bulk allocator was introduced to allocate order-0
> > pages in bulk.  There are a few mempool allocator callers which does
> > order-0 page allocation in a loop, for example, dm-crypt, f2fs compress,
> > etc.  A mempool page bulk allocator seems useful.  So introduce the
> > mempool page bulk allocator.
> >
> > It introduces the below APIs:
> >   - mempool_init_pages_bulk()
> >   - mempool_create_pages_bulk()
> > They initialize the mempool for page bulk allocator.  The pool is filled
> > by alloc_page() in a loop.
> >
> >   - mempool_alloc_pages_bulk_list()
> >   - mempool_alloc_pages_bulk_array()
> > They do bulk allocation from mempool.
> > They do the below conceptually:
> >   1. Call bulk page allocator
> >   2. If the allocation is fulfilled then return otherwise try to
> >      allocate the remaining pages from the mempool
> >   3. If it is fulfilled then return otherwise retry from #1 with sleepable
> >      gfp
> >   4. If it is still failed, sleep for a while to wait for the mempool is
> >      refilled, then retry from #1
> > The populated pages will stay on the list or array until the callers
> > consume them or free them.
> > Since mempool allocator is guaranteed to success in the sleepable context,
> > so the two APIs return true for success or false for fail.  It is the
> > caller's responsibility to handle failure case (partial allocation), just
> > like the page bulk allocator.
> >
> > The mempool typically is an object agnostic allocator, but bulk allocation
> > is only supported by pages, so the mempool bulk allocator is for page
> > allocation only as well.
> >
> > Signed-off-by: Yang Shi <shy828301@gmail.com>
>
> Overall, I think it's an ok approach and certainly a good use case for
> the bulk allocator.
>
> The main concern that I have is that the dm-crypt use case doesn't really
> want to use lists as such and it's just a means for collecting pages to pass
> to bio_add_page(). bio_add_page() is working with arrays but you cannot
> use that array directly as any change to how that array is populated will
> then explode. Unfortunately, what you have is adding pages to a list to
> take them off the list and put them in an array and that is inefficient.

Yeah, I didn't think of a better way to pass the pages to dm-crypt.

>
> How about this
>
> 1. Add a callback to __alloc_pages_bulk() that takes a page as a
>    parameter like bulk_add_page() or whatever.
>
> 2. For page_list == NULL && page_array == NULL, the callback is used
>
> 3. Add alloc_pages_bulk_cb() that passes in the name of a callback
>    function
>
> 4. In the dm-crypt case, use the callback to pass the page to bio_add_page
>    for the new page allocated.

Thank you so much for the suggestion. But I have a hard time
understanding how these work together. Do you mean call bio_add_page()
in the callback? But bio_add_page() needs other parameters. Or I
misunderstood you?

>
> It's not free because there will be an additional function call for every
> page bulk allocated but I suspect that's cheaper than adding a pile of
> pages to a list just to take them off again. It also avoids adding a user
> for the bulk allocator list interface that does not even want a list.
>
> It might mean that there is additional cleanup work for __alloc_pages_bulk
> to abstract away whether a list, array or cb is used but nothing
> impossible.
>
> --
> Mel Gorman
> SUSE Labs
Brian Foster Oct. 14, 2022, 12:03 p.m. UTC | #6
On Thu, Oct 06, 2022 at 11:43:21AM -0700, Yang Shi wrote:
> On Thu, Oct 6, 2022 at 7:47 AM Brian Foster <bfoster@redhat.com> wrote:
> >
> > On Wed, Oct 05, 2022 at 11:03:39AM -0700, Yang Shi wrote:
> > > Since v5.13 the page bulk allocator was introduced to allocate order-0
> > > pages in bulk.  There are a few mempool allocator callers which does
> > > order-0 page allocation in a loop, for example, dm-crypt, f2fs compress,
> > > etc.  A mempool page bulk allocator seems useful.  So introduce the
> > > mempool page bulk allocator.
> > >
> > > It introduces the below APIs:
> > >   - mempool_init_pages_bulk()
> > >   - mempool_create_pages_bulk()
> > > They initialize the mempool for page bulk allocator.  The pool is filled
> > > by alloc_page() in a loop.
> > >
> > >   - mempool_alloc_pages_bulk_list()
> > >   - mempool_alloc_pages_bulk_array()
> > > They do bulk allocation from mempool.
> > > They do the below conceptually:
> > >   1. Call bulk page allocator
> > >   2. If the allocation is fulfilled then return otherwise try to
> > >      allocate the remaining pages from the mempool
> > >   3. If it is fulfilled then return otherwise retry from #1 with sleepable
> > >      gfp
> > >   4. If it is still failed, sleep for a while to wait for the mempool is
> > >      refilled, then retry from #1
> > > The populated pages will stay on the list or array until the callers
> > > consume them or free them.
> > > Since mempool allocator is guaranteed to success in the sleepable context,
> > > so the two APIs return true for success or false for fail.  It is the
> > > caller's responsibility to handle failure case (partial allocation), just
> > > like the page bulk allocator.
> > >
> > > The mempool typically is an object agnostic allocator, but bulk allocation
> > > is only supported by pages, so the mempool bulk allocator is for page
> > > allocation only as well.
> > >
> > > Signed-off-by: Yang Shi <shy828301@gmail.com>
> > > ---
> >
> > Hi Yang,
> >
> > I'm not terribly familiar with either component so I'm probably missing
> > context/details, but just a couple high level thoughts when reading your
> > patches...
> >
> > >  include/linux/mempool.h |  19 ++++
> > >  mm/mempool.c            | 188 +++++++++++++++++++++++++++++++++++++---
> > >  2 files changed, 197 insertions(+), 10 deletions(-)
> > >
> > ...
> > > diff --git a/mm/mempool.c b/mm/mempool.c
> > > index ba32151f3843..7711ca2e6d66 100644
> > > --- a/mm/mempool.c
> > > +++ b/mm/mempool.c
> > > @@ -177,6 +177,7 @@ void mempool_destroy(mempool_t *pool)
> > >  EXPORT_SYMBOL(mempool_destroy);
> > >
> > >  static inline int __mempool_init(mempool_t *pool, int min_nr,
> > > +                              mempool_alloc_pages_bulk_t *alloc_pages_bulk_fn,
> > >                                mempool_alloc_t *alloc_fn,
> > >                                mempool_free_t *free_fn, void *pool_data,
> > >                                gfp_t gfp_mask, int node_id)
> > > @@ -186,8 +187,11 @@ static inline int __mempool_init(mempool_t *pool, int min_nr,
> > >       pool->pool_data = pool_data;
> > >       pool->alloc     = alloc_fn;
> > >       pool->free      = free_fn;
> > > +     pool->alloc_pages_bulk = alloc_pages_bulk_fn;
> > >       init_waitqueue_head(&pool->wait);
> > >
> > > +     WARN_ON_ONCE(alloc_pages_bulk_fn && alloc_fn);
> > > +
> > >       pool->elements = kmalloc_array_node(min_nr, sizeof(void *),
> > >                                           gfp_mask, node_id);
> > >       if (!pool->elements)
> > > @@ -199,7 +203,10 @@ static inline int __mempool_init(mempool_t *pool, int min_nr,
> > >       while (pool->curr_nr < pool->min_nr) {
> > >               void *element;
> > >
> > > -             element = pool->alloc(gfp_mask, pool->pool_data);
> > > +             if (pool->alloc_pages_bulk)
> > > +                     element = alloc_page(gfp_mask);
> >
> > Any reason to not use the callback from the caller for the bulk variant
> > here? It looks like some users might expect consistency between the
> > alloc / free callbacks for the pool. I.e., I'm not familiar with
> > dm-crypt, but the code modified in the subsequent patches looks like it
> > keeps an allocated page count. Will that still work with this, assuming
> > these pages are freed through free_fn?
> 
> No special reason, this implementation just end up with fewer code
> otherwise we should need to define a list, and manipulate the list,
> seems like a little bit overkilling for initialization code.
> 
> Yes, that allocated page count works, just the pages in the pool are
> not counted in the count anymore, 256 pages should be not a big deal
> IMHO.
> 

Ok. I defer to dm-crypt folks on whether/how much it might care about
pages being hidden from the accounting. My concern was partly that, but
also partly whether it was possible to break consistency between the
number of alloc and free callbacks to be expected. For example, wouldn't
these counters underflow if the mempool is torn down or shrunk (via
mempool_resize()), and thus the caller gets ->free() callbacks for pages
it never accounted for in the first place?

Brian

> >
> > > +             else
> > > +                     element = pool->alloc(gfp_mask, pool->pool_data);
> > >               if (unlikely(!element)) {
> > >                       mempool_exit(pool);
> > >                       return -ENOMEM;
> > ...
> > > @@ -457,6 +499,132 @@ void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
> > >  }
> > >  EXPORT_SYMBOL(mempool_alloc);
> > >
> > > +/**
> > > + * mempool_alloc_pages_bulk - allocate a bulk of pagesfrom a specific
> > > + *                           memory pool
> > > + * @pool:       pointer to the memory pool which was allocated via
> > > + *              mempool_create().
> > > + * @gfp_mask:   the usual allocation bitmask.
> > > + * @nr:         the number of requested pages.
> > > + * @page_list:  the list the pages will be added to.
> > > + * @page_array: the array the pages will be added to.
> > > + *
> > > + * this function only sleeps if the alloc_pages_bulk_fn() function sleeps
> > > + * or the allocation can not be satisfied even though the mempool is depleted.
> > > + * Note that due to preallocation, this function *never* fails when called
> > > + * from process contexts. (it might fail if called from an IRQ context.)
> > > + * Note: using __GFP_ZERO is not supported.  And the caller should not pass
> > > + * in both valid page_list and page_array.
> > > + *
> > > + * Return: true when nr pages are allocated or false if not.  It is the
> > > + *         caller's responsibility to free the partial allocated pages.
> > > + */
> > > +static bool mempool_alloc_pages_bulk(mempool_t *pool, gfp_t gfp_mask,
> > > +                                  unsigned int nr,
> > > +                                  struct list_head *page_list,
> > > +                                  struct page **page_array)
> > > +{
> > > +     unsigned long flags;
> > > +     wait_queue_entry_t wait;
> > > +     gfp_t gfp_temp;
> > > +     int i;
> > > +     unsigned int ret, nr_remaining;
> > > +     struct page *page;
> > > +
> >
> > This looks like a lot of duplicate boilerplate from mempool_alloc().
> > Could this instead do something like: rename the former to
> > __mempool_alloc() and add a count parameter, implement bulk alloc
> > support in there for count > 1, then let traditional (i.e., non-bulk)
> > mempool_alloc() callers pass a count of 1?
> 
> Thanks for the suggestion. Yeah, the duplicate code is not perfect. I
> thought about this way too, but it may need to have a lot of "if
> (count > 0)" of "if (is_bulk_alloc) " if a flag is used in the code to
> handle the bulk allocation, for example, calculate remaining nr, loop
> to remove element from the pool, manipulate list or array, etc. Seems
> not that readable IMHO.
> 
> We may be able to extract some common code into shared helpers, for
> example, the gfp sanitization and wait logic.
> 
> >
> > Along the same lines, I also wonder if there's any value in generic bulk
> > alloc support for mempool. For example, I suppose technically this could
> > be implemented via one change to support a pool->alloc_bulk() callback
> > that any user could implement via a loop if they wanted
> > mempool_alloc_bulk() support backed by a preallocated pool. The page
> > based user could then just use that to call alloc_pages_bulk_*() as an
> > optimization without the mempool layer needing to know or care about
> > whether the underlying elements are pages or not. Hm?
> 
> Thanks for the suggestion. Actually I thought about this too. But the
> memory space overhead, particularly stack space seems like a
> showstopper to me. We just can put the pointers into an array, but
> this may consume a significant amount of stack memory. One pointer is
> 8 bytes, 256 objects imply 2K stack space. Of course the users could
> move the array into a dynamic allocated data structure, but this may
> need the users modify their driver. Bulk kmalloc via kmalloc_array()
> may be fine, this is the only usercase other than pages I could think
> of.
> 
> >
> > Brian
> >
> > > +     VM_WARN_ON_ONCE(gfp_mask & __GFP_ZERO);
> > > +     might_alloc(gfp_mask);
> > > +
> > > +     gfp_mask |= __GFP_NOMEMALLOC;   /* don't allocate emergency reserves */
> > > +     gfp_mask |= __GFP_NORETRY;      /* don't loop in __alloc_pages */
> > > +     gfp_mask |= __GFP_NOWARN;       /* failures are OK */
> > > +
> > > +     gfp_temp = gfp_mask & ~(__GFP_DIRECT_RECLAIM|__GFP_IO);
> > > +
> > > +repeat_alloc:
> > > +     i = 0;
> > > +     ret = pool->alloc_pages_bulk(gfp_temp, nr, pool->pool_data, page_list,
> > > +                                  page_array);
> > > +
> > > +     if (ret == nr)
> > > +             return true;
> > > +
> > > +     nr_remaining = nr - ret;
> > > +
> > > +     spin_lock_irqsave(&pool->lock, flags);
> > > +     /* Allocate page from the pool and add to the list or array */
> > > +     while (pool->curr_nr && (nr_remaining > 0)) {
> > > +             page = remove_element(pool);
> > > +             spin_unlock_irqrestore(&pool->lock, flags);
> > > +             smp_wmb();
> > > +
> > > +             kmemleak_update_trace((void *)page);
> > > +
> > > +             if (page_list)
> > > +                     list_add(&page->lru, page_list);
> > > +             else
> > > +                     page_array[ret + i] = page;
> > > +
> > > +             i++;
> > > +             nr_remaining--;
> > > +
> > > +             spin_lock_irqsave(&pool->lock, flags);
> > > +     }
> > > +
> > > +     spin_unlock_irqrestore(&pool->lock, flags);
> > > +
> > > +     if (!nr_remaining)
> > > +             return true;
> > > +
> > > +     /*
> > > +      * The bulk allocator counts in the populated pages for array,
> > > +      * but don't do it for list.
> > > +      */
> > > +     if (page_list)
> > > +             nr = nr_remaining;
> > > +
> > > +     /*
> > > +      * We use gfp mask w/o direct reclaim or IO for the first round.  If
> > > +      * alloc failed with that and @pool was empty, retry immediately.
> > > +      */
> > > +     if (gfp_temp != gfp_mask) {
> > > +             gfp_temp = gfp_mask;
> > > +             goto repeat_alloc;
> > > +     }
> > > +
> > > +     /* We must not sleep if !__GFP_DIRECT_RECLAIM */
> > > +     if (!(gfp_mask & __GFP_DIRECT_RECLAIM))
> > > +             return false;
> > > +
> > > +     /* Let's wait for someone else to return an element to @pool */
> > > +     init_wait(&wait);
> > > +     prepare_to_wait(&pool->wait, &wait, TASK_UNINTERRUPTIBLE);
> > > +
> > > +     /*
> > > +      * FIXME: this should be io_schedule().  The timeout is there as a
> > > +      * workaround for some DM problems in 2.6.18.
> > > +      */
> > > +     io_schedule_timeout(5*HZ);
> > > +
> > > +     finish_wait(&pool->wait, &wait);
> > > +     goto repeat_alloc;
> > > +}
> > > +
> > > +bool mempool_alloc_pages_bulk_list(mempool_t *pool, gfp_t gfp_mask,
> > > +                                unsigned int nr,
> > > +                                struct list_head *page_list)
> > > +{
> > > +     return mempool_alloc_pages_bulk(pool, gfp_mask, nr, page_list, NULL);
> > > +}
> > > +EXPORT_SYMBOL(mempool_alloc_pages_bulk_list);
> > > +
> > > +bool mempool_alloc_pages_bulk_array(mempool_t *pool, gfp_t gfp_mask,
> > > +                                 unsigned int nr,
> > > +                                 struct page **page_array)
> > > +{
> > > +     return mempool_alloc_pages_bulk(pool, gfp_mask, nr, NULL, page_array);
> > > +}
> > > +EXPORT_SYMBOL(mempool_alloc_pages_bulk_array);
> > > +
> > >  /**
> > >   * mempool_free - return an element to the pool.
> > >   * @element:   pool element pointer.
> > > --
> > > 2.26.3
> > >
> >
>
Mel Gorman Oct. 17, 2022, 9:41 a.m. UTC | #7
On Thu, Oct 13, 2022 at 01:16:31PM -0700, Yang Shi wrote:
> On Thu, Oct 13, 2022 at 5:38 AM Mel Gorman <mgorman@techsingularity.net> wrote:
> >
> > On Wed, Oct 05, 2022 at 11:03:39AM -0700, Yang Shi wrote:
> > > Since v5.13 the page bulk allocator was introduced to allocate order-0
> > > pages in bulk.  There are a few mempool allocator callers which does
> > > order-0 page allocation in a loop, for example, dm-crypt, f2fs compress,
> > > etc.  A mempool page bulk allocator seems useful.  So introduce the
> > > mempool page bulk allocator.
> > >
> > > It introduces the below APIs:
> > >   - mempool_init_pages_bulk()
> > >   - mempool_create_pages_bulk()
> > > They initialize the mempool for page bulk allocator.  The pool is filled
> > > by alloc_page() in a loop.
> > >
> > >   - mempool_alloc_pages_bulk_list()
> > >   - mempool_alloc_pages_bulk_array()
> > > They do bulk allocation from mempool.
> > > They do the below conceptually:
> > >   1. Call bulk page allocator
> > >   2. If the allocation is fulfilled then return otherwise try to
> > >      allocate the remaining pages from the mempool
> > >   3. If it is fulfilled then return otherwise retry from #1 with sleepable
> > >      gfp
> > >   4. If it is still failed, sleep for a while to wait for the mempool is
> > >      refilled, then retry from #1
> > > The populated pages will stay on the list or array until the callers
> > > consume them or free them.
> > > Since mempool allocator is guaranteed to success in the sleepable context,
> > > so the two APIs return true for success or false for fail.  It is the
> > > caller's responsibility to handle failure case (partial allocation), just
> > > like the page bulk allocator.
> > >
> > > The mempool typically is an object agnostic allocator, but bulk allocation
> > > is only supported by pages, so the mempool bulk allocator is for page
> > > allocation only as well.
> > >
> > > Signed-off-by: Yang Shi <shy828301@gmail.com>
> >
> > Overall, I think it's an ok approach and certainly a good use case for
> > the bulk allocator.
> >
> > The main concern that I have is that the dm-crypt use case doesn't really
> > want to use lists as such and it's just a means for collecting pages to pass
> > to bio_add_page(). bio_add_page() is working with arrays but you cannot
> > use that array directly as any change to how that array is populated will
> > then explode. Unfortunately, what you have is adding pages to a list to
> > take them off the list and put them in an array and that is inefficient.
> 
> Yeah, I didn't think of a better way to pass the pages to dm-crypt.
> 
> >
> > How about this
> >
> > 1. Add a callback to __alloc_pages_bulk() that takes a page as a
> >    parameter like bulk_add_page() or whatever.
> >
> > 2. For page_list == NULL && page_array == NULL, the callback is used
> >
> > 3. Add alloc_pages_bulk_cb() that passes in the name of a callback
> >    function
> >
> > 4. In the dm-crypt case, use the callback to pass the page to bio_add_page
> >    for the new page allocated.
> 
> Thank you so much for the suggestion. But I have a hard time
> understanding how these work together. Do you mean call bio_add_page()
> in the callback? But bio_add_page() needs other parameters. Or I
> misunderstood you?
> 

I expected dm-crypt to define the callback. Using bio_add_page
directly would not work as the bulk allocator has no idea what to pass
bio_add_page. dm-crypt would likely need to create both a callback and an
opaque data structure passed as (void *) to track "clone" and "len"
Yang Shi Oct. 18, 2022, 5:51 p.m. UTC | #8
On Fri, Oct 14, 2022 at 5:03 AM Brian Foster <bfoster@redhat.com> wrote:
>
> On Thu, Oct 06, 2022 at 11:43:21AM -0700, Yang Shi wrote:
> > On Thu, Oct 6, 2022 at 7:47 AM Brian Foster <bfoster@redhat.com> wrote:
> > >
> > > On Wed, Oct 05, 2022 at 11:03:39AM -0700, Yang Shi wrote:
> > > > Since v5.13 the page bulk allocator was introduced to allocate order-0
> > > > pages in bulk.  There are a few mempool allocator callers which does
> > > > order-0 page allocation in a loop, for example, dm-crypt, f2fs compress,
> > > > etc.  A mempool page bulk allocator seems useful.  So introduce the
> > > > mempool page bulk allocator.
> > > >
> > > > It introduces the below APIs:
> > > >   - mempool_init_pages_bulk()
> > > >   - mempool_create_pages_bulk()
> > > > They initialize the mempool for page bulk allocator.  The pool is filled
> > > > by alloc_page() in a loop.
> > > >
> > > >   - mempool_alloc_pages_bulk_list()
> > > >   - mempool_alloc_pages_bulk_array()
> > > > They do bulk allocation from mempool.
> > > > They do the below conceptually:
> > > >   1. Call bulk page allocator
> > > >   2. If the allocation is fulfilled then return otherwise try to
> > > >      allocate the remaining pages from the mempool
> > > >   3. If it is fulfilled then return otherwise retry from #1 with sleepable
> > > >      gfp
> > > >   4. If it is still failed, sleep for a while to wait for the mempool is
> > > >      refilled, then retry from #1
> > > > The populated pages will stay on the list or array until the callers
> > > > consume them or free them.
> > > > Since mempool allocator is guaranteed to success in the sleepable context,
> > > > so the two APIs return true for success or false for fail.  It is the
> > > > caller's responsibility to handle failure case (partial allocation), just
> > > > like the page bulk allocator.
> > > >
> > > > The mempool typically is an object agnostic allocator, but bulk allocation
> > > > is only supported by pages, so the mempool bulk allocator is for page
> > > > allocation only as well.
> > > >
> > > > Signed-off-by: Yang Shi <shy828301@gmail.com>
> > > > ---
> > >
> > > Hi Yang,
> > >
> > > I'm not terribly familiar with either component so I'm probably missing
> > > context/details, but just a couple high level thoughts when reading your
> > > patches...
> > >
> > > >  include/linux/mempool.h |  19 ++++
> > > >  mm/mempool.c            | 188 +++++++++++++++++++++++++++++++++++++---
> > > >  2 files changed, 197 insertions(+), 10 deletions(-)
> > > >
> > > ...
> > > > diff --git a/mm/mempool.c b/mm/mempool.c
> > > > index ba32151f3843..7711ca2e6d66 100644
> > > > --- a/mm/mempool.c
> > > > +++ b/mm/mempool.c
> > > > @@ -177,6 +177,7 @@ void mempool_destroy(mempool_t *pool)
> > > >  EXPORT_SYMBOL(mempool_destroy);
> > > >
> > > >  static inline int __mempool_init(mempool_t *pool, int min_nr,
> > > > +                              mempool_alloc_pages_bulk_t *alloc_pages_bulk_fn,
> > > >                                mempool_alloc_t *alloc_fn,
> > > >                                mempool_free_t *free_fn, void *pool_data,
> > > >                                gfp_t gfp_mask, int node_id)
> > > > @@ -186,8 +187,11 @@ static inline int __mempool_init(mempool_t *pool, int min_nr,
> > > >       pool->pool_data = pool_data;
> > > >       pool->alloc     = alloc_fn;
> > > >       pool->free      = free_fn;
> > > > +     pool->alloc_pages_bulk = alloc_pages_bulk_fn;
> > > >       init_waitqueue_head(&pool->wait);
> > > >
> > > > +     WARN_ON_ONCE(alloc_pages_bulk_fn && alloc_fn);
> > > > +
> > > >       pool->elements = kmalloc_array_node(min_nr, sizeof(void *),
> > > >                                           gfp_mask, node_id);
> > > >       if (!pool->elements)
> > > > @@ -199,7 +203,10 @@ static inline int __mempool_init(mempool_t *pool, int min_nr,
> > > >       while (pool->curr_nr < pool->min_nr) {
> > > >               void *element;
> > > >
> > > > -             element = pool->alloc(gfp_mask, pool->pool_data);
> > > > +             if (pool->alloc_pages_bulk)
> > > > +                     element = alloc_page(gfp_mask);
> > >
> > > Any reason to not use the callback from the caller for the bulk variant
> > > here? It looks like some users might expect consistency between the
> > > alloc / free callbacks for the pool. I.e., I'm not familiar with
> > > dm-crypt, but the code modified in the subsequent patches looks like it
> > > keeps an allocated page count. Will that still work with this, assuming
> > > these pages are freed through free_fn?
> >
> > No special reason, this implementation just end up with fewer code
> > otherwise we should need to define a list, and manipulate the list,
> > seems like a little bit overkilling for initialization code.
> >
> > Yes, that allocated page count works, just the pages in the pool are
> > not counted in the count anymore, 256 pages should be not a big deal
> > IMHO.
> >
>
> Ok. I defer to dm-crypt folks on whether/how much it might care about
> pages being hidden from the accounting. My concern was partly that, but
> also partly whether it was possible to break consistency between the
> number of alloc and free callbacks to be expected. For example, wouldn't
> these counters underflow if the mempool is torn down or shrunk (via
> mempool_resize()), and thus the caller gets ->free() callbacks for pages
> it never accounted for in the first place?

For graceful tear down, all the pages should be freed before tear down
IMHO. For mempool resize, it may be possible, but dm-crypt doesn't
resize mempool IIRC. Anyway this counter is driver specific, it could
be inc'ed or dec'ed when the mempool is created or resized.

>
> Brian
>
> > >
> > > > +             else
> > > > +                     element = pool->alloc(gfp_mask, pool->pool_data);
> > > >               if (unlikely(!element)) {
> > > >                       mempool_exit(pool);
> > > >                       return -ENOMEM;
> > > ...
> > > > @@ -457,6 +499,132 @@ void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
> > > >  }
> > > >  EXPORT_SYMBOL(mempool_alloc);
> > > >
> > > > +/**
> > > > + * mempool_alloc_pages_bulk - allocate a bulk of pagesfrom a specific
> > > > + *                           memory pool
> > > > + * @pool:       pointer to the memory pool which was allocated via
> > > > + *              mempool_create().
> > > > + * @gfp_mask:   the usual allocation bitmask.
> > > > + * @nr:         the number of requested pages.
> > > > + * @page_list:  the list the pages will be added to.
> > > > + * @page_array: the array the pages will be added to.
> > > > + *
> > > > + * this function only sleeps if the alloc_pages_bulk_fn() function sleeps
> > > > + * or the allocation can not be satisfied even though the mempool is depleted.
> > > > + * Note that due to preallocation, this function *never* fails when called
> > > > + * from process contexts. (it might fail if called from an IRQ context.)
> > > > + * Note: using __GFP_ZERO is not supported.  And the caller should not pass
> > > > + * in both valid page_list and page_array.
> > > > + *
> > > > + * Return: true when nr pages are allocated or false if not.  It is the
> > > > + *         caller's responsibility to free the partial allocated pages.
> > > > + */
> > > > +static bool mempool_alloc_pages_bulk(mempool_t *pool, gfp_t gfp_mask,
> > > > +                                  unsigned int nr,
> > > > +                                  struct list_head *page_list,
> > > > +                                  struct page **page_array)
> > > > +{
> > > > +     unsigned long flags;
> > > > +     wait_queue_entry_t wait;
> > > > +     gfp_t gfp_temp;
> > > > +     int i;
> > > > +     unsigned int ret, nr_remaining;
> > > > +     struct page *page;
> > > > +
> > >
> > > This looks like a lot of duplicate boilerplate from mempool_alloc().
> > > Could this instead do something like: rename the former to
> > > __mempool_alloc() and add a count parameter, implement bulk alloc
> > > support in there for count > 1, then let traditional (i.e., non-bulk)
> > > mempool_alloc() callers pass a count of 1?
> >
> > Thanks for the suggestion. Yeah, the duplicate code is not perfect. I
> > thought about this way too, but it may need to have a lot of "if
> > (count > 0)" of "if (is_bulk_alloc) " if a flag is used in the code to
> > handle the bulk allocation, for example, calculate remaining nr, loop
> > to remove element from the pool, manipulate list or array, etc. Seems
> > not that readable IMHO.
> >
> > We may be able to extract some common code into shared helpers, for
> > example, the gfp sanitization and wait logic.
> >
> > >
> > > Along the same lines, I also wonder if there's any value in generic bulk
> > > alloc support for mempool. For example, I suppose technically this could
> > > be implemented via one change to support a pool->alloc_bulk() callback
> > > that any user could implement via a loop if they wanted
> > > mempool_alloc_bulk() support backed by a preallocated pool. The page
> > > based user could then just use that to call alloc_pages_bulk_*() as an
> > > optimization without the mempool layer needing to know or care about
> > > whether the underlying elements are pages or not. Hm?
> >
> > Thanks for the suggestion. Actually I thought about this too. But the
> > memory space overhead, particularly stack space seems like a
> > showstopper to me. We just can put the pointers into an array, but
> > this may consume a significant amount of stack memory. One pointer is
> > 8 bytes, 256 objects imply 2K stack space. Of course the users could
> > move the array into a dynamic allocated data structure, but this may
> > need the users modify their driver. Bulk kmalloc via kmalloc_array()
> > may be fine, this is the only usercase other than pages I could think
> > of.
> >
> > >
> > > Brian
> > >
> > > > +     VM_WARN_ON_ONCE(gfp_mask & __GFP_ZERO);
> > > > +     might_alloc(gfp_mask);
> > > > +
> > > > +     gfp_mask |= __GFP_NOMEMALLOC;   /* don't allocate emergency reserves */
> > > > +     gfp_mask |= __GFP_NORETRY;      /* don't loop in __alloc_pages */
> > > > +     gfp_mask |= __GFP_NOWARN;       /* failures are OK */
> > > > +
> > > > +     gfp_temp = gfp_mask & ~(__GFP_DIRECT_RECLAIM|__GFP_IO);
> > > > +
> > > > +repeat_alloc:
> > > > +     i = 0;
> > > > +     ret = pool->alloc_pages_bulk(gfp_temp, nr, pool->pool_data, page_list,
> > > > +                                  page_array);
> > > > +
> > > > +     if (ret == nr)
> > > > +             return true;
> > > > +
> > > > +     nr_remaining = nr - ret;
> > > > +
> > > > +     spin_lock_irqsave(&pool->lock, flags);
> > > > +     /* Allocate page from the pool and add to the list or array */
> > > > +     while (pool->curr_nr && (nr_remaining > 0)) {
> > > > +             page = remove_element(pool);
> > > > +             spin_unlock_irqrestore(&pool->lock, flags);
> > > > +             smp_wmb();
> > > > +
> > > > +             kmemleak_update_trace((void *)page);
> > > > +
> > > > +             if (page_list)
> > > > +                     list_add(&page->lru, page_list);
> > > > +             else
> > > > +                     page_array[ret + i] = page;
> > > > +
> > > > +             i++;
> > > > +             nr_remaining--;
> > > > +
> > > > +             spin_lock_irqsave(&pool->lock, flags);
> > > > +     }
> > > > +
> > > > +     spin_unlock_irqrestore(&pool->lock, flags);
> > > > +
> > > > +     if (!nr_remaining)
> > > > +             return true;
> > > > +
> > > > +     /*
> > > > +      * The bulk allocator counts in the populated pages for array,
> > > > +      * but don't do it for list.
> > > > +      */
> > > > +     if (page_list)
> > > > +             nr = nr_remaining;
> > > > +
> > > > +     /*
> > > > +      * We use gfp mask w/o direct reclaim or IO for the first round.  If
> > > > +      * alloc failed with that and @pool was empty, retry immediately.
> > > > +      */
> > > > +     if (gfp_temp != gfp_mask) {
> > > > +             gfp_temp = gfp_mask;
> > > > +             goto repeat_alloc;
> > > > +     }
> > > > +
> > > > +     /* We must not sleep if !__GFP_DIRECT_RECLAIM */
> > > > +     if (!(gfp_mask & __GFP_DIRECT_RECLAIM))
> > > > +             return false;
> > > > +
> > > > +     /* Let's wait for someone else to return an element to @pool */
> > > > +     init_wait(&wait);
> > > > +     prepare_to_wait(&pool->wait, &wait, TASK_UNINTERRUPTIBLE);
> > > > +
> > > > +     /*
> > > > +      * FIXME: this should be io_schedule().  The timeout is there as a
> > > > +      * workaround for some DM problems in 2.6.18.
> > > > +      */
> > > > +     io_schedule_timeout(5*HZ);
> > > > +
> > > > +     finish_wait(&pool->wait, &wait);
> > > > +     goto repeat_alloc;
> > > > +}
> > > > +
> > > > +bool mempool_alloc_pages_bulk_list(mempool_t *pool, gfp_t gfp_mask,
> > > > +                                unsigned int nr,
> > > > +                                struct list_head *page_list)
> > > > +{
> > > > +     return mempool_alloc_pages_bulk(pool, gfp_mask, nr, page_list, NULL);
> > > > +}
> > > > +EXPORT_SYMBOL(mempool_alloc_pages_bulk_list);
> > > > +
> > > > +bool mempool_alloc_pages_bulk_array(mempool_t *pool, gfp_t gfp_mask,
> > > > +                                 unsigned int nr,
> > > > +                                 struct page **page_array)
> > > > +{
> > > > +     return mempool_alloc_pages_bulk(pool, gfp_mask, nr, NULL, page_array);
> > > > +}
> > > > +EXPORT_SYMBOL(mempool_alloc_pages_bulk_array);
> > > > +
> > > >  /**
> > > >   * mempool_free - return an element to the pool.
> > > >   * @element:   pool element pointer.
> > > > --
> > > > 2.26.3
> > > >
> > >
> >
>
Yang Shi Oct. 18, 2022, 6:01 p.m. UTC | #9
On Mon, Oct 17, 2022 at 2:41 AM Mel Gorman <mgorman@techsingularity.net> wrote:
>
> On Thu, Oct 13, 2022 at 01:16:31PM -0700, Yang Shi wrote:
> > On Thu, Oct 13, 2022 at 5:38 AM Mel Gorman <mgorman@techsingularity.net> wrote:
> > >
> > > On Wed, Oct 05, 2022 at 11:03:39AM -0700, Yang Shi wrote:
> > > > Since v5.13 the page bulk allocator was introduced to allocate order-0
> > > > pages in bulk.  There are a few mempool allocator callers which does
> > > > order-0 page allocation in a loop, for example, dm-crypt, f2fs compress,
> > > > etc.  A mempool page bulk allocator seems useful.  So introduce the
> > > > mempool page bulk allocator.
> > > >
> > > > It introduces the below APIs:
> > > >   - mempool_init_pages_bulk()
> > > >   - mempool_create_pages_bulk()
> > > > They initialize the mempool for page bulk allocator.  The pool is filled
> > > > by alloc_page() in a loop.
> > > >
> > > >   - mempool_alloc_pages_bulk_list()
> > > >   - mempool_alloc_pages_bulk_array()
> > > > They do bulk allocation from mempool.
> > > > They do the below conceptually:
> > > >   1. Call bulk page allocator
> > > >   2. If the allocation is fulfilled then return otherwise try to
> > > >      allocate the remaining pages from the mempool
> > > >   3. If it is fulfilled then return otherwise retry from #1 with sleepable
> > > >      gfp
> > > >   4. If it is still failed, sleep for a while to wait for the mempool is
> > > >      refilled, then retry from #1
> > > > The populated pages will stay on the list or array until the callers
> > > > consume them or free them.
> > > > Since mempool allocator is guaranteed to success in the sleepable context,
> > > > so the two APIs return true for success or false for fail.  It is the
> > > > caller's responsibility to handle failure case (partial allocation), just
> > > > like the page bulk allocator.
> > > >
> > > > The mempool typically is an object agnostic allocator, but bulk allocation
> > > > is only supported by pages, so the mempool bulk allocator is for page
> > > > allocation only as well.
> > > >
> > > > Signed-off-by: Yang Shi <shy828301@gmail.com>
> > >
> > > Overall, I think it's an ok approach and certainly a good use case for
> > > the bulk allocator.
> > >
> > > The main concern that I have is that the dm-crypt use case doesn't really
> > > want to use lists as such and it's just a means for collecting pages to pass
> > > to bio_add_page(). bio_add_page() is working with arrays but you cannot
> > > use that array directly as any change to how that array is populated will
> > > then explode. Unfortunately, what you have is adding pages to a list to
> > > take them off the list and put them in an array and that is inefficient.
> >
> > Yeah, I didn't think of a better way to pass the pages to dm-crypt.
> >
> > >
> > > How about this
> > >
> > > 1. Add a callback to __alloc_pages_bulk() that takes a page as a
> > >    parameter like bulk_add_page() or whatever.
> > >
> > > 2. For page_list == NULL && page_array == NULL, the callback is used
> > >
> > > 3. Add alloc_pages_bulk_cb() that passes in the name of a callback
> > >    function
> > >
> > > 4. In the dm-crypt case, use the callback to pass the page to bio_add_page
> > >    for the new page allocated.
> >
> > Thank you so much for the suggestion. But I have a hard time
> > understanding how these work together. Do you mean call bio_add_page()
> > in the callback? But bio_add_page() needs other parameters. Or I
> > misunderstood you?
> >
>
> I expected dm-crypt to define the callback. Using bio_add_page
> directly would not work as the bulk allocator has no idea what to pass
> bio_add_page. dm-crypt would likely need to create both a callback and an
> opaque data structure passed as (void *) to track "clone" and "len"

I see. Yeah, we have to pass the "clone" and "len" to the callback via
pool_data. It should not be hard since dm-crypt already uses
crypt_config to maintain a counter for allocated pages, we should just
need to pass the struct to the callback as a parameter.

But I'm wondering whether this is worth it or not? Will it make the
code harder to follow?

>
> --
> Mel Gorman
> SUSE Labs
Mel Gorman Oct. 21, 2022, 9:19 a.m. UTC | #10
On Tue, Oct 18, 2022 at 11:01:31AM -0700, Yang Shi wrote:
> > > Yeah, I didn't think of a better way to pass the pages to dm-crypt.
> > >
> > > >
> > > > How about this
> > > >
> > > > 1. Add a callback to __alloc_pages_bulk() that takes a page as a
> > > >    parameter like bulk_add_page() or whatever.
> > > >
> > > > 2. For page_list == NULL && page_array == NULL, the callback is used
> > > >
> > > > 3. Add alloc_pages_bulk_cb() that passes in the name of a callback
> > > >    function
> > > >
> > > > 4. In the dm-crypt case, use the callback to pass the page to bio_add_page
> > > >    for the new page allocated.
> > >
> > > Thank you so much for the suggestion. But I have a hard time
> > > understanding how these work together. Do you mean call bio_add_page()
> > > in the callback? But bio_add_page() needs other parameters. Or I
> > > misunderstood you?
> > >
> >
> > I expected dm-crypt to define the callback. Using bio_add_page
> > directly would not work as the bulk allocator has no idea what to pass
> > bio_add_page. dm-crypt would likely need to create both a callback and an
> > opaque data structure passed as (void *) to track "clone" and "len"
> 
> I see. Yeah, we have to pass the "clone" and "len" to the callback via
> pool_data. It should not be hard since dm-crypt already uses
> crypt_config to maintain a counter for allocated pages, we should just
> need to pass the struct to the callback as a parameter.
> 
> But I'm wondering whether this is worth it or not? Will it make the
> code harder to follow?
> 

A little because a callback is involved but it's not the only place in the
kernel where a callback is used like this and a comment should suffice. It
should be faster than list manipulation if nothing else. Mostly, I'm wary
of adding the first user of the list interface for the bulk allocator that
does not even want a list. If there isn't a user of the list interface
that *requires* it, the support will simply be deleted as dead code.
Yang Shi Oct. 21, 2022, 9:04 p.m. UTC | #11
On Fri, Oct 21, 2022 at 2:19 AM Mel Gorman <mgorman@techsingularity.net> wrote:
>
> On Tue, Oct 18, 2022 at 11:01:31AM -0700, Yang Shi wrote:
> > > > Yeah, I didn't think of a better way to pass the pages to dm-crypt.
> > > >
> > > > >
> > > > > How about this
> > > > >
> > > > > 1. Add a callback to __alloc_pages_bulk() that takes a page as a
> > > > >    parameter like bulk_add_page() or whatever.
> > > > >
> > > > > 2. For page_list == NULL && page_array == NULL, the callback is used
> > > > >
> > > > > 3. Add alloc_pages_bulk_cb() that passes in the name of a callback
> > > > >    function
> > > > >
> > > > > 4. In the dm-crypt case, use the callback to pass the page to bio_add_page
> > > > >    for the new page allocated.
> > > >
> > > > Thank you so much for the suggestion. But I have a hard time
> > > > understanding how these work together. Do you mean call bio_add_page()
> > > > in the callback? But bio_add_page() needs other parameters. Or I
> > > > misunderstood you?
> > > >
> > >
> > > I expected dm-crypt to define the callback. Using bio_add_page
> > > directly would not work as the bulk allocator has no idea what to pass
> > > bio_add_page. dm-crypt would likely need to create both a callback and an
> > > opaque data structure passed as (void *) to track "clone" and "len"
> >
> > I see. Yeah, we have to pass the "clone" and "len" to the callback via
> > pool_data. It should not be hard since dm-crypt already uses
> > crypt_config to maintain a counter for allocated pages, we should just
> > need to pass the struct to the callback as a parameter.
> >
> > But I'm wondering whether this is worth it or not? Will it make the
> > code harder to follow?
> >
>
> A little because a callback is involved but it's not the only place in the
> kernel where a callback is used like this and a comment should suffice. It
> should be faster than list manipulation if nothing else. Mostly, I'm wary
> of adding the first user of the list interface for the bulk allocator that
> does not even want a list. If there isn't a user of the list interface
> that *requires* it, the support will simply be deleted as dead code.

Thanks, I see your point. Will work on the new version to implement
the callback approach.

>
> --
> Mel Gorman
> SUSE Labs
diff mbox series

Patch

diff --git a/include/linux/mempool.h b/include/linux/mempool.h
index 0c964ac107c2..8bad28bceaa8 100644
--- a/include/linux/mempool.h
+++ b/include/linux/mempool.h
@@ -13,6 +13,11 @@  struct kmem_cache;
 typedef void * (mempool_alloc_t)(gfp_t gfp_mask, void *pool_data);
 typedef void (mempool_free_t)(void *element, void *pool_data);
 
+typedef unsigned int (mempool_alloc_pages_bulk_t)(gfp_t gfp_mask,
+					unsigned int nr, void *pool_data,
+					struct list_head *page_list,
+					struct page **page_array);
+
 typedef struct mempool_s {
 	spinlock_t lock;
 	int min_nr;		/* nr of elements at *elements */
@@ -22,6 +27,7 @@  typedef struct mempool_s {
 	void *pool_data;
 	mempool_alloc_t *alloc;
 	mempool_free_t *free;
+	mempool_alloc_pages_bulk_t *alloc_pages_bulk;
 	wait_queue_head_t wait;
 } mempool_t;
 
@@ -36,18 +42,31 @@  int mempool_init_node(mempool_t *pool, int min_nr, mempool_alloc_t *alloc_fn,
 		      gfp_t gfp_mask, int node_id);
 int mempool_init(mempool_t *pool, int min_nr, mempool_alloc_t *alloc_fn,
 		 mempool_free_t *free_fn, void *pool_data);
+int mempool_init_pages_bulk(mempool_t *pool, int min_nr,
+			    mempool_alloc_pages_bulk_t *alloc_pages_bulk_fn,
+			    mempool_free_t *free_fn, void *pool_data);
 
 extern mempool_t *mempool_create(int min_nr, mempool_alloc_t *alloc_fn,
 			mempool_free_t *free_fn, void *pool_data);
 extern mempool_t *mempool_create_node(int min_nr, mempool_alloc_t *alloc_fn,
 			mempool_free_t *free_fn, void *pool_data,
 			gfp_t gfp_mask, int nid);
+extern mempool_t *mempool_create_pages_bulk(int min_nr,
+			mempool_alloc_pages_bulk_t *alloc_pages_bulk_fn,
+			mempool_free_t *free_fn, void *pool_data);
 
 extern int mempool_resize(mempool_t *pool, int new_min_nr);
 extern void mempool_destroy(mempool_t *pool);
 extern void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask) __malloc;
 extern void mempool_free(void *element, mempool_t *pool);
 
+extern bool mempool_alloc_pages_bulk_list(mempool_t *pool, gfp_t gfp_mask,
+					  unsigned int nr,
+					  struct list_head *page_list);
+extern bool mempool_alloc_pages_bulk_array(mempool_t *pool, gfp_t gfp_mask,
+					   unsigned int nr,
+					   struct page **page_array);
+
 /*
  * A mempool_alloc_t and mempool_free_t that get the memory from
  * a slab cache that is passed in through pool_data.
diff --git a/mm/mempool.c b/mm/mempool.c
index ba32151f3843..7711ca2e6d66 100644
--- a/mm/mempool.c
+++ b/mm/mempool.c
@@ -177,6 +177,7 @@  void mempool_destroy(mempool_t *pool)
 EXPORT_SYMBOL(mempool_destroy);
 
 static inline int __mempool_init(mempool_t *pool, int min_nr,
+				 mempool_alloc_pages_bulk_t *alloc_pages_bulk_fn,
 				 mempool_alloc_t *alloc_fn,
 				 mempool_free_t *free_fn, void *pool_data,
 				 gfp_t gfp_mask, int node_id)
@@ -186,8 +187,11 @@  static inline int __mempool_init(mempool_t *pool, int min_nr,
 	pool->pool_data = pool_data;
 	pool->alloc	= alloc_fn;
 	pool->free	= free_fn;
+	pool->alloc_pages_bulk = alloc_pages_bulk_fn;
 	init_waitqueue_head(&pool->wait);
 
+	WARN_ON_ONCE(alloc_pages_bulk_fn && alloc_fn);
+
 	pool->elements = kmalloc_array_node(min_nr, sizeof(void *),
 					    gfp_mask, node_id);
 	if (!pool->elements)
@@ -199,7 +203,10 @@  static inline int __mempool_init(mempool_t *pool, int min_nr,
 	while (pool->curr_nr < pool->min_nr) {
 		void *element;
 
-		element = pool->alloc(gfp_mask, pool->pool_data);
+		if (pool->alloc_pages_bulk)
+			element = alloc_page(gfp_mask);
+		else
+			element = pool->alloc(gfp_mask, pool->pool_data);
 		if (unlikely(!element)) {
 			mempool_exit(pool);
 			return -ENOMEM;
@@ -214,7 +221,7 @@  int mempool_init_node(mempool_t *pool, int min_nr, mempool_alloc_t *alloc_fn,
 		      mempool_free_t *free_fn, void *pool_data,
 		      gfp_t gfp_mask, int node_id)
 {
-	return __mempool_init(pool, min_nr, alloc_fn, free_fn, pool_data,
+	return __mempool_init(pool, min_nr, NULL, alloc_fn, free_fn, pool_data,
 			      gfp_mask, node_id);
 }
 EXPORT_SYMBOL(mempool_init_node);
@@ -236,15 +243,40 @@  EXPORT_SYMBOL(mempool_init_node);
 int mempool_init(mempool_t *pool, int min_nr, mempool_alloc_t *alloc_fn,
 		 mempool_free_t *free_fn, void *pool_data)
 {
-	return __mempool_init(pool, min_nr, alloc_fn, free_fn,
+	return __mempool_init(pool, min_nr, NULL, alloc_fn, free_fn,
 			      pool_data, GFP_KERNEL, NUMA_NO_NODE);
 
 }
 EXPORT_SYMBOL(mempool_init);
 
-static mempool_t *__mempool_create(int min_nr, mempool_alloc_t *alloc_fn,
-				   mempool_free_t *free_fn, void *pool_data,
-				   gfp_t gfp_mask, int node_id)
+/**
+ * mempool_init_pages_bulk - initialize a pages pool for bulk allocator
+ * @pool: pointer to the memory pool that should be initialized
+ * @min_nr: the minimum number of elements guaranteed to be
+ *          allocated for this pool.
+ * @alloc_pages_bulk_fn: user-defined pages bulk allocation function.
+ * @free_fn: user-defined element-freeing function.
+ * @pool_data: optional private data available to the user-defined functions.
+ *
+ * Like mempool_create(), but initializes the pool in (i.e. embedded in another
+ * structure).
+ *
+ * Return: %0 on success, negative error code otherwise.
+ */
+int mempool_init_pages_bulk(mempool_t *pool, int min_nr,
+			    mempool_alloc_pages_bulk_t *alloc_pages_bulk_fn,
+			    mempool_free_t *free_fn, void *pool_data)
+{
+	return __mempool_init(pool, min_nr, alloc_pages_bulk_fn, NULL,
+			      free_fn, pool_data, GFP_KERNEL, NUMA_NO_NODE);
+}
+EXPORT_SYMBOL(mempool_init_pages_bulk);
+
+static mempool_t *__mempool_create(int min_nr,
+			mempool_alloc_pages_bulk_t *alloc_pages_bulk_fn,
+			mempool_alloc_t *alloc_fn,
+			mempool_free_t *free_fn, void *pool_data,
+			gfp_t gfp_mask, int node_id)
 {
 	mempool_t *pool;
 
@@ -252,8 +284,8 @@  static mempool_t *__mempool_create(int min_nr, mempool_alloc_t *alloc_fn,
 	if (!pool)
 		return NULL;
 
-	if (__mempool_init(pool, min_nr, alloc_fn, free_fn, pool_data,
-			   gfp_mask, node_id)) {
+	if (__mempool_init(pool, min_nr, alloc_pages_bulk_fn, alloc_fn,
+			   free_fn, pool_data, gfp_mask, node_id)) {
 		kfree(pool);
 		return NULL;
 	}
@@ -280,7 +312,7 @@  static mempool_t *__mempool_create(int min_nr, mempool_alloc_t *alloc_fn,
 mempool_t *mempool_create(int min_nr, mempool_alloc_t *alloc_fn,
 				mempool_free_t *free_fn, void *pool_data)
 {
-	return __mempool_create(min_nr, alloc_fn, free_fn, pool_data,
+	return __mempool_create(min_nr, NULL, alloc_fn, free_fn, pool_data,
 				GFP_KERNEL, NUMA_NO_NODE);
 }
 EXPORT_SYMBOL(mempool_create);
@@ -289,11 +321,21 @@  mempool_t *mempool_create_node(int min_nr, mempool_alloc_t *alloc_fn,
 			       mempool_free_t *free_fn, void *pool_data,
 			       gfp_t gfp_mask, int node_id)
 {
-	return __mempool_create(min_nr, alloc_fn, free_fn, pool_data,
+	return __mempool_create(min_nr, NULL, alloc_fn, free_fn, pool_data,
 				gfp_mask, node_id);
 }
 EXPORT_SYMBOL(mempool_create_node);
 
+mempool_t* mempool_create_pages_bulk(int min_nr,
+			mempool_alloc_pages_bulk_t *alloc_pages_bulk_fn,
+			mempool_free_t *free_fn, void *pool_data)
+{
+	return __mempool_create(min_nr, alloc_pages_bulk_fn, NULL,
+				free_fn, pool_data, GFP_KERNEL,
+				NUMA_NO_NODE);
+}
+EXPORT_SYMBOL(mempool_create_pages_bulk);
+
 /**
  * mempool_resize - resize an existing memory pool
  * @pool:       pointer to the memory pool which was allocated via
@@ -457,6 +499,132 @@  void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
 }
 EXPORT_SYMBOL(mempool_alloc);
 
+/**
+ * mempool_alloc_pages_bulk - allocate a bulk of pagesfrom a specific
+ *                           memory pool
+ * @pool:       pointer to the memory pool which was allocated via
+ *              mempool_create().
+ * @gfp_mask:   the usual allocation bitmask.
+ * @nr:         the number of requested pages.
+ * @page_list:  the list the pages will be added to.
+ * @page_array: the array the pages will be added to.
+ *
+ * this function only sleeps if the alloc_pages_bulk_fn() function sleeps
+ * or the allocation can not be satisfied even though the mempool is depleted.
+ * Note that due to preallocation, this function *never* fails when called
+ * from process contexts. (it might fail if called from an IRQ context.)
+ * Note: using __GFP_ZERO is not supported.  And the caller should not pass
+ * in both valid page_list and page_array.
+ *
+ * Return: true when nr pages are allocated or false if not.  It is the
+ *         caller's responsibility to free the partial allocated pages.
+ */
+static bool mempool_alloc_pages_bulk(mempool_t *pool, gfp_t gfp_mask,
+				     unsigned int nr,
+				     struct list_head *page_list,
+				     struct page **page_array)
+{
+	unsigned long flags;
+	wait_queue_entry_t wait;
+	gfp_t gfp_temp;
+	int i;
+	unsigned int ret, nr_remaining;
+	struct page *page;
+
+	VM_WARN_ON_ONCE(gfp_mask & __GFP_ZERO);
+	might_alloc(gfp_mask);
+
+	gfp_mask |= __GFP_NOMEMALLOC;   /* don't allocate emergency reserves */
+	gfp_mask |= __GFP_NORETRY;      /* don't loop in __alloc_pages */
+	gfp_mask |= __GFP_NOWARN;       /* failures are OK */
+
+	gfp_temp = gfp_mask & ~(__GFP_DIRECT_RECLAIM|__GFP_IO);
+
+repeat_alloc:
+	i = 0;
+	ret = pool->alloc_pages_bulk(gfp_temp, nr, pool->pool_data, page_list,
+				     page_array);
+
+	if (ret == nr)
+		return true;
+
+	nr_remaining = nr - ret;
+
+	spin_lock_irqsave(&pool->lock, flags);
+	/* Allocate page from the pool and add to the list or array */
+	while (pool->curr_nr && (nr_remaining > 0)) {
+		page = remove_element(pool);
+		spin_unlock_irqrestore(&pool->lock, flags);
+		smp_wmb();
+
+		kmemleak_update_trace((void *)page);
+
+		if (page_list)
+			list_add(&page->lru, page_list);
+		else
+			page_array[ret + i] = page;
+
+		i++;
+		nr_remaining--;
+
+		spin_lock_irqsave(&pool->lock, flags);
+	}
+
+	spin_unlock_irqrestore(&pool->lock, flags);
+
+	if (!nr_remaining)
+		return true;
+
+	/*
+	 * The bulk allocator counts in the populated pages for array,
+	 * but don't do it for list.
+	 */
+	if (page_list)
+		nr = nr_remaining;
+
+	/*
+	 * We use gfp mask w/o direct reclaim or IO for the first round.  If
+	 * alloc failed with that and @pool was empty, retry immediately.
+	 */
+	if (gfp_temp != gfp_mask) {
+		gfp_temp = gfp_mask;
+		goto repeat_alloc;
+	}
+
+	/* We must not sleep if !__GFP_DIRECT_RECLAIM */
+	if (!(gfp_mask & __GFP_DIRECT_RECLAIM))
+		return false;
+
+	/* Let's wait for someone else to return an element to @pool */
+	init_wait(&wait);
+	prepare_to_wait(&pool->wait, &wait, TASK_UNINTERRUPTIBLE);
+
+	/*
+	 * FIXME: this should be io_schedule().  The timeout is there as a
+	 * workaround for some DM problems in 2.6.18.
+	 */
+	io_schedule_timeout(5*HZ);
+
+	finish_wait(&pool->wait, &wait);
+	goto repeat_alloc;
+}
+
+bool mempool_alloc_pages_bulk_list(mempool_t *pool, gfp_t gfp_mask,
+				   unsigned int nr,
+				   struct list_head *page_list)
+{
+	return mempool_alloc_pages_bulk(pool, gfp_mask, nr, page_list, NULL);
+}
+EXPORT_SYMBOL(mempool_alloc_pages_bulk_list);
+
+bool mempool_alloc_pages_bulk_array(mempool_t *pool, gfp_t gfp_mask,
+				    unsigned int nr,
+				    struct page **page_array)
+{
+	return mempool_alloc_pages_bulk(pool, gfp_mask, nr, NULL, page_array);
+}
+EXPORT_SYMBOL(mempool_alloc_pages_bulk_array);
+
 /**
  * mempool_free - return an element to the pool.
  * @element:   pool element pointer.