[v6,0/6] Introduce multi-preference mempolicy

Message ID 1626077374-81682-1-git-send-email-feng.tang@intel.com (mailing list archive)

Message

Feng Tang July 12, 2021, 8:09 a.m. UTC
This patch series introduces the concept of the MPOL_PREFERRED_MANY mempolicy.
This mempolicy mode can be used with either the set_mempolicy(2) or mbind(2)
interfaces. Like the MPOL_PREFERRED interface, it allows an application to set a
preference for nodes which will fulfil memory allocation requests. Unlike the
MPOL_PREFERRED mode, it takes a set of nodes. Like the MPOL_BIND interface, it
works over a set of nodes. Unlike MPOL_BIND, it will not cause a SIGSEGV or
invoke the OOM killer if those preferred nodes are not available.

Along with these patches are patches for libnuma, numactl, numademo, and memhog.
They still need some polish, but can be found here:
https://gitlab.com/bwidawsk/numactl/-/tree/prefer-many
It allows new usage: `numactl -P 0,3,4`

The goal of the new mode is to enable some use-cases when using tiered memory
usage models which I've lovingly named.
1a. The Hare - The interconnect is fast enough to meet bandwidth and latency
requirements allowing preference to be given to all nodes with "fast" memory.
1b. The Indiscriminate Hare - An application knows it wants fast memory (or
perhaps slow memory), but doesn't care which node it runs on. The application
can prefer a set of nodes and then xpu bind to the local node (cpu, accelerator,
etc). This reverses how nodes are chosen today, where the kernel attempts to use
memory local to the CPU whenever possible. Here the attempt is to use the
accelerator local to the memory.
2. The Tortoise - The administrator (or the application itself) is aware it only
needs slow memory, and so can prefer that.

Much of this is almost achievable with the bind interface, but the bind
interface suffers from an inability to fall back to another set of nodes if
binding fails to all nodes in the nodemask.

Like MPOL_BIND, a nodemask is given. Inherently, this removes ordering from the
preference.

> /* Set first two nodes as preferred in an 8 node system. */
> const unsigned long nodes = 0x3;
> set_mempolicy(MPOL_PREFERRED_MANY, &nodes, 8);

> /* Mimic interleave policy, but have fallback. */
> const unsigned long nodes = 0xaa;
> set_mempolicy(MPOL_PREFERRED_MANY, &nodes, 8);
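
For anyone who wants to try the calls above without the patched libnuma, here
is a minimal sketch that goes through the raw syscall. The helper names
nodemask_from_ids() and prefer_many() are made up for illustration. The value 5
for MPOL_PREFERRED_MANY is an assumption taken from the position of the new
entry in the uapi enum, so check it against your kernel headers. The "+ 1" on
maxnode is also an assumption about how the kernel counts bits and should be
verified against set_mempolicy(2).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumed value of the new mode; older userspace headers do not define it. */
#ifndef MPOL_PREFERRED_MANY
#define MPOL_PREFERRED_MANY 5
#endif

/*
 * Hypothetical helper: build a single-word nodemask from a list of node IDs.
 * Returns 0 if any ID is negative or does not fit in one unsigned long.
 */
unsigned long nodemask_from_ids(const int *ids, size_t n)
{
	unsigned long mask = 0;

	assert(ids != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (ids[i] < 0 || (size_t)ids[i] >= 8 * sizeof(mask))
			return 0;
		mask |= 1UL << ids[i];
	}
	return mask;
}

/*
 * Hypothetical helper: ask the kernel to prefer the nodes in 'mask' for the
 * calling thread. Returns 0 on success, or the errno value on failure (for
 * example EINVAL on a kernel that does not know the mode).
 */
int prefer_many(unsigned long mask)
{
	/* Assumption: one more than the mask width, so every bit is considered. */
	unsigned long maxnode = 8 * sizeof(mask) + 1;

	if (syscall(SYS_set_mempolicy, (long)MPOL_PREFERRED_MANY, &mask,
		    maxnode) == 0)
		return 0;
	return errno;
}
```

With this, nodemask_from_ids() on {0, 1} gives the 0x3 mask from the first
example, and on {1, 3, 5, 7} gives the 0xaa mask from the second. The result
can be handed to prefer_many(), and a non-zero return tells the caller to fall
back to some other policy.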

Some internal discussion took place around the interface. There are two
alternatives which we have discussed, plus one I stuck in:
1. Ordered list of nodes. Currently it's believed that the added complexity is
   not needed for expected use cases.
2. A flag for bind to allow falling back to other nodes. This confuses the
   notion of binding and is less flexible than the current solution.
3. Create flags or new modes that help with some ordering. This offers both a
   friendlier API as well as a solution for more customized usage. It's unknown
   if it's worth the complexity to support this. Here is sample code for how
   this might work:

> // Prefer specific nodes for something wacky
> set_mempolicy(MPOL_PREFERRED_MANY, 0x17c, 1024);
>
> // Default
> set_mempolicy(MPOL_PREFERRED_MANY | MPOL_F_PREFER_ORDER_SOCKET, NULL, 0);
> // which is the same as
> set_mempolicy(MPOL_DEFAULT, NULL, 0);
>
> // The Hare
> set_mempolicy(MPOL_PREFERRED_MANY | MPOL_F_PREFER_ORDER_TYPE, NULL, 0);
>
> // The Tortoise
> set_mempolicy(MPOL_PREFERRED_MANY | MPOL_F_PREFER_ORDER_TYPE_REV, NULL, 0);
>
> // Prefer the fast memory of the first two sockets
> set_mempolicy(MPOL_PREFERRED_MANY | MPOL_F_PREFER_ORDER_TYPE, -1, 2);
>

In v1, Andi Kleen brought up reusing MPOL_PREFERRED as the mode for the API.
There wasn't consensus around this, so I've left the existing API as it was. I'm
open to more feedback here, but my slight preference is to use a new API as it
ensures if people are using it, they are entirely aware of what they're doing
and not accidentally misusing the old interface. (In a similar way to how
MPOL_LOCAL was introduced).

In v1, Michal also brought up renaming this MPOL_PREFERRED_MASK. I'm equally
fine with that change, but I hadn't heard much emphatic support for one way or
another, so I've left that too.

- Ben/Dave/Feng

---
Changelog: 

  Since v5:
  * Rebased against 5.14-rc1. 

  Since v4:
  * Rebased on latest -mm tree (v5.13-rc), whose mempolicy code has
    been heavily refactored since the v4 submission
  * add a dedicated alloc_page_preferred_many() (Michal Hocko)
  * refactor and add fix to hugetlb supporting code (Michal Hocko) 

  Since v3:
  * Rebased against v5.12-rc2
  * Drop the v3/0013 patch of creating NO_SLOWPATH gfp_mask bit
  * Skip direct reclaim for the first allocation try for
    MPOL_PREFERRED_MANY, which makes its semantics close to the
    existing MPOL_PREFERRED policy

  Since v2:
  * Rebased against v5.11
  * Fix a stack overflow related panic, and a kernel warning (Feng)
  * Some code cleanup (Feng)
  * One RFC patch to speed up memory allocation in some cases (Feng)

  Since v1:
  * Dropped patch to replace numa_node_id in some places (mhocko)
  * Dropped all the page allocation patches in favor of new mechanism to
    use fallbacks. (mhocko)
  * Dropped the special snowflake preferred node algorithm (bwidawsk)
  * If the preferred node fails, ALL nodes are rechecked instead of just
    the non-preferred nodes.


---

Ben Widawsky (3):
  mm/mempolicy: enable page allocation for MPOL_PREFERRED_MANY for
    general cases
  mm/hugetlb: add support for mempolicy MPOL_PREFERRED_MANY
  mm/mempolicy: Advertise new MPOL_PREFERRED_MANY

Dave Hansen (1):
  mm/mempolicy: Add MPOL_PREFERRED_MANY for multiple preferred nodes

Feng Tang (2):
  mm/memplicy: add page allocation function for MPOL_PREFERRED_MANY
    policy
  mm/mempolicy: unify the create() func for bind/interleave/prefer-many
    policies

 .../admin-guide/mm/numa_memory_policy.rst          | 16 +++--
 include/uapi/linux/mempolicy.h                     |  1 +
 mm/hugetlb.c                                       | 25 ++++++++
 mm/mempolicy.c                                     | 75 +++++++++++++++++-----
 4 files changed, 96 insertions(+), 21 deletions(-)

Comments

Andrew Morton July 15, 2021, 12:15 a.m. UTC | #1
On Mon, 12 Jul 2021 16:09:28 +0800 Feng Tang <feng.tang@intel.com> wrote:

> This patch series introduces the concept of the MPOL_PREFERRED_MANY mempolicy.
> This mempolicy mode can be used with either the set_mempolicy(2) or mbind(2)
> interfaces. Like the MPOL_PREFERRED interface, it allows an application to set a
> preference for nodes which will fulfil memory allocation requests. Unlike the
> MPOL_PREFERRED mode, it takes a set of nodes. Like the MPOL_BIND interface, it
> works over a set of nodes. Unlike MPOL_BIND, it will not cause a SIGSEGV or
> invoke the OOM killer if those preferred nodes are not available.

Do we have any real-world testing which demonstrates the benefits of
all of this?

Feng Tang July 15, 2021, 2:13 a.m. UTC | #2
Hi Andrew,

Thanks for reviewing!

On Wed, Jul 14, 2021 at 05:15:40PM -0700, Andrew Morton wrote:
> On Mon, 12 Jul 2021 16:09:28 +0800 Feng Tang <feng.tang@intel.com> wrote:
> 
> > This patch series introduces the concept of the MPOL_PREFERRED_MANY mempolicy.
> > This mempolicy mode can be used with either the set_mempolicy(2) or mbind(2)
> > interfaces. Like the MPOL_PREFERRED interface, it allows an application to set a
> > preference for nodes which will fulfil memory allocation requests. Unlike the
> > MPOL_PREFERRED mode, it takes a set of nodes. Like the MPOL_BIND interface, it
> > works over a set of nodes. Unlike MPOL_BIND, it will not cause a SIGSEGV or
> > invoke the OOM killer if those preferred nodes are not available.
> 
> Do we have any real-world testing which demonstrates the benefits of
> all of this?

We have done some internal tests, and are actively working with an external
customer on using this new 'prefer-many' policy. They have different types of
memory (fast DRAM and slower persistent memory) in their systems, and their
program wants to set a clear preference for several NUMA nodes, to better
deploy the huge application data before running the application.

We also hit another issue, where a customer wanted to run a docker container
while binding it to 2 persistent memory nodes, which always failed. At that
time we tried 2 hack patches to solve it:
https://lore.kernel.org/lkml/1604470210-124827-2-git-send-email-feng.tang@intel.com/
https://lore.kernel.org/lkml/1604470210-124827-3-git-send-email-feng.tang@intel.com/
And that use case can be easily achieved with this new policy.

Thanks,
Feng

Dave Hansen July 15, 2021, 6:49 p.m. UTC | #3
On 7/14/21 5:15 PM, Andrew Morton wrote:
> On Mon, 12 Jul 2021 16:09:28 +0800 Feng Tang <feng.tang@intel.com> wrote:
>> This patch series introduces the concept of the MPOL_PREFERRED_MANY mempolicy.
>> This mempolicy mode can be used with either the set_mempolicy(2) or mbind(2)
>> interfaces. Like the MPOL_PREFERRED interface, it allows an application to set a
>> preference for nodes which will fulfil memory allocation requests. Unlike the
>> MPOL_PREFERRED mode, it takes a set of nodes. Like the MPOL_BIND interface, it
>> works over a set of nodes. Unlike MPOL_BIND, it will not cause a SIGSEGV or
>> invoke the OOM killer if those preferred nodes are not available.
> Do we have any real-world testing which demonstrates the benefits of
> all of this?

Yes, it's actually been quite useful in practice already.

If we take persistent memory media (PMEM) and hot-add/online it with the
DAX kmem driver, we get NUMA nodes with lots of capacity (~6TB is
typical) but weird performance; PMEM has good read speed, but low write
speed.

That low write speed is *so* low that it dominates the performance more
than the distance from the CPUs.  Folks who want PMEM really don't care
about locality.  The discussions with the testers usually go something
like this:

Tester: How do I make my test use PMEM on nodes 2 and 3?
Kernel Guys: use 'numactl --membind=2-3'
Tester: I tried that, but I'm getting allocation failures once I fill up
        PMEM.  Shouldn't it fall back to DRAM?
Kernel Guys: Fine, use 'numactl --preferred=2-3'
Tester: That worked, but it started using DRAM after it exhausted node 2
Kernel Guys:  Dang it.  I forgot --preferred ignores everything after
              the first node.  Fine, we'll patch the kernel.

This has happened more than once.  End users want to be able to specify
a specific physical media, but don't want to have to deal with the sharp
edges of strict binding.

This has happened both with slow media like PMEM and "faster" media like
High-Bandwidth Memory.
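
The tester dialogue above can be sketched as a small userspace toy model. This
is not the kernel's allocator, only an illustration under simplifying
assumptions: four nodes, a per-node free page counter, and a fallback that
picks the lowest-numbered node with free pages (the real fallback order depends
on the zonelist). All names here (alloc_page_model(), take_page(), the POL_*
values) are made up for the example.

```c
#include <assert.h>
#include <stddef.h>

#define NR_NODES 4

enum policy { POL_BIND, POL_PREFERRED, POL_PREFERRED_MANY };

/*
 * Take one page from the lowest-numbered node in 'mask' that still has free
 * pages. Returns the node ID, or -1 if every node in 'mask' is exhausted.
 */
int take_page(unsigned long free_pages[NR_NODES], unsigned long mask)
{
	for (int node = 0; node < NR_NODES; node++) {
		if ((mask & (1UL << node)) && free_pages[node] > 0) {
			free_pages[node]--;
			return node;
		}
	}
	return -1;
}

/*
 * Toy model of one page allocation under each policy:
 *  - POL_BIND:           only nodes in 'mask', no fallback.
 *  - POL_PREFERRED:      only the first node in 'mask' is honoured, then
 *                        fall back to all nodes.
 *  - POL_PREFERRED_MANY: every node in 'mask' is tried, then fall back to
 *                        all nodes.
 */
int alloc_page_model(enum policy pol, unsigned long mask,
		     unsigned long free_pages[NR_NODES])
{
	const unsigned long all_nodes = (1UL << NR_NODES) - 1;
	unsigned long first_try = mask;
	int node;

	assert((mask & ~all_nodes) == 0);

	if (pol == POL_PREFERRED)
		first_try = mask & (~mask + 1);	/* keep only the lowest set bit */

	node = take_page(free_pages, first_try);
	if (node >= 0 || pol == POL_BIND)
		return node;

	/* Second pass: all nodes are rechecked, preferred ones included. */
	return take_page(free_pages, all_nodes);
}
```

With DRAM on nodes 0-1, PMEM on nodes 2-3 and a mask of 0xc, the model
reproduces the three outcomes in the dialogue. POL_BIND fails once both PMEM
nodes are empty. POL_PREFERRED spills to DRAM as soon as node 2 is empty, even
though node 3 still has pages. POL_PREFERRED_MANY drains nodes 2 and 3 before
touching DRAM.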