From patchwork Mon Dec 3 23:34:57 2018
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: Jerome Glisse
X-Patchwork-Id: 10710891
From: jglisse@redhat.com
To: linux-mm@kvack.org
Cc: Andrew Morton, linux-kernel@vger.kernel.org, Jérôme Glisse,
    "Rafael J. Wysocki", Ross Zwisler, Dan Williams, Dave Hansen,
    Haggai Eran, Balbir Singh, "Aneesh Kumar K.V",
    Benjamin Herrenschmidt, Felix Kuehling, Philip Yang,
    Christian König, Paul Blinzer, Logan Gunthorpe, John Hubbard,
    Ralph Campbell, Michal Hocko, Jonathan Cameron, Mark Hairgrove,
    Vivek Kini, Mel Gorman, Dave Airlie, Ben Skeggs, Andrea Arcangeli
Subject: [RFC PATCH 02/14] mm/hms: heterogeneous memory system (HMS) documentation
Date: Mon, 3 Dec 2018 18:34:57 -0500
Message-Id: <20181203233509.20671-3-jglisse@redhat.com>
In-Reply-To: <20181203233509.20671-1-jglisse@redhat.com>
References: <20181203233509.20671-1-jglisse@redhat.com>
MIME-Version: 1.0

From: Jérôme Glisse

Add documentation describing what HMS is and what it is for (see the
patch content).

Signed-off-by: Jérôme Glisse
Cc: Rafael J. Wysocki
Cc: Ross Zwisler
Cc: Dan Williams
Cc: Dave Hansen
Cc: Haggai Eran
Cc: Balbir Singh
Cc: Aneesh Kumar K.V
Cc: Benjamin Herrenschmidt
Cc: Felix Kuehling
Cc: Philip Yang
Cc: Christian König
Cc: Paul Blinzer
Cc: Logan Gunthorpe
Cc: John Hubbard
Cc: Ralph Campbell
Cc: Michal Hocko
Cc: Jonathan Cameron
Cc: Mark Hairgrove
Cc: Vivek Kini
Cc: Mel Gorman
Cc: Dave Airlie
Cc: Ben Skeggs
Cc: Andrea Arcangeli
---
 Documentation/vm/hms.rst | 275 ++++++++++++++++++++++++++++++++++-----
 1 file changed, 246 insertions(+), 29 deletions(-)

diff --git a/Documentation/vm/hms.rst b/Documentation/vm/hms.rst
index dbf0f71918a9..bd7c9e8e7077 100644
--- a/Documentation/vm/hms.rst
+++ b/Documentation/vm/hms.rst
@@ -4,32 +4,249 @@
 Heterogeneous Memory System (HMS)
 =================================
 
-System with complex memory topology needs a more versatile memory topology
-description than just node where a node is a collection of memory and CPU.
-In heterogeneous memory system we consider four types of object::
- - target: which is any kind of memory
- - initiator: any kind of device or CPU
- - inter-connect: any kind of links that connects target and initiator
- - bridge: a link between two inter-connects
-
-Properties (like bandwidth, latency, bus width, ...) are define per bridge
-and per inter-connect. Property of an inter-connect apply to all initiators
-which are link to that inter-connect. Not all initiators are link to all
-inter-connect and thus not all initiators can access all memory (this apply
-to CPU too ie some CPU might not be able to access all memory).
-
-Bridges allow initiators (that can use the bridge) to access target for
-which they do not have a direct link with (ie they do not share a common
-inter-connect with the target).
-
-Through this four types of object we can describe any kind of system memory
-topology. To expose this to userspace we expose a new sysfs hierarchy (that
-co-exist with the existing one)::
- - /sys/bus/hms/target* all targets in the system
- - /sys/bus/hms/initiator* all initiators in the system
- - /sys/bus/hms/interconnect* all inter-connects in the system
- - /sys/bus/hms/bridge* all bridges in the system
-
-Inside each bridge or inter-connect directory they are symlinks to targets
-and initiators that are linked to that bridge or inter-connect. Properties
-are defined inside bridge and inter-connect directory.
+Heterogeneous memory systems are becoming the norm. In those systems
+there is not only the main system memory for each node, but also device
+memory and/or a memory hierarchy to consider. Device memory can come
+from a device like a GPU or FPGA, or from a memory-only device
+(persistent memory, or a high density memory device).
+
+A memory hierarchy means you not only have the main memory but also
+other types of memory, like HBM (High Bandwidth Memory, often stacked
+on the CPU or GPU die), persistent memory, or high density memory (ie
+something slower than a regular DDR DIMM but much bigger).
+
+On top of this diversity of memories you also have to account for the
+system bus topology, ie how all CPUs and devices are connected to each
+other. Userspace does not care about the exact physical topology but
+cares about the topology from a behavior point of view, ie what all the
+paths are between an initiator (anything that can initiate memory
+access, like a CPU, GPU, FPGA, network controller, ...) and a target
+memory, and what the properties of each of those paths are (bandwidth,
+latency, granularity, ...).
+
+This means that it is no longer sufficient to consider a flat view for
+each node in a system: for maximum performance we need to account not
+only for all of this new memory but also for the system topology. This
+is why this proposal is unlike the HMAT proposal [1], which tries to
+extend the existing NUMA model for new types of memory. Here we are
+tackling a much more profound change that departs from NUMA.
+
+
+One of the reasons for a radical change is the advance of accelerators
+like GPUs or FPGAs, which means that the CPU is no longer the only
+place where computation happens. It is becoming more and more common
+for an application to use a mix of different accelerators to perform
+its computation. So we can no longer satisfy ourselves with a CPU
+centric and flat view of a system like NUMA and NUMA distance.
+
+
+HMS tackles these problems through three aspects:
+    1 - Expose complex system topology and the various kinds of memory
+        to user space so that applications have a standard way and a
+        single place to get all the information they care about.
+    2 - A new API for user space to bind, or provide hints to the
+        kernel on, which memory to use for a range of virtual addresses
+        (a new hbind() syscall).
+    3 - Kernel side changes to the vm policy code to handle these
+        changes.
+
+
+The rest of this document is split into 3 sections. The first section
+talks about complex system topology: what it is, how it is used today
+and how to describe it tomorrow. The second section talks about the new
+API to bind or provide hints to the kernel for a range of virtual
+addresses. The third section talks about the new mechanism to track
+bindings/hints provided by user space or device drivers inside the
+kernel.
+
+
+1) Complex system topology and how to represent it
+===================================================
+
+Inside a node you can have a complex topology of memory. For instance
+you can have multiple HBM memories in a node, each HBM tied to a set of
+CPUs (all of which are in the same node). This means that you have a
+hierarchy of memory for the CPUs: the local fast HBM, which is expected
+to be relatively small compared to main memory, and then the main
+memory. New memory technologies might also deepen this hierarchy with
+another level of yet slower but gigantic memory (some persistent memory
+technologies might fall into that category). Another example is device
+memory, and devices themselves can have a hierarchy like HBM on top of
+the device cores plus main device memory.
+
+On top of that you can have multiple paths to access each memory and
+each path can have different properties (latency, bandwidth, ...).
+Also there is not always symmetry, ie some memory might only be
+accessible by some devices or CPUs, ie not accessible by everyone.
+
+So a flat hierarchy for each node is not capable of representing this
+kind of complexity. To simplify the discussion, and because we do not
+want to single out CPUs from devices, from here on out we will use
+"initiator" to refer to either a CPU or a device. An initiator is any
+kind of CPU or device that can access memory (ie initiate memory
+access).
+
+At this point an example of such a system might help:
+    - 2 nodes and for each node:
+        - 1 CPU per node with 2 complexes of CPU cores per CPU
+        - one HBM memory for each complex of CPU cores (200GB/s)
+        - CPU core complexes are linked to each other (100GB/s)
+        - main memory is (90GB/s)
+        - 4 GPUs, each with:
+            - HBM memory for each GPU (1000GB/s) (not CPU accessible)
+            - GDDR memory for each GPU (500GB/s) (CPU accessible)
+            - connected to the CPU root controller (60GB/s)
+            - connected to other GPUs (even GPUs from the second
+              node) with a GPU link (400GB/s)
+
+In this example we restrict ourselves to bandwidth and ignore bus width
+and latency; this is just to simplify the discussion but obviously they
+also factor in.
+
+
+Userspace very much would like to know this information. For instance
+HPC folks have developed complex libraries to manage this and there is
+wide research on the topic [2] [3] [4] [5]. Today most of the work is
+done by hardcoding things for a specific platform, which is somewhat
+acceptable for HPC folks where the platform stays the same for a long
+period of time.
+
+Roughly speaking I see two broad use cases for topology information.
+First is virtualization, where you want to segment your hardware
+properly for each vm (binding together memory, CPUs and GPUs that are
+all close to each other). Second is applications, many of which can
+partition their workload to minimize exchanges between partitions,
+allowing each partition to be bound to a subset of devices and CPUs
+that are close to each other (for maximum locality). Here it is much
+more than just NUMA distance: you can leverage the memory hierarchy and
+the system topology all together (see [2] [3] [4] [5] for more
+references and details).
+
+So this is not exposing topology just for the sake of cool graphs in
+userspace. There are active users of such information today, and if we
+want to grow and broaden the usage we should provide a unified API to
+standardize how that information is accessible to everyone.
+
+
+One proposal so far to handle new types of memory is to use CPU-less
+nodes for them [6]. While the same idea can apply to device memory, it
+is still hard to describe multiple paths with different properties in
+such a scheme. While it is backward compatible and needs minimal
+changes, it simply can not convey complex topologies (think any kind of
+random graph, not just a tree-like graph).
+
+So HMS uses a new way to expose the system topology to userspace. It
+relies on 4 types of objects:
+    - target: any kind of memory (main memory, HBM, device, ...)
+    - initiator: CPU or device (anything that can access memory)
+    - link: anything that links initiators and targets
+    - bridge: anything that allows a group of initiators to access
+      remote targets (ie targets they are not connected to directly
+      through a link)
+
+Properties like bandwidth, latency, ... are all set per bridge and per
+link. All initiators connected to a link can access any target memory
+also connected to the same link, all with the same link properties.
+
+A link does not need to match physical hardware, ie a single physical
+link can be exposed as one or multiple software links. This allows to
+model devices connected to the same physical link (like PCIE for
+instance) but not with the same characteristics (like the number of
+lanes or the lane speed in PCIE). The reverse is also true, ie a single
+software exposed link can match multiple physical links. A rough sketch
+of these four object types is given below.
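+
+The sketch below is illustrative only: it is not part of the proposed
+kernel API or sysfs layout, and every name and field in it is made up
+for the example::
+
+    /* Illustrative data model only; not the proposed kernel API. */
+    #include <stddef.h>
+    #include <stdint.h>
+
+    struct hms_target    { uint32_t uid; uint64_t size; };  /* memory */
+    struct hms_initiator { uint32_t uid; };           /* CPU or device */
+
+    /* Every initiator on a link reaches every target on that link,
+     * all with the same properties. */
+    struct hms_link {
+        uint32_t  uid;
+        uint64_t  bandwidth, latency;       /* per-link properties */
+        uint32_t *initiators, *targets;     /* uids connected to it */
+        size_t    ninitiators, ntargets;
+    };
+
+    /* A bridge joins two links, but only for a subset of initiators. */
+    struct hms_bridge {
+        uint32_t  uid;
+        uint64_t  bandwidth, latency;
+        uint32_t  link_a, link_b;          /* the two links it connects */
+        uint32_t *initiators;              /* who may cross the bridge */
+        size_t    ninitiators;
+    };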
+
+Bridges allow initiators to reach remote links. A bridge connects two
+links to each other and is also specific to a list of initiators (ie
+not every initiator connected to either link can use the bridge).
+Bridges have their own properties (bandwidth, latency, ...) so that the
+actual value of each property for a path is the lowest common
+denominator between the bridge and each of the links.
+
+
+This model allows us to describe any kind of directed graph and thus
+any kind of topology we might see in the future. It also makes it
+easier to add new properties to each object type.
+
+Moreover it can be used to expose devices capable of peer to peer
+between themselves. For that, simply have all devices capable of peer
+to peer share a common link, or use the bridge object if the peer to
+peer capability is only one way, for instance.
+
+
+HMS uses the above scheme to expose the system topology through sysfs
+under /sys/bus/hms/ with:
+    - /sys/bus/hms/devices/v%version-%id-target/ : a target memory;
+      each has a UID and you find the usual values in that folder
+      (node id, size, ...)
+
+    - /sys/bus/hms/devices/v%version-%id-initiator/ : an initiator
+      (CPU or device); each has an HMS UID but also a CPU id for CPUs
+      (which matches the CPU id in /sys/bus/cpu/). For a device you
+      have a path that can be the PCIE BUS ID for instance.
+
+    - /sys/bus/hms/devices/v%version-%id-link : a link; each has a
+      UID and a file per property (bandwidth, latency, ...) and you
+      also find a symlink to every target and initiator connected to
+      that link.
+
+    - /sys/bus/hms/devices/v%version-%id-bridge : a bridge; each has
+      a UID and a file per property (bandwidth, latency, ...) and you
+      also find a symlink to all initiators that can use that bridge.
+
+To help with forward compatibility each object has a version value and
+it is mandatory for user space to only use targets or initiators whose
+version it supports. For instance if user space only knows what version
+1 means and sees a target with version 2, then it must ignore that
+target as if it did not exist.
+
+Mandating that allows the addition of new properties that break
+backward compatibility, ie user space must know how a new property
+affects the object to be able to use it safely.
+
+Main memory of each node is exposed under a common target. For now
+device drivers are responsible for registering the memory they want to
+expose through that scheme, but in the future that information might
+come from the system firmware (this is a different discussion).
+
+
+
+2) hbind(): bind a range of virtual addresses to heterogeneous memory
+======================================================================
+
+Instead of using a bitmap (like the nodemask of mbind()), hbind() takes
+an array of uids, where each uid is a unique memory target inside the
+new memory topology description. User space also provides an array of
+modifiers. A modifier can be seen as the flags parameter of mbind(),
+but here we use an array so that user space can supply not only a
+modifier but also a value with it. This should allow the API to grow
+more features in the future. The kernel should return -EINVAL if it is
+given an unknown modifier and just ignore the call altogether, forcing
+user space to restrict itself to the modifiers supported by the kernel
+it is running on (I know I am dreaming about well behaved user space).
+A sketch of what such a call could look like is given below.
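+
+The example below is hypothetical: the hbind() syscall does not exist
+yet, and the prototype, modifier encoding and wrapper shown here are
+assumptions made to match the description above, not code taken from
+the actual patches::
+
+    /* Hypothetical userspace sketch; not the real API. */
+    #include <stddef.h>
+    #include <stdint.h>
+
+    struct hbind_modifier {
+        uint32_t opcode;   /* what kind of hint/binding is requested */
+        uint64_t value;    /* modifier-specific value */
+    };
+
+    /* Assumed wrapper around the (not yet existing) syscall. */
+    int hbind(void *addr, size_t len,
+              const uint32_t *targets, size_t ntargets,
+              const struct hbind_modifier *mods, size_t nmods);
+
+    /* Ask that a buffer be placed on one memory target, identified by
+     * an HMS UID read from sysfs. */
+    int bind_to_target(void *buf, size_t len, uint32_t target_uid)
+    {
+        struct hbind_modifier mod = { .opcode = 0, .value = 0 };
+        /* The kernel is expected to reject unknown modifiers with
+         * -EINVAL and leave the range untouched. */
+        return hbind(buf, len, &target_uid, 1, &mod, 1);
+    }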
+
+
+Note that none of this is exclusive of automatic memory placement like
+autonuma. I also believe that we will see something similar to
+autonuma for device memory.
+
+
+3) Tracking and applying heterogeneous memory policies
+======================================================
+
+The current memory policy infrastructure is node oriented. Instead of
+changing that, and risking breakage and regressions, HMS adds a new
+heterogeneous policy tracking infrastructure. The expectation is that
+existing applications can keep using mbind() with all the existing
+infrastructure undisturbed and unaffected, while new applications will
+use the new API and should avoid mixing and matching both (as they can
+achieve the same thing with the new API).
+
+Also, the policy is not directly tied to the vma structure, for a few
+reasons:
+    - avoid having to split vmas for a policy that does not cover a
+      full vma
+    - avoid changing too much vma code
+    - avoid growing the vma structure with an extra pointer
+
+The overall design is simple: on an hbind() call a hms policy structure
+is created for the supplied range, and hms uses the callback associated
+with the target memory. This callback is provided by the device driver
+for device memory, or by core HMS for regular main memory. The callback
+can decide to migrate the range to the target memories or do nothing
+(this can be influenced by the flags provided to hbind() too). A rough
+sketch of such a policy record is shown below.
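+
+For illustration only, a policy record along the lines described above
+could look roughly like this. None of these names or fields come from
+the actual patches; they are assumptions made to show the shape of the
+design (a range, the requested targets, and a per-target callback)::
+
+    /* Sketch of the tracking structure; not the real kernel code. */
+    #include <stddef.h>
+    #include <stdint.h>
+
+    struct hms_policy {
+        uintptr_t        start, end;  /* virtual address range */
+        const uint32_t  *targets;     /* target memory uids */
+        size_t           ntargets;
+        uint64_t         flags;       /* derived from hbind() modifiers */
+        /* Provided by the device driver (device memory) or by core HMS
+         * (main memory); may migrate the range or do nothing. */
+        int            (*apply)(struct hms_policy *pol);
+    };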