Message ID: 20230403052233.1880567-1-ankur.a.arora@oracle.com
Series:     x86/clear_huge_page: multi-page clearing
On 4/3/2023 10:52 AM, Ankur Arora wrote:
> This series introduces multi-page clearing for hugepages.
>
> This is a follow up of some of the ideas discussed at:
> https://lore.kernel.org/lkml/CAHk-=wj9En-BC4t7J9xFZOws5ShwaR9yor7FxHZr8CTVyEP_+Q@mail.gmail.com/
>
> On x86, page clearing is typically done via string instructions. These,
> unlike a MOV loop, allow us to explicitly advertise the region-size to
> the processor, which could serve as a hint to current (and/or future)
> uarchs to elide cacheline allocation.
>
> In current generation processors, Milan (and presumably other Zen
> variants) uses the hint to elide cacheline allocation (for
> region-size > LLC-size.)
>
> An additional reason for doing this is that string instructions are
> typically microcoded, and clearing in bigger chunks than the current
> page-at-a-time logic amortizes some of that cost.
>
> All uarchs tested (Milan, Icelakex, Skylakex) showed improved performance.
>
> There are, however, some problems:
>
> 1. Extended zeroing periods mean increased latency due to the now
>    missing preemption points.
>
>    That's handled in patches 7, 8, 9:
>      "sched: define TIF_ALLOW_RESCHED"
>      "irqentry: define irqentry_exit_allow_resched()"
>      "x86/clear_huge_page: make clear_contig_region() preemptible"
>    by the context marking itself reschedulable, and rescheduling in
>    irqexit context if needed (for PREEMPTION_NONE/_VOLUNTARY.)
>
> 2. The current page-at-a-time clearing logic does left-right narrowing
>    towards the faulting page, which maintains cache locality for
>    workloads that have a sequential access pattern. Clearing in large
>    chunks loses that.
>
>    Some (but not all) of that could be ameliorated by something like
>    this patch:
>    https://lore.kernel.org/lkml/20220606203725.1313715-1-ankur.a.arora@oracle.com/
>
>    But, before doing that, I'd like some comments on whether that is
>    worth doing for this specific use case.
>
> Rest of the series:
>
> Patches 1, 2, 3:
>   "huge_pages: get rid of process_huge_page()"
>   "huge_page: get rid of {clear,copy}_subpage()"
>   "huge_page: allow arch override for clear/copy_huge_page()"
> are mechanical and simplify some of the current clear_huge_page() logic.
>
> Patches 4, 5:
>   "x86/clear_page: parameterize clear_page*() to specify length"
>   "x86/clear_pages: add clear_pages()"
> add clear_pages() and helpers.
>
> Patch 6: "mm/clear_huge_page: use multi-page clearing" adds the
> chunked x86 clear_huge_page() implementation.
>
>
> Performance
> ==
>
> Demand fault performance gets a decent boost:
>
>  *Icelakex*   mm/clear_huge_page   x86/clear_huge_page   change
>                     (GB/s)               (GB/s)
>
>  pg-sz=2MB           8.76                11.82           +34.93%
>  pg-sz=1GB           8.99                12.18           +35.48%
>
>
>  *Milan*      mm/clear_huge_page   x86/clear_huge_page   change
>                     (GB/s)               (GB/s)
>
>  pg-sz=2MB          12.24                17.54           +43.30%
>  pg-sz=1GB          17.98                37.24          +107.11%
>
>
> vm-scalability/case-anon-w-seq-hugetlb gains in stime but performs
> worse when user space tries to touch those pages:
>
>  *Icelakex*                 mm/clear_huge_page   x86/clear_huge_page   change
>  (mem=4GB/task, tasks=128)
>
>  stime                        293.02 +- .49%       239.39 +- .83%     -18.30%
>  utime                        440.11 +- .28%       508.74 +- .60%     +15.59%
>  wall-clock                     5.96 +- .33%         6.27 +-2.23%     + 5.20%
>
>
>  *Milan*                    mm/clear_huge_page   x86/clear_huge_page   change
>  (mem=1GB/task, tasks=512)
>
>  stime                        490.95 +- 3.55%      466.90 +- 4.79%    - 4.89%
>  utime                        276.43 +- 2.85%      311.97 +- 5.15%    +12.85%
>  wall-clock                     3.74 +- 6.41%        3.58 +- 7.82%    - 4.27%
>
> Also at:
>   github.com/terminus/linux clear-pages.v1
>
> Comments appreciated!
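To make the above concrete: the primitive the series adds in patches 4 and 5
is, at its core, a length-parameterized string-instruction clear, so the
processor sees the full extent of the region up front instead of one 4K page
at a time. A minimal sketch, assuming a REP STOSB-based implementation
(illustrative only; this is not the clear_pages() code from the series, and
the preemption-latency handling lives separately in the TIF_ALLOW_RESCHED
patches):

#include <linux/mm.h>	/* PAGE_SIZE */

/*
 * Sketch of a multi-page clearing primitive: a single REP STOSB over a
 * physically contiguous extent, rather than a page-at-a-time loop, so
 * the region size is visible to the CPU when the clear starts.
 */
static inline void clear_pages_sketch(void *addr, unsigned long npages)
{
	unsigned long len = npages * PAGE_SIZE;

	asm volatile("rep stosb"	/* store AL (0) to [RDI], RCX times */
		     : "+D" (addr), "+c" (len)
		     : "a" (0)
		     : "memory");
}

On parts that honour the size hint (Milan, per the cover letter), a region
larger than the LLC can then be cleared without allocating cachelines for it.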
Hello Ankur,

Was able to test your patches. To summarize, am seeing 2x-3x perf
improvement for 2M, 1GB base hugepage sizes.

SUT: Genoa AMD EPYC
Thread(s) per core:  2
Core(s) per socket:  128
Socket(s):           2

NUMA:
NUMA node(s):        2
NUMA node0 CPU(s):   0-127,256-383
NUMA node1 CPU(s):   128-255,384-511

Test: Use mmap(MAP_HUGETLB) to demand a fault on 64GB region (NUMA node0),
for both base-hugepage-size=2M and 1GB

perf stat -r 10 -d -d numactl -m 0 -N 0 <test>

time in seconds elapsed (average of 10 runs) (lower = better)

Result:
page-size   mm/clear_huge_page   x86/clear_huge_page   change %
2M               5.4567                2.6774            -50.93
1G               2.64452               1.011281          -61.76

Full perf stat info:

page size = 2M, mm/clear_huge_page

 Performance counter stats for 'numactl -m 0 -N 0 map_hugetlb_2M' (10 runs):

          5,434.71 msec task-clock                #  0.996 CPUs utilized              ( +-  0.55% )
                 8      context-switches          #  1.466 /sec                       ( +-  4.66% )
                 0      cpu-migrations            #  0.000 /sec
            32,918      page-faults               #  6.034 K/sec                      ( +-  0.00% )
    16,977,242,482      cycles                    #  3.112 GHz                        ( +-  0.04% )  (35.70%)
         1,961,724      stalled-cycles-frontend   #  0.01% frontend cycles idle       ( +-  1.09% )  (35.72%)
        35,685,674      stalled-cycles-backend    #  0.21% backend cycles idle        ( +-  3.48% )  (35.74%)
     1,038,327,182      instructions              #  0.06 insn per cycle
                                                  #  0.04 stalled cycles per insn     ( +-  0.38% )  (35.75%)
       221,409,216      branches                  #  40.584 M/sec                     ( +-  0.36% )  (35.75%)
           350,730      branch-misses             #  0.16% of all branches            ( +-  1.18% )  (35.75%)
     2,520,888,779      L1-dcache-loads           #  462.077 M/sec                    ( +-  0.03% )  (35.73%)
     1,094,178,209      L1-dcache-load-misses     #  43.46% of all L1-dcache accesses ( +-  0.02% )  (35.71%)
        67,751,730      L1-icache-loads           #  12.419 M/sec                     ( +-  0.11% )  (35.70%)
           271,118      L1-icache-load-misses     #  0.40% of all L1-icache accesses  ( +-  2.55% )  (35.70%)
           506,635      dTLB-loads                #  92.866 K/sec                     ( +-  3.31% )  (35.70%)
           237,385      dTLB-load-misses          #  43.64% of all dTLB cache accesses ( +-  7.00% )  (35.69%)
               268      iTLB-load-misses          #  6700.00% of all iTLB cache accesses ( +- 13.86% )  (35.70%)

            5.4567 +- 0.0300 seconds time elapsed  ( +-  0.55% )

page size = 2M, x86/clear_huge_page

 Performance counter stats for 'numactl -m 0 -N 0 map_hugetlb_2M' (10 runs):

          2,780.69 msec task-clock                #  1.039 CPUs utilized              ( +-  1.03% )
                 3      context-switches          #  1.121 /sec                       ( +- 21.34% )
                 0      cpu-migrations            #  0.000 /sec
            32,918      page-faults               #  12.301 K/sec                     ( +-  0.00% )
     8,143,619,771      cycles                    #  3.043 GHz                        ( +-  0.25% )  (35.62%)
         2,024,872      stalled-cycles-frontend   #  0.02% frontend cycles idle       ( +-320.93% )  (35.66%)
       717,198,728      stalled-cycles-backend    #  8.82% backend cycles idle        ( +-  8.26% )  (35.69%)
       606,549,334      instructions              #  0.07 insn per cycle
                                                  #  1.39 stalled cycles per insn     ( +-  0.23% )  (35.73%)
       108,856,550      branches                  #  40.677 M/sec                     ( +-  0.24% )  (35.76%)
           202,490      branch-misses             #  0.18% of all branches            ( +-  3.58% )  (35.78%)
     2,348,818,806      L1-dcache-loads           #  877.701 M/sec                    ( +-  0.03% )  (35.78%)
     1,081,562,988      L1-dcache-load-misses     #  46.04% of all L1-dcache accesses ( +-  0.01% )  (35.78%)
   <not supported>      LLC-loads
   <not supported>      LLC-load-misses
        43,411,167      L1-icache-loads           #  16.222 M/sec                     ( +-  0.19% )  (35.77%)
           273,042      L1-icache-load-misses     #  0.64% of all L1-icache accesses  ( +-  4.94% )  (35.76%)
           834,482      dTLB-loads                #  311.827 K/sec                    ( +-  9.73% )  (35.72%)
           437,343      dTLB-load-misses          #  65.86% of all dTLB cache accesses ( +-  8.56% )  (35.68%)
                 0      iTLB-loads                #  0.000 /sec                       (35.65%)
               160      iTLB-load-misses          #  1777.78% of all iTLB cache accesses ( +- 15.82% )  (35.62%)

            2.6774 +- 0.0287 seconds time elapsed  ( +-  1.07% )

page size = 1G, mm/clear_huge_page

 Performance counter stats for 'numactl -m 0 -N 0 map_hugetlb_1G' (10 runs):

          2,625.24 msec task-clock                #  0.993 CPUs utilized              ( +-  0.23% )
                 4      context-switches          #  1.513 /sec                       ( +-  4.49% )
                 1      cpu-migrations            #  0.378 /sec
               214      page-faults               #  80.965 /sec                      ( +-  0.13% )
     8,178,624,349      cycles                    #  3.094 GHz                        ( +-  0.23% )  (35.65%)
         2,942,576      stalled-cycles-frontend   #  0.04% frontend cycles idle       ( +- 75.22% )  (35.69%)
         7,117,425      stalled-cycles-backend    #  0.09% backend cycles idle        ( +-  3.79% )  (35.73%)
       454,521,647      instructions              #  0.06 insn per cycle
                                                  #  0.02 stalled cycles per insn     ( +-  0.10% )  (35.77%)
       113,223,853      branches                  #  42.837 M/sec                     ( +-  0.08% )  (35.80%)
            84,766      branch-misses             #  0.07% of all branches            ( +-  5.37% )  (35.80%)
     2,294,528,890      L1-dcache-loads           #  868.111 M/sec                    ( +-  0.02% )  (35.81%)
     1,075,907,551      L1-dcache-load-misses     #  46.88% of all L1-dcache accesses ( +-  0.02% )  (35.78%)
        26,167,323      L1-icache-loads           #  9.900 M/sec                      ( +-  0.24% )  (35.74%)
           139,675      L1-icache-load-misses     #  0.54% of all L1-icache accesses  ( +-  0.37% )  (35.70%)
             3,459      dTLB-loads                #  1.309 K/sec                      ( +- 12.75% )  (35.67%)
               732      dTLB-load-misses          #  19.71% of all dTLB cache accesses ( +- 26.61% )  (35.62%)
                11      iTLB-load-misses          #  192.98% of all iTLB cache accesses ( +-238.28% )  (35.62%)

           2.64452 +- 0.00600 seconds time elapsed  ( +-  0.23% )

page size = 1G, x86/clear_huge_page

 Performance counter stats for 'numactl -m 0 -N 0 map_hugetlb_1G' (10 runs):

          1,009.09 msec task-clock                #  0.998 CPUs utilized              ( +-  0.06% )
                 2      context-switches          #  1.980 /sec                       ( +- 23.63% )
                 1      cpu-migrations            #  0.990 /sec
               214      page-faults               #  211.887 /sec                     ( +-  0.16% )
     3,154,980,463      cycles                    #  3.124 GHz                        ( +-  0.06% )  (35.77%)
           145,051      stalled-cycles-frontend   #  0.00% frontend cycles idle       ( +-  6.26% )  (35.78%)
       730,087,143      stalled-cycles-backend    #  23.12% backend cycles idle       ( +-  9.75% )  (35.78%)
        45,813,391      instructions              #  0.01 insn per cycle
                                                  #  18.51 stalled cycles per insn    ( +-  1.00% )  (35.78%)
         8,498,282      branches                  #  8.414 M/sec                      ( +-  1.54% )  (35.78%)
            63,351      branch-misses             #  0.74% of all branches            ( +-  6.70% )  (35.69%)
        29,135,863      L1-dcache-loads           #  28.848 M/sec                     ( +-  5.67% )  (35.68%)
         8,537,280      L1-dcache-load-misses     #  28.66% of all L1-dcache accesses ( +- 10.15% )  (35.68%)
         1,040,087      L1-icache-loads           #  1.030 M/sec                      ( +-  1.60% )  (35.68%)
             9,147      L1-icache-load-misses     #  0.85% of all L1-icache accesses  ( +-  6.50% )  (35.67%)
             1,084      dTLB-loads                #  1.073 K/sec                      ( +- 12.05% )  (35.68%)
               431      dTLB-load-misses          #  40.28% of all dTLB cache accesses ( +- 43.46% )  (35.68%)
                16      iTLB-load-misses          #  0.00% of all iTLB cache accesses ( +- 40.54% )  (35.68%)

          1.011281 +- 0.000624 seconds time elapsed  ( +-  0.06% )

Please feel free to add

Tested-by: Raghavendra K T <raghavendra.kt@amd.com>

Will come back with further observations on patch/performance if any.

Thanks and Regards
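A demand-fault test of the sort described above presumably looks something
like the following minimal sketch (the actual map_hugetlb_2M/map_hugetlb_1G
sources were not posted; the region size, mmap flags, and touch loop here are
assumptions based on the description):

#include <stdio.h>
#include <sys/mman.h>

#ifndef MAP_HUGE_SHIFT
#define MAP_HUGE_SHIFT	26
#endif
#ifndef MAP_HUGE_2MB
#define MAP_HUGE_2MB	(21 << MAP_HUGE_SHIFT)	/* log2(2MB) encoded in the mmap flags */
#endif

#define REGION_SZ	(64UL << 30)	/* 64GB test region */
#define HPAGE_SZ	(2UL << 20)	/* 2MB; use (1UL << 30) and MAP_HUGE_1GB for the 1GB variant */

int main(void)
{
	unsigned long off;
	char *p = mmap(NULL, REGION_SZ, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_2MB,
		       -1, 0);

	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* One store per hugepage => one demand fault => one hugepage cleared. */
	for (off = 0; off < REGION_SZ; off += HPAGE_SZ)
		p[off] = 1;

	munmap(p, REGION_SZ);
	return 0;
}

Run under the same harness as the report above, e.g.
"perf stat -r 10 -d -d numactl -m 0 -N 0 ./map_hugetlb_2M", with enough
hugepages reserved beforehand (e.g. via /proc/sys/vm/nr_hugepages for the 2MB
pool). The ~32k page-faults reported for the 2M runs match the 32768
hugepages such a 64GB region needs, so the fault path (and hence
clear_huge_page()) dominates the runtime.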
Raghavendra K T <raghavendra.kt@amd.com> writes:

> On 4/3/2023 10:52 AM, Ankur Arora wrote:
>> This series introduces multi-page clearing for hugepages.

>  *Milan*      mm/clear_huge_page   x86/clear_huge_page   change
>                     (GB/s)               (GB/s)
>  pg-sz=2MB          12.24                17.54           +43.30%
>  pg-sz=1GB          17.98                37.24          +107.11%
>
> Hello Ankur,
>
> Was able to test your patches. To summarize, am seeing 2x-3x perf
> improvement for 2M, 1GB base hugepage sizes.

Great. Thanks Raghavendra.

> SUT: Genoa AMD EPYC
> Thread(s) per core:  2
> Core(s) per socket:  128
> Socket(s):           2
>
> NUMA:
> NUMA node(s):        2
> NUMA node0 CPU(s):   0-127,256-383
> NUMA node1 CPU(s):   128-255,384-511
>
> Test: Use mmap(MAP_HUGETLB) to demand a fault on 64GB region (NUMA node0),
> for both base-hugepage-size=2M and 1GB
>
> perf stat -r 10 -d -d numactl -m 0 -N 0 <test>
>
> time in seconds elapsed (average of 10 runs) (lower = better)
>
> Result:
> page-size   mm/clear_huge_page   x86/clear_huge_page
> 2M               5.4567                2.6774
> 1G               2.64452               1.011281

So translating into BW, for Genoa we have:

 page-size   mm/clear_huge_page   x86/clear_huge_page
                   (GB/s)               (GB/s)
 2M                11.74                23.97
 1G                24.24                63.36

That's a pretty good bump over Milan:

>  *Milan*      mm/clear_huge_page   x86/clear_huge_page
>                     (GB/s)               (GB/s)
>  pg-sz=2MB          12.24                17.54
>  pg-sz=1GB          17.98                37.24

Btw, are these numbers with boost=1?

[ snipped: full quoted perf stat output; see the previous message for the numbers ]

> Please feel free to add
>
> Tested-by: Raghavendra K T <raghavendra.kt@amd.com>
Thanks
Ankur

> Will come back with further observations on patch/performance if any
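(For reference, the GB/s figures above appear to be simply region-size over
elapsed time for the 64GB test region: 64 GB / 5.4567 s ≈ 11.7 GB/s for the
unpatched 2M case, and 64 GB / 1.011 s ≈ 63.3 GB/s for the patched 1GB case,
which roughly lines up with the table in Ankur's reply.)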
On 4/9/2023 4:16 AM, Ankur Arora wrote:
>
> Raghavendra K T <raghavendra.kt@amd.com> writes:
>
>> On 4/3/2023 10:52 AM, Ankur Arora wrote:
>>> This series introduces multi-page clearing for hugepages.
>
>>  *Milan*      mm/clear_huge_page   x86/clear_huge_page   change
>>                     (GB/s)               (GB/s)
>>  pg-sz=2MB          12.24                17.54           +43.30%
>>  pg-sz=1GB          17.98                37.24          +107.11%
>>
>> Hello Ankur,
>>
>> Was able to test your patches. To summarize, am seeing 2x-3x perf
>> improvement for 2M, 1GB base hugepage sizes.
>
> Great. Thanks Raghavendra.
>
>> SUT: Genoa AMD EPYC
>> Thread(s) per core:  2
>> Core(s) per socket:  128
>> Socket(s):           2
>>
>> NUMA:
>> NUMA node(s):        2
>> NUMA node0 CPU(s):   0-127,256-383
>> NUMA node1 CPU(s):   128-255,384-511
>>
>> Test: Use mmap(MAP_HUGETLB) to demand a fault on 64GB region (NUMA node0),
>> for both base-hugepage-size=2M and 1GB
>>
>> perf stat -r 10 -d -d numactl -m 0 -N 0 <test>
>>
>> time in seconds elapsed (average of 10 runs) (lower = better)
>>
>> Result:
>> page-size   mm/clear_huge_page   x86/clear_huge_page
>> 2M               5.4567                2.6774
>> 1G               2.64452               1.011281
>
> So translating into BW, for Genoa we have:
>
>  page-size   mm/clear_huge_page   x86/clear_huge_page
>                    (GB/s)               (GB/s)
>  2M                11.74                23.97
>  1G                24.24                63.36
>
> That's a pretty good bump over Milan:
>
>>  *Milan*      mm/clear_huge_page   x86/clear_huge_page
>>                     (GB/s)               (GB/s)
>>  pg-sz=2MB          12.24                17.54
>>  pg-sz=1GB          17.98                37.24
>
> Btw, are these numbers with boost=1?

Yes, it is. Also, a note about the config: I had not enabled the
GCOV/LOCKSTAT related config options because I faced some issues with them.