Message ID: 20210419225047.3415425-4-dennis@kernel.org (mailing list archive)
State:      New, archived
Series:     percpu: partial chunk depopulation
Hi,

On Mon, Apr 19, 2021 at 10:50:46PM +0000, Dennis Zhou wrote:
> From: Roman Gushchin <guro@fb.com>
>
> This patch implements partial depopulation of percpu chunks.
>
> As of now, a chunk can be depopulated only as part of the final
> destruction, if there are no more outstanding allocations. However,
> to minimize memory waste it might be useful to depopulate a
> partially filled chunk, if a small number of outstanding allocations
> prevents the chunk from being fully reclaimed.
>
> This patch implements the following depopulation process: it scans
> over the chunk pages, looks for a range of empty and populated pages
> and performs the depopulation. To avoid races with new allocations,
> the chunk is isolated first. After the depopulation the chunk is
> sidelined to a special list or freed. New allocations prefer using
> active chunks to sidelined chunks. If a sidelined chunk is used, it is
> reintegrated into the active lists.
>
> The depopulation is scheduled on the free path if the chunk meets all
> of the following conditions:
> 1) it has more than 1/4 of its total pages free and populated
> 2) the system has enough free percpu pages aside from this chunk
> 3) it isn't the reserved chunk
> 4) it isn't the first chunk
> If a chunk is already depopulated but has gained free populated pages,
> it's a good target too. The chunk is moved to a special slot,
> pcpu_to_depopulate_slot, chunk->isolated is set, and the balance work
> item is scheduled. On isolation, these pages are removed from
> pcpu_nr_empty_pop_pages. The chunk is moved back to the
> to_depopulate_slot whenever it meets these qualifications.
>
> pcpu_reclaim_populated() iterates over the to_depopulate_slot until it
> becomes empty. The depopulation is performed in the reverse direction
> to keep populated pages close to the beginning. Depopulated chunks are
> sidelined so that new allocations preferentially avoid them. When no
> active chunk can satisfy a new allocation, sidelined chunks are checked
> before creating a new chunk.
>
> Signed-off-by: Roman Gushchin <guro@fb.com>
> Co-developed-by: Dennis Zhou <dennis@kernel.org>
> Signed-off-by: Dennis Zhou <dennis@kernel.org>

This patch results in a number of crashes and other odd behavior
when trying to boot mips images from Megasas controllers in qemu.
Sometimes the boot stalls, but I also see various crashes.
Some examples and bisect logs are attached.

Note: Bisect on mainline ended with

# first bad commit: [e267992f9ef0bf717d70a9ee18049782f77e4b3a] Merge branch 'for-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu

I then checked out the merge branch and ran a bisect there, which
points to this commit. I also rebased the merge branch onto v5.13
and bisected again. Bisect results were the same.

Guenter

---
...
sd 0:2:0:0: [sda] Add. Sense: Internal target failure
CPU 0 Unable to handle kernel paging request at virtual address 00000004, epc == 805cf8fc, ra == 802ff3b0
Oops[#1]:
CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.13.0-00005-g0bd2212ebd7a #1
$ 0   : 00000000 00000001 00000000 8258fc90
$ 4   : 825dbd40 820e7624 00000000 fffffff0
$ 8   : 80c70000 805e1a64 fffffffc 00000000
$12   : 81006d00 0000001f ffffffe0 00001e83
$16   : 00000000 825dbd30 80c70000 820e75f8
$20   : 8275c584 80cc4418 80c9409c 00000008
$24   : 0000004c 00000000
$28   : 8204c000 8204fc70 80c26c54 802ff3b0
Hi    : 0000004c
Lo    : 00000000
epc   : 805cf8fc rb_insert_color+0x1c/0x1e0
ra    : 802ff3b0 kernfs_link_sibling+0x94/0x120
Status: 1000a403 KERNEL EXL IE
Cause : 00800008 (ExcCode 02)
BadVA : 00000004
PrId  : 00019300 (MIPS 24Kc)
Modules linked in:
Process swapper/0 (pid: 1, threadinfo=(ptrval), task=(ptrval), tls=00000000)
Stack : 820e75f8 820e75f8 820e75f8 00000000 8275c584 fffffffe 825dbd30 8030084c
        820e75f8 803003f8 00000000 db668853 00000000 801655f4 00000000 00000000
        00000001 825dbd30 820e75f8 820e75f8 00000000 80300970 81006c80 82150fc0
        8204fd64 00000001 00000000 00000001 00000000 00000000 82150fc0 8275c580
        80c50000 80303dc8 82150fc0 8015ab94 81006c80 8015a960 00000000 8275c580
        ...
Call Trace:
[<805cf8fc>] rb_insert_color+0x1c/0x1e0
[<802ff3b0>] kernfs_link_sibling+0x94/0x120
[<8030084c>] kernfs_add_one+0xb8/0x184
[<80300970>] kernfs_create_dir_ns+0x58/0xb0
[<80303dc8>] sysfs_create_dir_ns+0x74/0x108
[<805ca51c>] kobject_add_internal+0xb4/0x364
[<805caaa0>] kobject_init_and_add+0x64/0xa8
[<8066f768>] bus_add_driver+0x98/0x230
[<806715a0>] driver_register+0x80/0x144
[<807c17b8>] usb_register_driver+0xa8/0x1c0
[<80cb89b8>] uas_init+0x44/0x78
[<8010065c>] do_one_initcall+0x50/0x1d4
[<80c95014>] kernel_init_freeable+0x20c/0x29c
[<80a66bd4>] kernel_init+0x14/0x118
[<80103098>] ret_from_kernel_thread+0x14/0x1c

Code: 30460001 14c00016 240afffc <8c460004> 34480001 10c30028 00404825 10c00012 00000000

---[ end trace bb7aba36814796cb ]---
Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b

---
scsi host0: Avago SAS based MegaRAID driver
ata_piix 0000:00:0a.1: enabling device (0000 -> 0001)
random: fast init done
scsi 0:2:0:0: Direct-Access     QEMU     QEMU HARDDISK    2.5+ PQ: 0 ANSI: 5
scsi host1: ata_piix
BUG: spinlock bad magic on CPU#0, kworker/u2:1/41
 lock: 0x82598a50, .magic: 00000000, .owner: <none>/-1, .owner_cpu: 0
CPU: 0 PID: 41 Comm: kworker/u2:1 Not tainted 5.13.0-00005-g0bd2212ebd7a #1
Workqueue: events_unbound async_run_entry_fn
Stack : 822839e4 80c4eee3 80c50000 80a54338 80c80000 801865e0 00000000 00000004
        822839e4 5e03e26b 80c50000 8014b4c8 80c50000 00000001 822839e0 8207e4c0
        00000000 00000000 80b7b9ac 82283828 00000001 8228383c 00000000 0000ffff
        00000008 00000007 00000280 822b7c00 80c50000 80c70000 00000000 80b80000
        00000003 00000000 80c50000 00000012 00000000 806591f8 00000000 80cf0000
        ...
Call Trace:
[<80109adc>] show_stack+0x84/0x11c
[<80a62b1c>] dump_stack+0xa8/0xe4
[<80181468>] do_raw_spin_lock+0xb0/0x128
[<80a70170>] _raw_spin_lock_irqsave+0x28/0x3c
[<80176640>] __wake_up_common_lock+0x68/0xe8
[<801766d4>] __wake_up+0x14/0x20
[<8054eb48>] percpu_ref_kill_and_confirm+0x120/0x178
[<80526d2c>] blk_freeze_queue_start+0x58/0x94
[<8051af0c>] blk_set_queue_dying+0x2c/0x60
[<8051afb4>] blk_cleanup_queue+0x40/0x130
[<806975b4>] __scsi_remove_device+0xd4/0x168
[<80693594>] scsi_probe_and_add_lun+0x53c/0xf44
[<806944c4>] __scsi_scan_target+0x158/0x754
[<80694eb4>] scsi_scan_host_selected+0x17c/0x2e0
[<806950c4>] do_scsi_scan_host+0xac/0xb4
[<806952f8>] do_scan_async+0x30/0x228
[<8015510c>] async_run_entry_fn+0x40/0x100
[<80148384>] process_one_work+0x170/0x428
[<80148be0>] worker_thread+0x188/0x578
[<80150d9c>] kthread+0x130/0x160
[<80103098>] ret_from_kernel_thread+0x14/0x1c

CPU 0 Unable to handle kernel paging request at virtual address 00000000, epc == 801764a0, ra == 80176664
Oops[#1]:
CPU: 0 PID: 41 Comm: kworker/u2:1 Not tainted 5.13.0-00005-g0bd2212ebd7a #1
Workqueue: events_unbound async_run_entry_fn
$ 0   : 00000000 00000001 00000000 00000000
$ 4   : 82283b14 00000003 00000000 00000000
$ 8   : 00000001 822837ac 00000000 0000ffff
$12   : 00000008 00000007 00000280 822b7c00
$16   : 82598a50 00000000 82283b08 00000003
$20   : 00000000 00000000 00000000 fffffff4
$24   : 00000000 806591f8
$28   : 82280000 82283ab8 82598a60 80176664
Hi    : 000000a7
Lo    : 3333335d
epc   : 801764a0 __wake_up_common+0x6c/0x1a4
ra    : 80176664 __wake_up_common_lock+0x8c/0xe8
Status: 1000a402 KERNEL EXL
Cause : 40808008 (ExcCode 02)
BadVA : 00000000
PrId  : 00019300 (MIPS 24Kc)
Modules linked in:
Process kworker/u2:1 (pid: 41, threadinfo=(ptrval), task=(ptrval), tls=00000000)
Stack : 82716400 82283c98 8246a880 8052cfe8 82598a50 00000000 00000000 00000000
        00000003 00000000 80c50000 00000012 80c80000 80176664 00000000 82126c80
        801762ec 82283afc 00000000 82283b08 00000000 00000000 00000000 82283b14
        82283b14 5e03e26b 825985c8 80c70000 00000001 00000000 80d10000 00000024
        00000003 801766d4 00000001 825985c8 80c70000 00000001 00000000 80d10000
        ...
Call Trace:
[<801764a0>] __wake_up_common+0x6c/0x1a4
[<80176664>] __wake_up_common_lock+0x8c/0xe8
[<801766d4>] __wake_up+0x14/0x20
[<8054eb48>] percpu_ref_kill_and_confirm+0x120/0x178
[<80526d2c>] blk_freeze_queue_start+0x58/0x94
[<8051af0c>] blk_set_queue_dying+0x2c/0x60
[<8051afb4>] blk_cleanup_queue+0x40/0x130
[<806975b4>] __scsi_remove_device+0xd4/0x168
[<80693594>] scsi_probe_and_add_lun+0x53c/0xf44
[<806944c4>] __scsi_scan_target+0x158/0x754
[<80694eb4>] scsi_scan_host_selected+0x17c/0x2e0
[<806950c4>] do_scsi_scan_host+0xac/0xb4
[<806952f8>] do_scan_async+0x30/0x228
[<8015510c>] async_run_entry_fn+0x40/0x100
[<80148384>] process_one_work+0x170/0x428
[<80148be0>] worker_thread+0x188/0x578
[<80150d9c>] kthread+0x130/0x160
[<80103098>] ret_from_kernel_thread+0x14/0x1c

---

megaraid_sas 0000:00:14.0: Max firmware commands: 1007 shared with default hw_queues = 1 poll_queues 0
scsi host0: Avago SAS based MegaRAID driver
ata_piix 0000:00:0a.1: enabling device (0000 -> 0001)
scsi 0:2:0:0: Direct-Access     QEMU     QEMU HARDDISK    2.5+ PQ: 0 ANSI: 5
scsi host1: ata_piix
scsi host2: ata_piix
CPU 0 Unable to handle kernel paging request at virtual address 00000000, epc == 00000000, ra == 8019d0b4
Oops[#1]:
CPU: 0 PID: 40 Comm: kworker/u2:1 Not tainted 5.13.0-07637-g3dbdb38e2869 #1
Workqueue: events_unbound async_run_entry_fn
$ 0   : 00000000 00000001 82568620 00000000
$ 4   : 82568620 00000200 8019d0b4 8212d580
$ 8   : ffffffe0 000003fc 00000000 81006d70
$12   : 81006d40 0000020c 00000000 80ab4400
$16   : 81007480 00000008 8201ff00 0000000a
$20   : 00000000 810074bc 80ced800 80cd0000
$24   : 000b0f1b 00000739
$28   : 82298000 8201fee8 8019d2f8 8019d0b4
Hi    : 00003f05
Lo    : 0000000f
epc   : 00000000 0x0
ra    : 8019d0b4 rcu_core+0x260/0x754
Status: 1000a403 KERNEL EXL IE
Cause : 00800008 (ExcCode 02)
BadVA : 00000000
PrId  : 00019300 (MIPS 24Kc)
Modules linked in:
Process kworker/u2:1 (pid: 40, threadinfo=(ptrval), task=(ptrval), tls=00000000)
Stack : 00000000 8018b544 ffffffc8 ffffffc8 80bde598 00000000 00000000 8201ff00
        00000048 2942dcc1 80ccc2c8 80cb8080 80d68358 0000000a 00000024 00000009
        00000100 80cb80a4 00000000 80aaac38 80cd0000 80cba400 80cba400 80191214
        00014680 2942dcc1 80cf9980 80ab3ce0 80bd9020 80d682f4 80d6e880 80d6e880
        ffff8fcf 80cd0000 80ab0000 04208060 80ccc2c8 00000001 00000020 80da0000
        ...
Call Trace:
[<8018b544>] __handle_irq_event_percpu+0xbc/0x184
[<80aaac38>] __do_softirq+0x190/0x33c
[<80191214>] handle_level_irq+0x130/0x1e8
[<80132fb8>] irq_exit+0x130/0x138
[<806112d0>] plat_irq_dispatch+0x9c/0x118
[<80103404>] handle_int+0x144/0x150

Code: (Bad address in epc)

---[ end trace 5d4c5bf55a0bb13f ]---
Kernel panic - not syncing: Fatal exception in interrupt
------------[ cut here ]------------

---
Bisect on mainline:

# bad: [3dbdb38e286903ec220aaf1fb29a8d94297da246] Merge branch 'for-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
# good: [007b350a58754a93ca9fe50c498cc27780171153] Merge tag 'dlm-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm
git bisect start '3dbdb38e2869' '007b350a5875'
# good: [b6df00789e2831fff7a2c65aa7164b2a4dcbe599] Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
git bisect good b6df00789e2831fff7a2c65aa7164b2a4dcbe599
# good: [990ec3014deedfed49e610cdc31dc6930ca63d8d] drm/amdgpu: add psp runtime db structures
git bisect good 990ec3014deedfed49e610cdc31dc6930ca63d8d
# good: [c288d9cd710433e5991d58a0764c4d08a933b871] Merge tag 'for-5.14/io_uring-2021-06-30' of git://git.kernel.dk/linux-block
git bisect good c288d9cd710433e5991d58a0764c4d08a933b871
# good: [514798d36572fb8eba6ccff3de10c9615063a7f5] Merge tag 'clk-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux
git bisect good 514798d36572fb8eba6ccff3de10c9615063a7f5
# good: [630e438f040c3838206b5e6717b9b5c29edf3548] RDMA/rtrs: Introduce head/tail wr
git bisect good 630e438f040c3838206b5e6717b9b5c29edf3548
# good: [a32b344e6f4375c5bdc3e89d0997b7eae187a3b1] Merge tag 'pinctrl-v5.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl
git bisect good a32b344e6f4375c5bdc3e89d0997b7eae187a3b1
# good: [cad065ed8d8831df67b9754cc4437ed55d8b48c0] MIPS: MT extensions are not available on MIPS32r1
git bisect good cad065ed8d8831df67b9754cc4437ed55d8b48c0
# good: [e4d777003a43feab2e000749163e531f6c48c385] percpu: optimize locking in pcpu_balance_workfn()
git bisect good e4d777003a43feab2e000749163e531f6c48c385
# bad: [e267992f9ef0bf717d70a9ee18049782f77e4b3a] Merge branch 'for-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu
git bisect bad e267992f9ef0bf717d70a9ee18049782f77e4b3a
# good: [ab3040e1379bd6fcc260f1f7558ee9c2da62766b] MIPS: Ingenic: Add MAC syscon nodes for Ingenic SoCs.
git bisect good ab3040e1379bd6fcc260f1f7558ee9c2da62766b
# good: [34c522a07ccbfb0e6476713b41a09f9f51a06c9f] MIPS: CI20: Add second percpu timer for SMP.
git bisect good 34c522a07ccbfb0e6476713b41a09f9f51a06c9f
# good: [19b438592238b3b40c3f945bb5f9c4ca971c0c45] Merge tag 'mips_5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux
git bisect good 19b438592238b3b40c3f945bb5f9c4ca971c0c45
# first bad commit: [e267992f9ef0bf717d70a9ee18049782f77e4b3a] Merge branch 'for-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu

---
Bisect on merge branch:

# bad: [e4d777003a43feab2e000749163e531f6c48c385] percpu: optimize locking in pcpu_balance_workfn()
# good: [d434405aaab7d0ebc516b68a8fc4100922d7f5ef] Linux 5.12-rc7
git bisect start 'HEAD' 'v5.12-rc7'
# bad: [f183324133ea535db4127f9fad3e19725ca88bf3] percpu: implement partial chunk depopulation
git bisect bad f183324133ea535db4127f9fad3e19725ca88bf3
# good: [67c2669d69fb5ada0f3b5123fb6ebf6fef9faee5] percpu: split __pcpu_balance_workfn()
git bisect good 67c2669d69fb5ada0f3b5123fb6ebf6fef9faee5
# good: [1c29a3ceaf5f02919e0a89119a70382581453dbb] percpu: use pcpu_free_slot instead of pcpu_nr_slots - 1
git bisect good 1c29a3ceaf5f02919e0a89119a70382581453dbb
# first bad commit: [f183324133ea535db4127f9fad3e19725ca88bf3] percpu: implement partial chunk depopulation

---
Bisect on rebased merge branch:

# bad: [737dc4074d4969ee54d7f781591bcc608fc6990f] percpu: optimize locking in pcpu_balance_workfn()
# good: [62fb9874f5da54fdb243003b386128037319b219] Linux 5.13
git bisect start 'HEAD' 'v5.13'
# bad: [0bd2212ebd7a02a6c0e870bb4b35abc321c203bc] percpu: implement partial chunk depopulation
git bisect bad 0bd2212ebd7a02a6c0e870bb4b35abc321c203bc
# good: [a7aebdb482a3aa87a61f6414a87f31eb657c41f6] percpu: split __pcpu_balance_workfn()
git bisect good a7aebdb482a3aa87a61f6414a87f31eb657c41f6
# good: [123a0c4318bb8cfb984f41c0499064c383dd9eee] percpu: use pcpu_free_slot instead of pcpu_nr_slots - 1
git bisect good 123a0c4318bb8cfb984f41c0499064c383dd9eee
# first bad commit: [0bd2212ebd7a02a6c0e870bb4b35abc321c203bc] percpu: implement partial chunk depopulation
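The narrowing steps described above can be retraced with plain git. Below is a
minimal sketch using the commit ids from the logs; treating the merge commit's
second parent (e267992f9ef0^2) as the tip of the percpu branch is an
assumption about how the branch was obtained, not part of the report:

    # Bisect the percpu branch itself: its tip is the second parent of the
    # merge commit that the mainline bisect flagged as first bad.
    git bisect start e267992f9ef0^2 v5.12-rc7

    # Repeat the bisect on a copy of the branch rebased onto v5.13.
    git checkout -b percpu-rebased e267992f9ef0^2
    git rebase v5.13
    git bisect start HEAD v5.13

git bisect then proposes commits to test until it reports the first bad
commit, as shown in the logs above.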
Hello,

On Fri, Jul 02, 2021 at 12:11:40PM -0700, Guenter Roeck wrote:
> Hi,
>
> On Mon, Apr 19, 2021 at 10:50:46PM +0000, Dennis Zhou wrote:
> > From: Roman Gushchin <guro@fb.com>
> >
> > This patch implements partial depopulation of percpu chunks.
> >
> > [ ... ]
>
> This patch results in a number of crashes and other odd behavior
> when trying to boot mips images from Megasas controllers in qemu.
> Sometimes the boot stalls, but I also see various crashes.
> Some examples and bisect logs are attached.

Ah, this doesn't look good.. Do you have a reproducer I could use to
debug this?

Thanks,
Dennis

> Note: Bisect on mainline ended with
>
> # first bad commit: [e267992f9ef0bf717d70a9ee18049782f77e4b3a] Merge branch 'for-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu
>
> I then checked out the merge branch and ran a bisect there, which
> points to this commit. I also rebased the merge branch onto v5.13
> and bisected again. Bisect results were the same.
>
> Guenter
>
> [ ... crash logs and bisect logs snipped; see the report above ... ]
On 7/2/21 12:45 PM, Dennis Zhou wrote:
> On Fri, Jul 02, 2021 at 12:11:40PM -0700, Guenter Roeck wrote:
>> This patch results in a number of crashes and other odd behavior
>> when trying to boot mips images from Megasas controllers in qemu.
>> Sometimes the boot stalls, but I also see various crashes.
>> Some examples and bisect logs are attached.
>
> Ah, this doesn't look good.. Do you have a reproducer I could use to
> debug this?
>

I copied the relevant information to http://server.roeck-us.net/qemu/mips/.

run.sh      - qemu command (I tried with qemu 6.0 and 4.2.1)
rootfs.ext2 - root file system
config      - complete configuration
defconfig   - shortened configuration
vmlinux     - a crashing kernel image (v5.13-7637-g3dbdb38e2869, with above
              configuration)

Interestingly, the crash doesn't always happen at the same location, even
with the same image. Some memory corruption, maybe?

Hope this helps. Please let me know if I can provide anything else.

Thanks,
Guenter
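The exact command line lives in run.sh at the URL above and is not reproduced
in this thread. For orientation only, a MIPS Malta guest with a megasas
controller is typically launched along the following lines; the board, memory
size, and root/console arguments here are guesses, not the contents of run.sh:

    qemu-system-mips -M malta -m 256 \
        -kernel vmlinux \
        -drive file=rootfs.ext2,format=raw,if=none,id=d0 \
        -device megasas -device scsi-hd,drive=d0 \
        -append "root=/dev/sda console=ttyS0" \
        -nographic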
On Fri, Jul 02, 2021 at 01:28:18PM -0700, Guenter Roeck wrote:
> On 7/2/21 12:45 PM, Dennis Zhou wrote:
> > [ ... ]
> >
> > Ah, this doesn't look good.. Do you have a reproducer I could use to
> > debug this?
> >
>
> I copied the relevant information to http://server.roeck-us.net/qemu/mips/.
>

This is perfect! I'm able to reproduce it.

> run.sh - qemu command (I tried with qemu 6.0 and 4.2.1)
> rootfs.ext2 - root file system
> config - complete configuration
> defconfig - shortened configuration
> vmlinux - a crashing kernel image (v5.13-7637-g3dbdb38e2869, with above
>           configuration)
>
> Interestingly, the crash doesn't always happen at the same location, even
> with the same image. Some memory corruption, maybe?
>

Well, a few factors matter: percpu gets placed in random places, and percpu
allocations may happen in a different order, which will cause different
freeing patterns. Then the problem patch may free the wrong backing page.
I'm working on it; x86 doesn't seem to have any immediate issues (fingers
crossed), so it must be some delta here.

> Hope this helps. Please let me know if I can provide anything else.
>
> Thanks,
> Guenter

Thanks,
Dennis
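As a side note, the chunk states involved here can be watched while
reproducing: the patch below annotates them in the percpu stats output.
Assuming the kernel is built with CONFIG_PERCPU_STATS=y and debugfs is
mounted, the chunk list can be dumped at runtime:

    mount -t debugfs none /sys/kernel/debug
    cat /sys/kernel/debug/percpu_stats

Chunks parked in the new slots show up as "Chunk (sidelined):" and
"Chunk (to_depopulate)" respectively.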
diff --git a/mm/percpu-internal.h b/mm/percpu-internal.h
index 095d7eaa0db4..10604dce806f 100644
--- a/mm/percpu-internal.h
+++ b/mm/percpu-internal.h
@@ -67,6 +67,8 @@ struct pcpu_chunk {
 
 	void			*data;		/* chunk data */
 	bool			immutable;	/* no [de]population allowed */
+	bool			isolated;	/* isolated from active chunk
+						   slots */
 	int			start_offset;	/* the overlap with the previous
 						   region to have a page aligned
 						   base_addr */
@@ -87,6 +89,8 @@ extern spinlock_t pcpu_lock;
 
 extern struct list_head *pcpu_chunk_lists;
 extern int pcpu_nr_slots;
+extern int pcpu_sidelined_slot;
+extern int pcpu_to_depopulate_slot;
 extern int pcpu_nr_empty_pop_pages[];
 
 extern struct pcpu_chunk *pcpu_first_chunk;
diff --git a/mm/percpu-km.c b/mm/percpu-km.c
index 35c9941077ee..c84a9f781a6c 100644
--- a/mm/percpu-km.c
+++ b/mm/percpu-km.c
@@ -118,3 +118,8 @@ static int __init pcpu_verify_alloc_info(const struct pcpu_alloc_info *ai)
 
 	return 0;
 }
+
+static bool pcpu_should_reclaim_chunk(struct pcpu_chunk *chunk)
+{
+	return false;
+}
diff --git a/mm/percpu-stats.c b/mm/percpu-stats.c
index f6026dbcdf6b..2125981acfb9 100644
--- a/mm/percpu-stats.c
+++ b/mm/percpu-stats.c
@@ -219,13 +219,15 @@ static int percpu_stats_show(struct seq_file *m, void *v)
 	for (slot = 0; slot < pcpu_nr_slots; slot++) {
 		list_for_each_entry(chunk, &pcpu_chunk_list(type)[slot],
 				    list) {
-			if (chunk == pcpu_first_chunk) {
+			if (chunk == pcpu_first_chunk)
 				seq_puts(m, "Chunk: <- First Chunk\n");
-				chunk_map_stats(m, chunk, buffer);
-			} else {
+			else if (slot == pcpu_to_depopulate_slot)
+				seq_puts(m, "Chunk (to_depopulate)\n");
+			else if (slot == pcpu_sidelined_slot)
+				seq_puts(m, "Chunk (sidelined):\n");
+			else
 				seq_puts(m, "Chunk:\n");
-				chunk_map_stats(m, chunk, buffer);
-			}
+			chunk_map_stats(m, chunk, buffer);
 		}
 	}
 }
diff --git a/mm/percpu-vm.c b/mm/percpu-vm.c
index e46f7a6917f9..c75f6f24f2d5 100644
--- a/mm/percpu-vm.c
+++ b/mm/percpu-vm.c
@@ -377,3 +377,33 @@ static int __init pcpu_verify_alloc_info(const struct pcpu_alloc_info *ai)
 	/* no extra restriction */
 	return 0;
 }
+
+/**
+ * pcpu_should_reclaim_chunk - determine if a chunk should go into reclaim
+ * @chunk: chunk of interest
+ *
+ * This is the entry point for percpu reclaim.  If a chunk qualifies, it is then
+ * isolated and managed in separate lists at the back of pcpu_slot: sidelined
+ * and to_depopulate respectively.  The to_depopulate list holds chunks slated
+ * for depopulation.  They no longer contribute to pcpu_nr_empty_pop_pages once
+ * they are on this list.  Once depopulated, they are moved onto the sidelined
+ * list which enables them to be pulled back in for allocation if no other chunk
+ * can suffice the allocation.
+ */
+static bool pcpu_should_reclaim_chunk(struct pcpu_chunk *chunk)
+{
+	/* do not reclaim either the first chunk or reserved chunk */
+	if (chunk == pcpu_first_chunk || chunk == pcpu_reserved_chunk)
+		return false;
+
+	/*
+	 * If it is isolated, it may be on the sidelined list so move it back to
+	 * the to_depopulate list.  If we hit at least 1/4 pages empty pages AND
+	 * there is no system-wide shortage of empty pages aside from this
+	 * chunk, move it to the to_depopulate list.
+	 */
+	return ((chunk->isolated && chunk->nr_empty_pop_pages) ||
+		(pcpu_nr_empty_pop_pages[pcpu_chunk_type(chunk)] >
+		 PCPU_EMPTY_POP_PAGES_HIGH + chunk->nr_empty_pop_pages &&
+		 chunk->nr_empty_pop_pages >= chunk->nr_pages / 4));
+}
diff --git a/mm/percpu.c b/mm/percpu.c
index d462222f4adc..79eebc80860d 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -136,6 +136,8 @@ static int pcpu_nr_units __ro_after_init;
 static int pcpu_atom_size __ro_after_init;
 int pcpu_nr_slots __ro_after_init;
 int pcpu_free_slot __ro_after_init;
+int pcpu_sidelined_slot __ro_after_init;
+int pcpu_to_depopulate_slot __ro_after_init;
 static size_t pcpu_chunk_struct_size __ro_after_init;
 
 /* cpus with the lowest and highest unit addresses */
@@ -562,10 +564,41 @@ static void pcpu_chunk_relocate(struct pcpu_chunk *chunk, int oslot)
 {
 	int nslot = pcpu_chunk_slot(chunk);
 
+	/* leave isolated chunks in-place */
+	if (chunk->isolated)
+		return;
+
 	if (oslot != nslot)
 		__pcpu_chunk_move(chunk, nslot, oslot < nslot);
 }
 
+static void pcpu_isolate_chunk(struct pcpu_chunk *chunk)
+{
+	enum pcpu_chunk_type type = pcpu_chunk_type(chunk);
+	struct list_head *pcpu_slot = pcpu_chunk_list(type);
+
+	lockdep_assert_held(&pcpu_lock);
+
+	if (!chunk->isolated) {
+		chunk->isolated = true;
+		pcpu_nr_empty_pop_pages[type] -= chunk->nr_empty_pop_pages;
+	}
+	list_move(&chunk->list, &pcpu_slot[pcpu_to_depopulate_slot]);
+}
+
+static void pcpu_reintegrate_chunk(struct pcpu_chunk *chunk)
+{
+	enum pcpu_chunk_type type = pcpu_chunk_type(chunk);
+
+	lockdep_assert_held(&pcpu_lock);
+
+	if (chunk->isolated) {
+		chunk->isolated = false;
+		pcpu_nr_empty_pop_pages[type] += chunk->nr_empty_pop_pages;
+		pcpu_chunk_relocate(chunk, -1);
+	}
+}
+
 /*
  * pcpu_update_empty_pages - update empty page counters
  * @chunk: chunk of interest
@@ -578,7 +611,7 @@ static void pcpu_chunk_relocate(struct pcpu_chunk *chunk, int oslot)
 static inline void pcpu_update_empty_pages(struct pcpu_chunk *chunk, int nr)
 {
 	chunk->nr_empty_pop_pages += nr;
-	if (chunk != pcpu_reserved_chunk)
+	if (chunk != pcpu_reserved_chunk && !chunk->isolated)
 		pcpu_nr_empty_pop_pages[pcpu_chunk_type(chunk)] += nr;
 }
 
@@ -1778,7 +1811,7 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
 
 restart:
 	/* search through normal chunks */
-	for (slot = pcpu_size_to_slot(size); slot < pcpu_nr_slots; slot++) {
+	for (slot = pcpu_size_to_slot(size); slot <= pcpu_free_slot; slot++) {
 		list_for_each_entry_safe(chunk, next, &pcpu_slot[slot], list) {
 			off = pcpu_find_block_fit(chunk, bits, bit_align,
 						  is_atomic);
@@ -1789,9 +1822,10 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
 			}
 
 			off = pcpu_alloc_area(chunk, bits, bit_align, off);
-			if (off >= 0)
+			if (off >= 0) {
+				pcpu_reintegrate_chunk(chunk);
 				goto area_found;
-
+			}
 		}
 	}
 
@@ -1952,10 +1986,13 @@ void __percpu *__alloc_reserved_percpu(size_t size, size_t align)
 /**
  * pcpu_balance_free - manage the amount of free chunks
  * @type: chunk type
+ * @empty_only: free chunks only if there are no populated pages
  *
- * Reclaim all fully free chunks except for the first one.
+ * If empty_only is %false, reclaim all fully free chunks regardless of the
+ * number of populated pages.  Otherwise, only reclaim chunks that have no
+ * populated pages.
 */
-static void pcpu_balance_free(enum pcpu_chunk_type type)
+static void pcpu_balance_free(enum pcpu_chunk_type type, bool empty_only)
 {
 	LIST_HEAD(to_free);
 	struct list_head *pcpu_slot = pcpu_chunk_list(type);
@@ -1975,7 +2012,8 @@ static void pcpu_balance_free(enum pcpu_chunk_type type)
 		if (chunk == list_first_entry(free_head, struct pcpu_chunk, list))
 			continue;
 
-		list_move(&chunk->list, &to_free);
+		if (!empty_only || chunk->nr_empty_pop_pages == 0)
+			list_move(&chunk->list, &to_free);
 	}
 
 	spin_unlock_irq(&pcpu_lock);
@@ -2083,20 +2121,121 @@ static void pcpu_balance_populated(enum pcpu_chunk_type type)
 	}
 }
 
+/**
+ * pcpu_reclaim_populated - scan over to_depopulate chunks and free empty pages
+ * @type: chunk type
+ *
+ * Scan over chunks in the depopulate list and try to release unused populated
+ * pages back to the system.  Depopulated chunks are sidelined to prevent
+ * repopulating these pages unless required.  Fully free chunks are reintegrated
+ * and freed accordingly (1 is kept around).  If we drop below the empty
+ * populated pages threshold, reintegrate the chunk if it has empty free pages.
+ * Each chunk is scanned in the reverse order to keep populated pages close to
+ * the beginning of the chunk.
+ */
+static void pcpu_reclaim_populated(enum pcpu_chunk_type type)
+{
+	struct list_head *pcpu_slot = pcpu_chunk_list(type);
+	struct pcpu_chunk *chunk;
+	struct pcpu_block_md *block;
+	int i, end;
+
+	spin_lock_irq(&pcpu_lock);
+
+restart:
+	/*
+	 * Once a chunk is isolated to the to_depopulate list, the chunk is no
+	 * longer discoverable to allocations whom may populate pages.  The only
+	 * other accessor is the free path which only returns area back to the
+	 * allocator not touching the populated bitmap.
+	 */
+	while (!list_empty(&pcpu_slot[pcpu_to_depopulate_slot])) {
+		chunk = list_first_entry(&pcpu_slot[pcpu_to_depopulate_slot],
+					 struct pcpu_chunk, list);
+		WARN_ON(chunk->immutable);
+
+		/*
+		 * Scan chunk's pages in the reverse order to keep populated
+		 * pages close to the beginning of the chunk.
+		 */
+		for (i = chunk->nr_pages - 1, end = -1; i >= 0; i--) {
+			/* no more work to do */
+			if (chunk->nr_empty_pop_pages == 0)
+				break;
+
+			/* reintegrate chunk to prevent atomic alloc failures */
+			if (pcpu_nr_empty_pop_pages[type] <
+			    PCPU_EMPTY_POP_PAGES_HIGH) {
+				pcpu_reintegrate_chunk(chunk);
+				goto restart;
+			}
+
+			/*
+			 * If the page is empty and populated, start or
+			 * extend the (i, end) range.  If i == 0, decrease
+			 * i and perform the depopulation to cover the last
+			 * (first) page in the chunk.
+			 */
+			block = chunk->md_blocks + i;
+			if (block->contig_hint == PCPU_BITMAP_BLOCK_BITS &&
+			    test_bit(i, chunk->populated)) {
+				if (end == -1)
+					end = i;
+				if (i > 0)
+					continue;
+				i--;
+			}
+
+			/* depopulate if there is an active range */
+			if (end == -1)
+				continue;
+
+			spin_unlock_irq(&pcpu_lock);
+			pcpu_depopulate_chunk(chunk, i + 1, end + 1);
+			cond_resched();
+			spin_lock_irq(&pcpu_lock);
+
+			pcpu_chunk_depopulated(chunk, i + 1, end + 1);
+
+			/* reset the range and continue */
+			end = -1;
+		}
+
+		if (chunk->free_bytes == pcpu_unit_size)
+			pcpu_reintegrate_chunk(chunk);
+		else
+			list_move(&chunk->list,
+				  &pcpu_slot[pcpu_sidelined_slot]);
+	}
+
+	spin_unlock_irq(&pcpu_lock);
+}
+
 /**
  * pcpu_balance_workfn - manage the amount of free chunks and populated pages
  * @work: unused
  *
- * Call pcpu_balance_free() and pcpu_balance_populated() for each chunk type.
+ * For each chunk type, manage the number of fully free chunks and the number of
+ * populated pages.  An important thing to consider is when pages are freed and
+ * how they contribute to the global counts.
 */
 static void pcpu_balance_workfn(struct work_struct *work)
 {
 	enum pcpu_chunk_type type;
 
+	/*
+	 * pcpu_balance_free() is called twice because the first time we may
+	 * trim pages in the active pcpu_nr_empty_pop_pages which may cause us
+	 * to grow other chunks.  This then gives pcpu_reclaim_populated() time
+	 * to move fully free chunks to the active list to be freed if
+	 * appropriate.
+	 */
 	for (type = 0; type < PCPU_NR_CHUNK_TYPES; type++) {
 		mutex_lock(&pcpu_alloc_mutex);
-		pcpu_balance_free(type);
+		pcpu_balance_free(type, false);
+		pcpu_reclaim_populated(type);
 		pcpu_balance_populated(type);
+		pcpu_balance_free(type, true);
 		mutex_unlock(&pcpu_alloc_mutex);
 	}
 }
@@ -2137,8 +2276,12 @@ void free_percpu(void __percpu *ptr)
 
 	pcpu_memcg_free_hook(chunk, off, size);
 
-	/* if there are more than one fully free chunks, wake up grim reaper */
-	if (chunk->free_bytes == pcpu_unit_size) {
+	/*
+	 * If there are more than one fully free chunks, wake up grim reaper.
+	 * If the chunk is isolated, it may be in the process of being
+	 * reclaimed.  Let reclaim manage cleaning up of that chunk.
+	 */
+	if (!chunk->isolated && chunk->free_bytes == pcpu_unit_size) {
 		struct pcpu_chunk *pos;
 
 		list_for_each_entry(pos, &pcpu_slot[pcpu_free_slot], list)
@@ -2146,6 +2289,9 @@ void free_percpu(void __percpu *ptr)
 				need_balance = true;
 				break;
 			}
+	} else if (pcpu_should_reclaim_chunk(chunk)) {
+		pcpu_isolate_chunk(chunk);
+		need_balance = true;
 	}
 
 	trace_percpu_free_percpu(chunk->base_addr, off, ptr);
@@ -2560,11 +2706,15 @@ void __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai,
 	pcpu_stats_save_ai(ai);
 
 	/*
-	 * Allocate chunk slots.  The additional last slot is for
-	 * empty chunks.
+	 * Allocate chunk slots.  The slots after the active slots are:
+	 *   sidelined_slot - isolated, depopulated chunks
+	 *   free_slot - fully free chunks
+	 *   to_depopulate_slot - isolated, chunks to depopulate
	 */
-	pcpu_free_slot = __pcpu_size_to_slot(pcpu_unit_size) + 1;
-	pcpu_nr_slots = pcpu_free_slot + 1;
+	pcpu_sidelined_slot = __pcpu_size_to_slot(pcpu_unit_size) + 1;
+	pcpu_free_slot = pcpu_sidelined_slot + 1;
+	pcpu_to_depopulate_slot = pcpu_free_slot + 1;
+	pcpu_nr_slots = pcpu_to_depopulate_slot + 1;
 	pcpu_chunk_lists = memblock_alloc(pcpu_nr_slots *
 					  sizeof(pcpu_chunk_lists[0]) *
 					  PCPU_NR_CHUNK_TYPES,