[RFC,v3,0/3] pmem memmap dump support

Message ID: 20230602102656.131654-1-lizhijian@fujitsu.com

Message

Zhijian Li (Fujitsu) June 2, 2023, 10:26 a.m. UTC
Hello folks,

After sending out the previous version of this patch set, we received some comments,
and we really appreciate your input. However, as you can see, the patch set is
still at an early stage, especially in terms of solution selection, which may
still change.

Changes in V3:
Mainly building on what we learned from the first version, I implemented the
proposal suggested by Dan: in the kdump kernel, the device's superblock is read
through a device file interface to calculate the metadata range. In the second
version, by contrast, the first kernel wrote the metadata range to vmcoreinfo,
so that after a kdump occurred the kdump kernel could read it directly from
/proc/vmcore.

Comparing these two approaches, the advantage of V3 is that it needs fewer
kernel modifications, but the downside is that it introduces a new external
library, libndctl, to enumerate the namespaces, which couples makedumpfile more
tightly to ndctl.
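
For reference, the namespace enumeration at the heart of that coupling looks
roughly like the sketch below (a minimal example against libndctl's public
enumeration API from <ndctl/libndctl.h>, built with -lndctl; the real
makedumpfile change does more than print names):

#include <stdio.h>
#include <ndctl/libndctl.h>

int main(void)
{
        struct ndctl_ctx *ctx;
        struct ndctl_bus *bus;
        struct ndctl_region *region;
        struct ndctl_namespace *ndns;

        if (ndctl_new(&ctx) < 0)
                return 1;
        /* walk every bus/region/namespace to find the ones whose
         * infoblock must be read in the kdump kernel */
        ndctl_bus_foreach(ctx, bus)
                ndctl_region_foreach(bus, region)
                        ndctl_namespace_foreach(region, ndns)
                                printf("found %s\n",
                                       ndctl_namespace_get_devname(ndns));
        ndctl_unref(ctx);
        return 0;
}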

One important thing to note about both V2 and V3 is that they introduce a new
ELF program header flag, PF_DEV, to indicate that a range lives on a device.
I am not sure whether there are better alternatives, or whether we could use
this flag internally without exposing it in elf.h.
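
To make that concrete, the sketch below shows the shape of what we have in
mind; the bit value is exactly the open question, and 0x00100000 is only a
placeholder carved from the OS-specific flag range (PF_MASKOS):

#include <elf.h>
#include <stdbool.h>

#ifndef PF_DEV
#define PF_DEV 0x00100000       /* placeholder bit inside PF_MASKOS */
#endif

/* how a consumer such as makedumpfile would test a PT_LOAD entry */
static bool phdr_is_device_range(const Elf64_Phdr *phdr)
{
        return phdr->p_type == PT_LOAD && (phdr->p_flags & PF_DEV);
}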

We would greatly appreciate your feedback.

At the RFC stage, I have folded the patches for all three projects into this one cover letter for reviewing convenience:
kernel(3):
  nvdimm: set force_raw=1 in kdump kernel
  x86/crash: Add pmem region into PT_LOADs of vmcore
  kernel/kexec_file: Mark pmem region with new flag PF_DEV
kexec-tools(1):
  kexec: Add and mark pmem region into PT_LOADs
makedumpfile(3):
  elf_info.c: Introduce is_pmem_pt_load_range
  makedumpfile.c: Exclude all pmem pages
  makedumpfile: get metadata boundaries from pmem's infoblock

Currently, this RFC already implements support for case D (see the table below),
and cases A and B are deliberately disabled in makedumpfile.
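
To illustrate the last makedumpfile patch above ("get metadata boundaries from
pmem's infoblock"), here is a simplified sketch of the boundary calculation.
The struct mirrors only the leading fields of the kernel's struct pfn_sb
(drivers/nvdimm/pfn.h); the 4K infoblock offset and the dataoff interpretation
are my reading of the kernel sources, so treat them as assumptions:

#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

#define PFN_SIG "NVDIMM_PFN_INFO"       /* 15 chars + NUL = 16 bytes */
#define INFOBLOCK_OFF 4096              /* SZ_4K from the namespace base */

/* leading fields of the on-media infoblock, little-endian on disk
 * (x86_64 host assumed here); trailing fields and checksum omitted */
struct pfn_sb {
        uint8_t  signature[16];
        uint8_t  uuid[16];
        uint8_t  parent_uuid[16];
        uint32_t flags;
        uint16_t version_major;
        uint16_t version_minor;
        uint64_t dataoff;       /* user data start, relative to base + start_pad */
        uint64_t npfns;
        uint32_t mode;
        uint32_t start_pad;
} __attribute__((packed));

/* fill [*meta_start, *meta_end) with the metadata range of the
 * namespace starting at ns_base; returns 0 on success */
static int pmem_metadata_range(const char *raw_dev, uint64_t ns_base,
                               uint64_t *meta_start, uint64_t *meta_end)
{
        struct pfn_sb sb;
        int fd = open(raw_dev, O_RDONLY);

        if (fd < 0)
                return -1;
        if (pread(fd, &sb, sizeof(sb), INFOBLOCK_OFF) != sizeof(sb) ||
            memcmp(sb.signature, PFN_SIG, sizeof(sb.signature))) {
                close(fd);
                return -1;      /* no infoblock: raw or sector namespace */
        }
        close(fd);
        *meta_start = ns_base;                             /* infoblock + memmap */
        *meta_end   = ns_base + sb.start_pad + sb.dataoff; /* user data begins */
        return 0;
}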
---

Note that pmem memmap is also referred to as pmem metadata here.

### Background and motivation ###
---
Crash dump is an important feature for kernel troubleshooting. It is the last resort for finding out
what happened at a kernel panic, slowdown, and so on, and it is the most important tool for customer
support. However, part of the data on pmem is not included in the crash dump, which can make it
difficult to analyze problems around pmem (especially Filesystem-DAX).

A pmem namespace in "fsdax" or "devdax" mode requires the allocation of per-page metadata[1]. The
allocation can be drawn from either mem (system memory) or dev (the pmem device itself), e.g.
`ndctl create-namespace --mode=fsdax --map=dev`; see `ndctl help create-namespace` for more details.
In fsdax, the struct page array becomes very important: it is one of the key data structures for
determining the status of the reverse map.
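
For a sense of scale, and of why --map=dev matters for large namespaces: on
x86_64 the memmap costs one 64-byte struct page per 4 KiB page, i.e. about
1/64 of the capacity. A quick back-of-the-envelope check (sizes assumed for
x86_64, not taken from the patches):

#include <stdio.h>

int main(void)
{
        unsigned long long ns_bytes = 1ULL << 40;       /* 1 TiB namespace */
        unsigned long long pages    = ns_bytes / 4096;  /* 4 KiB pages */
        unsigned long long memmap   = pages * 64;       /* 64 B per struct page */

        /* prints 16384 MiB, i.e. 16 GiB of metadata for 1 TiB of pmem */
        printf("memmap for 1 TiB fsdax namespace: %llu MiB\n", memmap >> 20);
        return 0;
}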

So when the metadata is stored on pmem, even pmem's per-page metadata is not dumped, which means
troubleshooters are unable to examine pmem in any detail from the dumpfile.

### Adding pmem memmap dump support ###
---
Our goal is that, no matter whether the metadata is stored on mem or on pmem, it can be dumped, so
that the crash utilities can then read more details about the pmem. Of course, this feature can be
enabled or disabled.

First, based on our previous investigation, the location of the metadata and the scope of the dump
divide the problem into the following four cases: A, B, C and D.
Note that although cases A and B are listed below, we do not want these two cases to be part of
this feature: dumping the entire pmem would consume a lot of space and, more importantly, the dump
might contain sensitive user data.

+-------------+----------+------------+
|             |   metadata location   |
| dump scope  +----------+------------+
|             |  mem     |   PMEM     |
+-------------+----------+------------+
| entire pmem |     A    |     B      |
+-------------+----------+------------+
| metadata    |     C    |     D      |
+-------------+----------+------------+
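
As for what makedumpfile does with these cases, the filtering policy for C and
D reduces to the self-contained sketch below (range-based purely for
illustration; in the real patches the pmem range comes from the PF_DEV-marked
PT_LOADs and the metadata range from the infoblock):

#include <stdbool.h>

struct range { unsigned long long start, end; };        /* [start, end) */

static bool in_range(unsigned long long paddr, const struct range *r)
{
        return paddr >= r->start && paddr < r->end;
}

/* pmem: a PF_DEV-marked PT_LOAD; meta: its infoblock-derived metadata */
static bool pmem_page_dumpable(unsigned long long paddr,
                               const struct range *pmem,
                               const struct range *meta)
{
        if (!in_range(paddr, pmem))
                return true;            /* ordinary RAM: normal filtering applies */
        return in_range(paddr, meta);   /* keep metadata, drop userdata */
}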

### Testing ###
Only x86_64 has been tested. Please note that we have to disable libnvdimm in the 2nd kernel to
ensure that the metadata will not be touched again by the 2nd kernel.

The two commits below use sha256 to verify the metadata, hashing it in the 1st kernel at panic time and checking it in makedumpfile in the 2nd kernel:
https://github.com/zhijianli88/makedumpfile/commit/91a135be6980e6e87b9e00b909aaaf8ef9566ec0
https://github.com/zhijianli88/linux/commit/55bef07f8f0b2e587737b796e73b92f242947e5a

### TODO ###
Only x86 is fully supported for both kexec_load(2) and kexec_file_load(2);
kexec_file_load(2) on other architectures remains a TODO.
---
[1] Pmem region layout:
   ^<--namespace0.0---->^<--namespace0.1------>^
   |                    |                      |
   +--+m----------------+--+m------------------+---------------------+-+a
   |++|e                |++|e                  |                     |+|l
   |++|t                |++|t                  |                     |+|i
   |++|a                |++|a                  |                     |+|g
   |++|d  namespace0.0  |++|d  namespace0.1    |     un-allocated    |+|n
   |++|a    fsdax       |++|a     devdax       |                     |+|m
   |++|t                |++|t                  |                     |+|e
   +--+a----------------+--+a------------------+---------------------+-+n
   |                                                                   |t
   v<-----------------------pmem region------------------------------->v

[2] https://lore.kernel.org/linux-mm/70F971CF-1A96-4D87-B70C-B971C2A1747C@roc.cs.umass.edu/T/
[3] https://lore.kernel.org/linux-mm/3c752fc2-b6a0-2975-ffec-dba3edcf4155@fujitsu.com/

### makedumpfile output in case B ###
kdump.sh[224]: makedumpfile: version 1.7.2++ (released on 20 Oct 2022)
kdump.sh[224]: command line: makedumpfile -l --message-level 31 -d 31 /proc/vmcore /sysroot/var/crash/127.0.0.1-2023-04-21-02:50:57//vmcore-incomplete
kdump.sh[224]: sadump: does not have partition header
kdump.sh[224]: sadump: read dump device as unknown format
kdump.sh[224]: sadump: unknown format
kdump.sh[224]:                phys_start         phys_end       virt_start         virt_end  is_pmem
kdump.sh[224]: LOAD[ 0]          1000000          3c26000 ffffffff81000000 ffffffff83c26000    false
kdump.sh[224]: LOAD[ 1]           100000         7f000000 ffff888000100000 ffff88807f000000    false
kdump.sh[224]: LOAD[ 2]         bf000000         bffd7000 ffff8880bf000000 ffff8880bffd7000    false
kdump.sh[224]: LOAD[ 3]        100000000        140000000 ffff888100000000 ffff888140000000    false
kdump.sh[224]: LOAD[ 4]        140000000        23e200000 ffff888140000000 ffff88823e200000     true
kdump.sh[224]: Linux kdump
kdump.sh[224]: VMCOREINFO   :
kdump.sh[224]:   OSRELEASE=6.3.0-rc3-pmem-bad+
kdump.sh[224]:   BUILD-ID=0546bd82db93706799d3eea38194ac648790aa85
kdump.sh[224]:   PAGESIZE=4096
kdump.sh[224]: page_size    : 4096
kdump.sh[224]:   SYMBOL(init_uts_ns)=ffffffff82671300
kdump.sh[224]:   OFFSET(uts_namespace.name)=0
kdump.sh[224]:   SYMBOL(node_online_map)=ffffffff826bbe08
kdump.sh[224]:   SYMBOL(swapper_pg_dir)=ffffffff82446000
kdump.sh[224]:   SYMBOL(_stext)=ffffffff81000000
kdump.sh[224]:   SYMBOL(vmap_area_list)=ffffffff82585fb0
kdump.sh[224]:   SYMBOL(devm_memmap_vmcore_head)=ffffffff825603c0
kdump.sh[224]:   SIZE(devm_memmap_vmcore)=40
kdump.sh[224]:   OFFSET(devm_memmap_vmcore.entry)=0
kdump.sh[224]:   OFFSET(devm_memmap_vmcore.start)=16
kdump.sh[224]:   OFFSET(devm_memmap_vmcore.end)=24
kdump.sh[224]:   SYMBOL(mem_section)=ffff88813fff4000
kdump.sh[224]:   LENGTH(mem_section)=2048
kdump.sh[224]:   SIZE(mem_section)=16
kdump.sh[224]:   OFFSET(mem_section.section_mem_map)=0
...
kdump.sh[224]: STEP [Checking for memory holes  ] : 0.012699 seconds
kdump.sh[224]: STEP [Excluding unnecessary pages] : 0.538059 seconds
kdump.sh[224]: STEP [Copying data               ] : 0.995418 seconds
kdump.sh[224]: STEP [Copying data               ] : 0.000067 seconds
kdump.sh[224]: Writing erase info...
kdump.sh[224]: offset_eraseinfo: 5d02266, size_eraseinfo: 0
kdump.sh[224]: Original pages  : 0x00000000001c0cfd
kdump.sh[224]:   Excluded pages   : 0x00000000001a58d2
kdump.sh[224]:     Pages filled with zero  : 0x0000000000006805
kdump.sh[224]:     Non-private cache pages : 0x0000000000019e93
kdump.sh[224]:     Private cache pages     : 0x0000000000077572
kdump.sh[224]:     User process data pages : 0x0000000000002c3b
kdump.sh[224]:     Free pages              : 0x0000000000010e8d
kdump.sh[224]:     Hwpoison pages          : 0x0000000000000000
kdump.sh[224]:     Offline pages           : 0x0000000000000000
kdump.sh[224]:     pmem metadata pages     : 0x0000000000000000
kdump.sh[224]:     pmem userdata pages     : 0x00000000000fa200
kdump.sh[224]:   Remaining pages  : 0x000000000001b42b
kdump.sh[224]:   (The number of pages is reduced to 6%.)
kdump.sh[224]: Memory Hole     : 0x000000000007d503
kdump.sh[224]: --------------------------------------------------
kdump.sh[224]: Total pages     : 0x000000000023e200
kdump.sh[224]: Write bytes     : 97522590
kdump.sh[224]: Cache hit: 191669, miss: 292, hit rate: 99.8%
kdump.sh[224]: The dumpfile is saved to /sysroot/var/crash/127.0.0.1-2023-04-21-02:50:57//vmcore-incomplete.
kdump.sh[224]: makedumpfile Completed.

CC: Baoquan He <bhe@redhat.com>
CC: Borislav Petkov <bp@alien8.de>
CC: Dan Williams <dan.j.williams@intel.com>
CC: Dave Hansen <dave.hansen@linux.intel.com>
CC: Dave Jiang <dave.jiang@intel.com>
CC: Dave Young <dyoung@redhat.com>
CC: Eric Biederman <ebiederm@xmission.com>
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: Ingo Molnar <mingo@redhat.com>
CC: Ira Weiny <ira.weiny@intel.com>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Vishal Verma <vishal.l.verma@intel.com>
CC: Vivek Goyal <vgoyal@redhat.com>
CC: x86@kernel.org
CC: kexec@lists.infradead.org
CC: nvdimm@lists.linux.dev

Comments

Baoquan He June 4, 2023, 12:59 p.m. UTC | #1
Hi Zhijian,

On 06/02/23 at 06:26pm, Li Zhijian wrote:
> Hello folks,
> 
> After sending out the previous version of this patch set, we received some comments,
> and we really appreciate your input. However, as you can see, the patch set is
> still at an early stage, especially in terms of solution selection, which may
> still change.

Thanks for the effort to make this and improve it. I am adding Kazu and Simon to
the CC because they maintain the kexec-tools and makedumpfile utilities.

For easier reviewing, I would suggest splitting the patches into different
patchsets for the different components/repos: here there are obviously a kernel
patchset, a kexec-tools patch and a makedumpfile patchset.

The kernel patches look straightforward and clear; if Dan can approve them
from the nvdimm side, everything should be fine. Then we can focus on the
corresponding kexec-tools and makedumpfile support.

Thanks
Baoquan
Zhijian Li (Fujitsu) June 9, 2023, 1:21 a.m. UTC | #2
Baoquan,


On 04/06/2023 20:59, Baoquan He wrote:
> Hi Zhijian,
> 
> On 06/02/23 at 06:26pm, Li Zhijian wrote:
>> Hello folks,
>>
>> After sending out the previous version of this patch set, we received some comments,
>> and we really appreciate your input. However, as you can see, the patch set is
>> still at an early stage, especially in terms of solution selection, which may
>> still change.
> 
> Thanks for the effort to make this and improve it. I am adding Kazu and Simon to
> the CC because they maintain the kexec-tools and makedumpfile utilities.
> 
> For easier reviewing, I would suggest splitting the patches into different
> patchsets for the different components/repos: here there are obviously a kernel
> patchset, a kexec-tools patch and a makedumpfile patchset.

Thank you very much for your feedback.
Agreed, I will split them out once we reach agreement on the basic proposal.


Thanks
Zhijian

> 
> The kernel patches look straightforward and clear; if Dan can approve them
> from the nvdimm side, everything should be fine. Then we can focus on the
> corresponding kexec-tools and makedumpfile support.
> 
> Thanks
> Baoquan
>
Zhijian Li (Fujitsu) June 25, 2023, 10:27 a.m. UTC | #3
Kindly ping. I do need your feedback, especially some voices from the
nvdimm side. If any clarification is needed or my initial email requires
further detail, please do not hesitate to let me know; I am more than
willing to provide additional information.

Thanks
Zhijian

on 6/2/2023 6:26 PM, Li Zhijian wrote:
> Hello folks,
>
> After sending out the previous version of this patch set, we received some comments,
> and we really appreciate your input. [...]