[RFC,v2,0/3] pmem memmap dump support

Message ID 20230427101838.12267-1-lizhijian@fujitsu.com (mailing list archive)

Message

Zhijian Li (Fujitsu) April 27, 2023, 10:18 a.m. UTC
Hello folks,

About 2 months ago, we posted our first RFC[3] and received your kind feedback. Thank you :)
Now, I'm back with the code.

Currently, this RFC implements support for case D*. Cases A and B are deliberately disabled
in makedumpfile. The series includes changes to 3 source trees, as shown below:
-----------+-------------------------------------------------------------------+
Source     |                      changes                                      |
-----------+-------------------------------------------------------------------+
I.         | 1. export a linked list(devm_memmap_vmcore) to vmcoreinfo         |
kernel     | 2. link metadata region to the linked list                        |
           | 3. mark the whole pmem's PT_LOAD for kexec_file_load(2) syscall   |
-----------+-------------------------------------------------------------------+
II. kexec- | 1. mark the whole pmem's PT_LOAD for kexec_load(2) syscall        |
tools      |                                                                   |
-----------+-------------------------------------------------------------------+
III.       | 1. restore the linked list from devm_memmap_vmcore in vmcoreinfo  |
makedump-  | 2. skip pmem userdata region(relies on I.3 and II.1)              |
file       | 3. exclude pmem metadata region if needed                         |
-----------+-------------------------------------------------------------------+
* See the following section for a description of the cases.

At this RFC stage, I have folded all 3 projects into this one cover letter for ease of review.
kernel:
  crash: export dev memmap header to vmcoreinfo
  drivers/nvdimm: export memmap of namespace to vmcoreinfo
  resource, crash: Make kexec_file_load support pmem
kexec-tools:
  kexec: Add and mark pmem region into PT_LOADs
makedumpfile:
  elf_info.c: Introduce is_pmem_pt_load_range
  makedumpfile.c: Exclude all pmem pages
  makedumpfile.c: Allow excluding metadata of pmem region
---

In this series, "pmem memmap" and "pmem metadata" mean the same thing.

### Background and motivation overview ###
---
Crash dump is an important feature for troubleshooting the kernel. It is the last resort for finding out what
happened at a kernel panic, slowdown, and so on. It is the most important tool for customer support.
However, part of the data on pmem is not included in the crash dump, which can make it difficult to analyze
problems around pmem (especially Filesystem-DAX).

A pmem namespace in "fsdax" or "devdax" mode requires allocation of per-page metadata[1]. The allocation
can be drawn from either mem (system memory) or dev (the pmem device); see `ndctl help create-namespace` for
more details. In fsdax, the struct page array becomes very important: it is one of the key pieces of data
needed to find the status of the reverse map.
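
To give a feel for how much per-page metadata is involved, here is a minimal sketch of the arithmetic. It assumes 4 KiB pages and a 64-byte struct page (typical on x86_64, but not stated in this letter), and it ignores the infoblock and any alignment padding. The names memmap_bytes, PAGE_SIZE_BYTES and STRUCT_PAGE_BYTES are made up for illustration.

```c
#include <stdint.h>

#define PAGE_SIZE_BYTES   4096u  /* assumption: 4 KiB base pages */
#define STRUCT_PAGE_BYTES 64u    /* assumption: sizeof(struct page) on x86_64 */

/* Rough size of the struct page array needed to describe ns_bytes of pmem. */
static uint64_t memmap_bytes(uint64_t ns_bytes)
{
	uint64_t npages = ns_bytes / PAGE_SIZE_BYTES;

	return npages * STRUCT_PAGE_BYTES;
}
```

Under these assumptions a 1 TiB namespace needs about 16 GiB of metadata, which is why where that metadata lives (mem or dev) matters for the dump.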

So, when the metadata is stored on pmem, even pmem's per-page metadata is not dumped. That means
troubleshooters are unable to check more details about pmem from the dumpfile.

### Adding pmem memmap dump support ###
---
Our goal is that, whether the metadata is stored on mem or pmem, it can be dumped so that the
crash utilities can read more details about the pmem. Of course, this feature can be enabled/disabled.

First, based on our previous investigation, we can divide the problem into four cases, A, B, C and D,
according to the location of the metadata and the scope of the dump.
Note that although cases A and B are mentioned below, we do not want them to be
part of this feature, because dumping the entire pmem would consume a lot of space and, more importantly,
the dump may contain sensitive user data.

+-------------+-----------------------+
|             |   metadata location   |
| dump scope  +----------+------------+
|             |   mem    |   PMEM     |
+-------------+----------+------------+
| entire pmem |    A     |     B      |
+-------------+----------+------------+
| metadata    |    C     |     D      |
+-------------+----------+------------+

### Testing ###
Only x86_64 has been tested. Please note that we have to disable the 2nd kernel's libnvdimm to ensure that the
metadata is not touched again in the 2nd kernel.

The 2 commits below use sha256 to check the metadata in the 1st kernel during panic and in makedumpfile in the 2nd kernel.
https://github.com/zhijianli88/makedumpfile/commit/91a135be6980e6e87b9e00b909aaaf8ef9566ec0
https://github.com/zhijianli88/linux/commit/55bef07f8f0b2e587737b796e73b92f242947e5a

### TODO ###
Only x86 is fully supported for both kexec_load(2) and kexec_file_load(2).
kexec_file_load(2) on other architectures is still a TODO.
---
[1] Pmem region layout:
   ^<--namespace0.0---->^<--namespace0.1------>^
   |                    |                      |
   +--+m----------------+--+m------------------+---------------------+-+a
   |++|e                |++|e                  |                     |+|l
   |++|t                |++|t                  |                     |+|i
   |++|a                |++|a                  |                     |+|g
   |++|d  namespace0.0  |++|d  namespace0.1    |     un-allocated    |+|n
   |++|a    fsdax       |++|a     devdax       |                     |+|m
   |++|t                |++|t                  |                     |+|e
   +--+a----------------+--+a------------------+---------------------+-+n
   |                                                                   |t
   v<-----------------------pmem region------------------------------->v

[2] https://lore.kernel.org/linux-mm/70F971CF-1A96-4D87-B70C-B971C2A1747C@roc.cs.umass.edu/T/
[3] https://lore.kernel.org/linux-mm/3c752fc2-b6a0-2975-ffec-dba3edcf4155@fujitsu.com/

### makedumpfile output in case B ###
kdump.sh[224]: makedumpfile: version 1.7.2++ (released on 20 Oct 2022)
kdump.sh[224]: command line: makedumpfile -l --message-level 31 -d 31 /proc/vmcore /sysroot/var/crash/127.0.0.1-2023-04-21-02:50:57//vmcore-incomplete
kdump.sh[224]: sadump: does not have partition header
kdump.sh[224]: sadump: read dump device as unknown format
kdump.sh[224]: sadump: unknown format
kdump.sh[224]:                phys_start         phys_end       virt_start         virt_end  is_pmem
kdump.sh[224]: LOAD[ 0]          1000000          3c26000 ffffffff81000000 ffffffff83c26000    false
kdump.sh[224]: LOAD[ 1]           100000         7f000000 ffff888000100000 ffff88807f000000    false
kdump.sh[224]: LOAD[ 2]         bf000000         bffd7000 ffff8880bf000000 ffff8880bffd7000    false
kdump.sh[224]: LOAD[ 3]        100000000        140000000 ffff888100000000 ffff888140000000    false
kdump.sh[224]: LOAD[ 4]        140000000        23e200000 ffff888140000000 ffff88823e200000     true
kdump.sh[224]: Linux kdump
kdump.sh[224]: VMCOREINFO   :
kdump.sh[224]:   OSRELEASE=6.3.0-rc3-pmem-bad+
kdump.sh[224]:   BUILD-ID=0546bd82db93706799d3eea38194ac648790aa85
kdump.sh[224]:   PAGESIZE=4096
kdump.sh[224]: page_size    : 4096
kdump.sh[224]:   SYMBOL(init_uts_ns)=ffffffff82671300
kdump.sh[224]:   OFFSET(uts_namespace.name)=0
kdump.sh[224]:   SYMBOL(node_online_map)=ffffffff826bbe08
kdump.sh[224]:   SYMBOL(swapper_pg_dir)=ffffffff82446000
kdump.sh[224]:   SYMBOL(_stext)=ffffffff81000000
kdump.sh[224]:   SYMBOL(vmap_area_list)=ffffffff82585fb0
kdump.sh[224]:   SYMBOL(devm_memmap_vmcore_head)=ffffffff825603c0
kdump.sh[224]:   SIZE(devm_memmap_vmcore)=40
kdump.sh[224]:   OFFSET(devm_memmap_vmcore.entry)=0
kdump.sh[224]:   OFFSET(devm_memmap_vmcore.start)=16
kdump.sh[224]:   OFFSET(devm_memmap_vmcore.end)=24
kdump.sh[224]:   SYMBOL(mem_section)=ffff88813fff4000
kdump.sh[224]:   LENGTH(mem_section)=2048
kdump.sh[224]:   SIZE(mem_section)=16
kdump.sh[224]:   OFFSET(mem_section.section_mem_map)=0
...
kdump.sh[224]: STEP [Checking for memory holes  ] : 0.012699 seconds
kdump.sh[224]: STEP [Excluding unnecessary pages] : 0.538059 seconds
kdump.sh[224]: STEP [Copying data               ] : 0.995418 seconds
kdump.sh[224]: STEP [Copying data               ] : 0.000067 seconds
kdump.sh[224]: Writing erase info...
kdump.sh[224]: offset_eraseinfo: 5d02266, size_eraseinfo: 0
kdump.sh[224]: Original pages  : 0x00000000001c0cfd
kdump.sh[224]:   Excluded pages   : 0x00000000001a58d2
kdump.sh[224]:     Pages filled with zero  : 0x0000000000006805
kdump.sh[224]:     Non-private cache pages : 0x0000000000019e93
kdump.sh[224]:     Private cache pages     : 0x0000000000077572
kdump.sh[224]:     User process data pages : 0x0000000000002c3b
kdump.sh[224]:     Free pages              : 0x0000000000010e8d
kdump.sh[224]:     Hwpoison pages          : 0x0000000000000000
kdump.sh[224]:     Offline pages           : 0x0000000000000000
kdump.sh[224]:     pmem metadata pages     : 0x0000000000000000
kdump.sh[224]:     pmem userdata pages     : 0x00000000000fa200
kdump.sh[224]:   Remaining pages  : 0x000000000001b42b
kdump.sh[224]:   (The number of pages is reduced to 6%.)
kdump.sh[224]: Memory Hole     : 0x000000000007d503
kdump.sh[224]: --------------------------------------------------
kdump.sh[224]: Total pages     : 0x000000000023e200
kdump.sh[224]: Write bytes     : 97522590
kdump.sh[224]: Cache hit: 191669, miss: 292, hit rate: 99.8%
kdump.sh[224]: The dumpfile is saved to /sysroot/var/crash/127.0.0.1-2023-04-21-02:50:57//vmcore-incomplete.
kdump.sh[224]: makedumpfile Completed.
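
The SIZE and OFFSET entries for devm_memmap_vmcore in the VMCOREINFO above are enough to sketch the record a dump tool would read from the first kernel's memory. The layout below is only a guess that is consistent with those numbers (SIZE=40, entry at 0, start at 16, end at 24). The real kernel definition is not shown in this cover letter, so the type names, the field types and the trailing `unknown` member are all hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the kernel's struct list_head as seen in a 64-bit dump. */
struct list_head_sketch {
	uint64_t next;
	uint64_t prev;
};

/* Hypothetical mirror of one devm_memmap_vmcore entry. */
struct devm_memmap_vmcore_sketch {
	struct list_head_sketch entry; /* OFFSET(devm_memmap_vmcore.entry)=0 */
	uint64_t start;                /* OFFSET(devm_memmap_vmcore.start)=16 */
	uint64_t end;                  /* OFFSET(devm_memmap_vmcore.end)=24 */
	uint64_t unknown;              /* filler: SIZE=40 leaves 8 bytes unaccounted for */
};

/* Compile-time checks against the values printed in the log. */
_Static_assert(offsetof(struct devm_memmap_vmcore_sketch, entry) == 0, "entry");
_Static_assert(offsetof(struct devm_memmap_vmcore_sketch, start) == 16, "start");
_Static_assert(offsetof(struct devm_memmap_vmcore_sketch, end) == 24, "end");
_Static_assert(sizeof(struct devm_memmap_vmcore_sketch) == 40, "size");
```

A tool that honours VMCOREINFO would normally use the exported offsets at run time instead of hard-coding a struct like this; the sketch only makes the exported numbers easier to picture.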

Comments

Dan Williams April 28, 2023, 6:59 p.m. UTC | #1
Li Zhijian wrote:
> Hello folks,
> 
> About 2 months ago, we posted our first RFC[3] and received your kindly feedback. Thank you :)
> Now, I'm back with the code.
> 
> Currently, this RFC has already implemented to supported case D*. And the case A&B is disabled
> deliberately in makedumpfile. It includes changes in 3 source code as below:

I think the reason this patchkit is difficult to follow is that it
spends a lot of time describing a chosen solution, but not enough time
describing the problem and the tradeoffs.

For example why is updating /proc/vmcore with pmem metadata the chosen
solution? Why not leave the kernel out of it and have makedumpfile
tooling aware of how to parse persistent memory namespace info-blocks
and retrieve that dump itself? This is what I proposed here:

http://lore.kernel.org/r/641484f7ef780_a52e2940@dwillia2-mobl3.amr.corp.intel.com.notmuch

...but never got an answer, or I missed the answer.
Zhijian Li (Fujitsu) May 8, 2023, 9:45 a.m. UTC | #2
Dan,


On 29/04/2023 02:59, Dan Williams wrote:
> Li Zhijian wrote:
>> Hello folks,
>>
>> About 2 months ago, we posted our first RFC[3] and received your kindly feedback. Thank you :)
>> Now, I'm back with the code.
>>
>> Currently, this RFC has already implemented to supported case D*. And the case A&B is disabled
>> deliberately in makedumpfile. It includes changes in 3 source code as below:
> 
> I think the reason this patchkit is difficult to follow is that it
> spends a lot of time describing a chosen solution, but not enough time
> describing the problem and the tradeoffs.
> 
> For example why is updating /proc/vmcore with pmem metadata the chosen
> solution? Why not leave the kernel out of it and have makedumpfile
> tooling aware of how to parse persistent memory namespace info-blocks
> and retrieve that dump itself? This is what I proposed here:
> 
> http://lore.kernel.org/r/641484f7ef780_a52e2940@dwillia2-mobl3.amr.corp.intel.com.notmuch

Sorry for the late reply. I'm just back from vacation.
And sorry again for missing your *important* earlier message on V1.

Your proposal sounds to me like it needs fewer kernel changes, but couples makedumpfile more tightly with ndctl.
In my current understanding, it will include the following source changes.

-----------+----------------------------------------------------------------------+
Source     |                      changes                                         |
-----------+----------------------------------------------------------------------+
I.         | 1. enter force_raw in the kdump kernel automatically                 |
kernel     |    (avoid metadata being updated again)                              |
           | 2. mark the whole pmem's PT_LOAD for kexec_file_load(2) syscall      |
-----------+----------------------------------------------------------------------+
II. kexec- | 1. mark the whole pmem's PT_LOAD for kexec_load(2) syscall           |
tools      |                                                                      |
-----------+----------------------------------------------------------------------+
III.       | 1. parse the infoblock and calculate the boundaries of userdata      |
makedump-  |    and metadata                                                      |
file       | 2. skip pmem userdata region                                         |
           | 3. exclude pmem metadata region if needed                            |
-----------+----------------------------------------------------------------------+

I will try to rewrite it following your proposal ASAP.

Thanks again

Thanks
Zhijian

> 
> ...but never got an answer, or I missed the answer.
Zhijian Li (Fujitsu) May 10, 2023, 10:41 a.m. UTC | #3
Hi Dan


on 5/8/2023 5:45 PM, Zhijian Li (Fujitsu) wrote:
> Dan,
>
>
> On 29/04/2023 02:59, Dan Williams wrote:
>> Li Zhijian wrote:
>>> Hello folks,
>>>
>>> About 2 months ago, we posted our first RFC[3] and received your kindly feedback. Thank you :)
>>> Now, I'm back with the code.
>>>
>>> Currently, this RFC has already implemented to supported case D*. And the case A&B is disabled
>>> deliberately in makedumpfile. It includes changes in 3 source code as below:
>> I think the reason this patchkit is difficult to follow is that it
>> spends a lot of time describing a chosen solution, but not enough time
>> describing the problem and the tradeoffs.
>>
>> For example why is updating /proc/vmcore with pmem metadata the chosen
>> solution? Why not leave the kernel out of it and have makedumpfile
>> tooling aware of how to parse persistent memory namespace info-blocks
>> and retrieve that dump itself? This is what I proposed here:
>>
>> http://lore.kernel.org/r/641484f7ef780_a52e2940@dwillia2-mobl3.amr.corp.intel.com.notmuch
> Sorry for the late reply. I'm just back from the vacation.
> And sorry again for missing your previous *important* information in V1.
>
> Your proposal also sounds to me with less kernel changes, but more ndctl coupling with makedumpfile tools.
> In my current understanding, it will includes following source changes.

The kernel and makedumpfile have been updated. It's still at an early stage, but to make sure I'm following your proposal,
I want to share the changes with you early. Alternatively, you can refer to my github for the full details.
https://github.com/zhijianli88/makedumpfile/commit/8ebfe38c015cfca0545cb3b1d7a6cc9a58fc9bb3

If I'm going the wrong way, feel free to let me know :)


>
> -----------+-------------------------------------------------------------------+
> Source     |                      changes                                      |
> -----------+-------------------------------------------------------------------+
> I.         | 1. enter force_raw in kdump kernel automatically(avoid metadata being updated again)|

The kernel should be adapted so that the pmem metadata will not be updated again in the kdump kernel:

diff --git a/drivers/nvdimm/namespace_devs.c b/drivers/nvdimm/namespace_devs.c
index c60ec0b373c5..2e59be8b9c78 100644
--- a/drivers/nvdimm/namespace_devs.c
+++ b/drivers/nvdimm/namespace_devs.c
@@ -8,6 +8,7 @@
  #include <linux/slab.h>
  #include <linux/list.h>
  #include <linux/nd.h>
+#include <linux/crash_dump.h>
  #include "nd-core.h"
  #include "pmem.h"
  #include "pfn.h"
@@ -1504,6 +1505,8 @@ struct nd_namespace_common *nvdimm_namespace_common_probe(struct device *dev)
                         return ERR_PTR(-ENODEV);
         }
  
+       if (is_kdump_kernel())
+               ndns->force_raw = true;
         return ndns;
  }
  EXPORT_SYMBOL(nvdimm_namespace_common_probe);

> kernel     |                                                                   |
>              | 2. mark the whole pmem's PT_LOAD for kexec_file_load(2) syscall   |
> -----------+-------------------------------------------------------------------+
> II. kexec- | 1. mark the whole pmem's PT_LOAD for kexe_load(2) syscall         |
> tool       |                                                                   |
> -----------+-------------------------------------------------------------------+
> III.       | 1. parse the infoblock and calculate the boundaries of userdata and metadata   |
> makedump-  | 2. skip pmem userdata region                                      |
> file       | 3. exclude pmem metadata region if needed                         |
> -----------+-------------------------------------------------------------------+
>
> I will try rewrite it with your proposal ASAP

inspect_pmem_namespace() walks the namespaces and reads each one's resource.start and infoblock. With this
information, we can easily calculate the boundaries of userdata and metadata. But currently these changes are
strongly coupled with ndctl/pmem, which looks a bit messy and ugly.
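
The boundary calculation described above can be sketched as follows. This is only an illustration, not the code in the patch: struct range and split_namespace() are made-up names, the inputs are assumed to have been read already, and end_trunc is ignored. It follows the comment on dataoff in struct pfn_sb ("relative to namespace_base + start_pad") and treats everything from that base up to dataoff as metadata, which includes the infoblock and any padding in front of the struct page array.

```c
#include <stdint.h>

struct range {
	uint64_t start; /* inclusive */
	uint64_t end;   /* exclusive */
};

/*
 * Split one namespace into a metadata range and a userdata range.
 * Returns 0 on success, -1 if the infoblock values do not fit the namespace.
 */
static int split_namespace(uint64_t ns_start, uint64_t ns_size,
			   uint32_t start_pad, uint64_t dataoff,
			   struct range *meta, struct range *user)
{
	uint64_t base;

	/* Reject values that would place the data start past the namespace end. */
	if (start_pad > ns_size || dataoff > ns_size - start_pad)
		return -1;

	base = ns_start + start_pad;
	meta->start = base;
	meta->end = base + dataoff;
	user->start = meta->end;
	user->end = ns_start + ns_size;
	return 0;
}
```

With the userdata range known, makedumpfile can skip it unconditionally and leave the metadata range subject to the usual exclude options.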

============makedumpfile=======

diff --git a/Makefile b/Makefile
index a289e41ef44d..4b4ded639cfd 100644
--- a/Makefile
+++ b/Makefile
@@ -50,7 +50,7 @@ OBJ_PART=$(patsubst %.c,%.o,$(SRC_PART))
  SRC_ARCH = arch/arm.c arch/arm64.c arch/x86.c arch/x86_64.c arch/ia64.c arch/ppc64.c arch/s390x.c arch/ppc.c arch/sparc64.c arch/mips64.c arch/loongarch64.c
  OBJ_ARCH=$(patsubst %.c,%.o,$(SRC_ARCH))
  
-LIBS = -ldw -lbz2 -ldl -lelf -lz
+LIBS = -ldw -lbz2 -ldl -lelf -lz -lndctl
  ifneq ($(LINKTYPE), dynamic)
  LIBS := -static $(LIBS) -llzma
  endif
diff --git a/makedumpfile.c b/makedumpfile.c
index 98c3b8c7ced9..db68d05a29f9 100644
--- a/makedumpfile.c
+++ b/makedumpfile.c
@@ -27,6 +27,8 @@
  #include <limits.h>
  #include <assert.h>
  #include <zlib.h>
+#include <sys/types.h>
+#include <ndctl/libndctl.h>

+
+#define INFOBLOCK_SZ (8192)
+#define SZ_4K (4096)
+#define PFN_SIG_LEN 16
+
+typedef uint64_t u64;
+typedef int64_t s64;
+typedef uint32_t u32;
+typedef int32_t s32;
+typedef uint16_t u16;
+typedef int16_t s16;
+typedef uint8_t u8;
+typedef int8_t s8;
+
+typedef int64_t le64;
+typedef int32_t le32;
+typedef int16_t le16;
+
+struct pfn_sb {
+       u8 signature[PFN_SIG_LEN];
+       u8 uuid[16];
+       u8 parent_uuid[16];
+       le32 flags;
+       le16 version_major;
+       le16 version_minor;
+       le64 dataoff; /* relative to namespace_base + start_pad */
+       le64 npfns;
+       le32 mode;
+       /* minor-version-1 additions for section alignment */
+       le32 start_pad;
+       le32 end_trunc;
+       /* minor-version-2 record the base alignment of the mapping */
+       le32 align;
+       /* minor-version-3 guarantee the padding and flags are zero */
+       /* minor-version-4 record the page size and struct page size */
+       le32 page_size;
+       le16 page_struct_size;
+       u8 padding[3994];
+       le64 checksum;
+};
+
+static int nd_read_infoblock_dataoff(struct ndctl_namespace *ndns)
+{
+       int fd, rc;
+       char path[50];
+       char buf[INFOBLOCK_SZ + 1];
+       struct pfn_sb *pfn_sb = (struct pfn_sb *)(buf + SZ_4K);
+
+       sprintf(path, "/dev/%s", ndctl_namespace_get_block_device(ndns));
+
+       fd = open(path, O_RDONLY|O_EXCL);
+       if (fd < 0)
+               return -1;
+
+
+       rc = read(fd, buf, INFOBLOCK_SZ);
+       if (rc < INFOBLOCK_SZ) {
+               return -1;
+       }
+
+       return pfn_sb->dataoff;
+}
+
+int inspect_pmem_namespace(void)
+{
+       struct ndctl_ctx *ctx;
+       struct ndctl_bus *bus;
+       int rc = -1;
+
+       fprintf(stderr, "\n\ninspect_pmem_namespace!!\n\n");
+       rc = ndctl_new(&ctx);
+       if (rc)
+               return -1;
+
+       ndctl_bus_foreach(ctx, bus) {
+               struct ndctl_region *region;
+
+               ndctl_region_foreach(bus, region) {
+                       struct ndctl_namespace *ndns;
+
+                       ndctl_namespace_foreach(region, ndns) {
+                               enum ndctl_namespace_mode mode;
+                               long long start, end_metadata;
+
+                               mode = ndctl_namespace_get_mode(ndns);
+                               /* kdump kernel should set force_raw, mode become *safe* */
+                               if (mode == NDCTL_NS_MODE_SAFE) {
+                                       fprintf(stderr, "Only raw can be dumpable\n");
+                                       continue;
+                               }
+
+                               start = ndctl_namespace_get_resource(ndns);
+                               end_metadata = nd_read_infoblock_dataoff(ndns);
+
+                               /* metadata really starts from 2M alignment */
+                               if (start != ULLONG_MAX && end_metadata > 2 * 1024 * 1024) // 2M
+                                       pmem_add_next(start, end_metadata);
+                       }
+               }
+       }
+
+       ndctl_unref(ctx);
+       return 0;
+}
+
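
For reference, a more defensive way to fetch dataoff might look like the sketch below. It is not the code from the commit above. It assumes, as the patch does, that the infoblock sits 4 KiB into the raw namespace, and it derives the position of dataoff (byte 56) from the struct pfn_sb layout quoted in the diff. The function names are made up. Compared with the draft, it returns the full 64-bit value through an out parameter, decodes it as little-endian regardless of host byte order, and closes the file descriptor on every path.

```c
#include <fcntl.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

#define INFOBLOCK_OFF 4096 /* struct pfn_sb starts 4 KiB into the namespace */
#define DATAOFF_OFF   56   /* 16 + 16 + 16 + 4 + 2 + 2 bytes precede dataoff */

/* Decode a little-endian 64-bit value independent of host byte order. */
static uint64_t le64_decode(const unsigned char *p)
{
	uint64_t v = 0;
	int i;

	for (i = 7; i >= 0; i--)
		v = (v << 8) | p[i];
	return v;
}

/* Returns 0 and fills *dataoff on success, -1 on any open or read error. */
static int read_infoblock_dataoff(const char *path, uint64_t *dataoff)
{
	unsigned char raw[8];
	ssize_t n;
	int fd;

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;

	n = pread(fd, raw, sizeof(raw), INFOBLOCK_OFF + DATAOFF_OFF);
	close(fd); /* released on success and failure alike */
	if (n != (ssize_t)sizeof(raw))
		return -1;

	*dataoff = le64_decode(raw);
	return 0;
}
```

A real implementation would probably also verify the infoblock signature and checksum before trusting dataoff; that is left out here to keep the sketch short.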

Thanks
Zhijian



>
> Thanks again
>
> Thanks
> Zhijian
>
>> ...but never got an answer, or I missed the answer.
Zhijian Li (Fujitsu) May 25, 2023, 5:36 a.m. UTC | #4
Ping

Baoquan, Dan

Sorry to bother you again.

Could you further comment a word or two on this set?


Thanks
Zhijian

