diff mbox series

[v1,07/11] fs/proc/vmcore: introduce PROC_VMCORE_DEVICE_RAM to detect device RAM ranges in 2nd kernel

Message ID 20241025151134.1275575-8-david@redhat.com (mailing list archive)
State New
Headers show
Series fs/proc/vmcore: kdump support for virtio-mem on s390 | expand

Commit Message

David Hildenbrand Oct. 25, 2024, 3:11 p.m. UTC
s390 allocates+prepares the elfcore hdr in the dump (2nd) kernel, not in
the crashed kernel.

RAM provided by memory devices such as virtio-mem can only be detected
using the device driver; when vmcore_init() is called, these device
drivers are usually not loaded yet, or the devices did not get probed
yet. Consequently, on s390 these RAM ranges will not be included in
the crash dump, which makes the dump partially corrupt and is
unfortunate.

Instead of deferring the vmcore_init() call, to an (unclear?) later point,
let's reuse the vmcore_cb infrastructure to obtain device RAM ranges as
the device drivers probe the device and get access to this information.

Then, we'll add these ranges to the vmcore, adding more PT_LOAD
entries and updating the offsets+vmcore size.

Use Kconfig tricks to include this code automatically only if (a) there is
a device driver compiled that implements the callback
(PROVIDE_PROC_VMCORE_DEVICE_RAM) and; (b) the architecture actually needs
this information (NEED_PROC_VMCORE_DEVICE_RAM).

The current target use case is s390, which only creates an elf64
elfcore, so focusing on elf64 is sufficient.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 fs/proc/Kconfig            |  25 ++++++
 fs/proc/vmcore.c           | 156 +++++++++++++++++++++++++++++++++++++
 include/linux/crash_dump.h |   9 +++
 3 files changed, 190 insertions(+)

Comments

Baoquan He Nov. 20, 2024, 10:13 a.m. UTC | #1
On 10/25/24 at 05:11pm, David Hildenbrand wrote:
> s390 allocates+prepares the elfcore hdr in the dump (2nd) kernel, not in
> the crashed kernel.
> 
> RAM provided by memory devices such as virtio-mem can only be detected
> using the device driver; when vmcore_init() is called, these device
> drivers are usually not loaded yet, or the devices did not get probed
> yet. Consequently, on s390 these RAM ranges will not be included in
> the crash dump, which makes the dump partially corrupt and is
> unfortunate.
> 
> Instead of deferring the vmcore_init() call, to an (unclear?) later point,
> let's reuse the vmcore_cb infrastructure to obtain device RAM ranges as
> the device drivers probe the device and get access to this information.
> 
> Then, we'll add these ranges to the vmcore, adding more PT_LOAD
> entries and updating the offsets+vmcore size.
> 
> Use Kconfig tricks to include this code automatically only if (a) there is
> a device driver compiled that implements the callback
> (PROVIDE_PROC_VMCORE_DEVICE_RAM) and; (b) the architecture actually needs
> this information (NEED_PROC_VMCORE_DEVICE_RAM).
> 
> The current target use case is s390, which only creates an elf64
> elfcore, so focusing on elf64 is sufficient.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  fs/proc/Kconfig            |  25 ++++++
>  fs/proc/vmcore.c           | 156 +++++++++++++++++++++++++++++++++++++
>  include/linux/crash_dump.h |   9 +++
>  3 files changed, 190 insertions(+)
> 
> diff --git a/fs/proc/Kconfig b/fs/proc/Kconfig
> index d80a1431ef7b..1e11de5f9380 100644
> --- a/fs/proc/Kconfig
> +++ b/fs/proc/Kconfig
> @@ -61,6 +61,31 @@ config PROC_VMCORE_DEVICE_DUMP
>  	  as ELF notes to /proc/vmcore. You can still disable device
>  	  dump using the kernel command line option 'novmcoredd'.
>  
> +config PROVIDE_PROC_VMCORE_DEVICE_RAM
> +	def_bool n
> +
> +config NEED_PROC_VMCORE_DEVICE_RAM
> +	def_bool n
> +
> +config PROC_VMCORE_DEVICE_RAM
> +	def_bool y
> +	depends on PROC_VMCORE
> +	depends on NEED_PROC_VMCORE_DEVICE_RAM
> +	depends on PROVIDE_PROC_VMCORE_DEVICE_RAM

Kconfig item is always a thing I need learn to master. When I checked
this part, I have to write them down to deliberate. I am wondering if 
below 'simple version' works too and more understandable. Please help
point out what I have missed.

===========simple version======
config PROC_VMCORE_DEVICE_RAM
        def_bool y
        depends on PROC_VMCORE && VIRTIO_MEM
        depends on NEED_PROC_VMCORE_DEVICE_RAM

config S390
        select NEED_PROC_VMCORE_DEVICE_RAM
============


======= config items extracted from this patchset====
config PROVIDE_PROC_VMCORE_DEVICE_RAM
        def_bool n

config NEED_PROC_VMCORE_DEVICE_RAM
        def_bool n

config PROC_VMCORE_DEVICE_RAM
        def_bool y
        depends on PROC_VMCORE
        depends on NEED_PROC_VMCORE_DEVICE_RAM
        depends on PROVIDE_PROC_VMCORE_DEVICE_RAM

config VIRTIO_MEM
	depends on X86_64 || ARM64 || RISCV
         ~~~~~ I don't get why VIRTIO_MEM dones't depend on S390 if
               s390 need PROC_VMCORE_DEVICE_RAM. 
        ......
        select PROVIDE_PROC_VMCORE_DEVICE_RAM if PROC_VMCORE

config S390
        select NEED_PROC_VMCORE_DEVICE_RAM if PROC_VMCORE
=================================================
David Hildenbrand Nov. 20, 2024, 10:48 a.m. UTC | #2
On 20.11.24 11:13, Baoquan He wrote:
> On 10/25/24 at 05:11pm, David Hildenbrand wrote:
>> s390 allocates+prepares the elfcore hdr in the dump (2nd) kernel, not in
>> the crashed kernel.
>>
>> RAM provided by memory devices such as virtio-mem can only be detected
>> using the device driver; when vmcore_init() is called, these device
>> drivers are usually not loaded yet, or the devices did not get probed
>> yet. Consequently, on s390 these RAM ranges will not be included in
>> the crash dump, which makes the dump partially corrupt and is
>> unfortunate.
>>
>> Instead of deferring the vmcore_init() call, to an (unclear?) later point,
>> let's reuse the vmcore_cb infrastructure to obtain device RAM ranges as
>> the device drivers probe the device and get access to this information.
>>
>> Then, we'll add these ranges to the vmcore, adding more PT_LOAD
>> entries and updating the offsets+vmcore size.
>>
>> Use Kconfig tricks to include this code automatically only if (a) there is
>> a device driver compiled that implements the callback
>> (PROVIDE_PROC_VMCORE_DEVICE_RAM) and; (b) the architecture actually needs
>> this information (NEED_PROC_VMCORE_DEVICE_RAM).
>>
>> The current target use case is s390, which only creates an elf64
>> elfcore, so focusing on elf64 is sufficient.
>>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> ---
>>   fs/proc/Kconfig            |  25 ++++++
>>   fs/proc/vmcore.c           | 156 +++++++++++++++++++++++++++++++++++++
>>   include/linux/crash_dump.h |   9 +++
>>   3 files changed, 190 insertions(+)
>>
>> diff --git a/fs/proc/Kconfig b/fs/proc/Kconfig
>> index d80a1431ef7b..1e11de5f9380 100644
>> --- a/fs/proc/Kconfig
>> +++ b/fs/proc/Kconfig
>> @@ -61,6 +61,31 @@ config PROC_VMCORE_DEVICE_DUMP
>>   	  as ELF notes to /proc/vmcore. You can still disable device
>>   	  dump using the kernel command line option 'novmcoredd'.
>>   
>> +config PROVIDE_PROC_VMCORE_DEVICE_RAM
>> +	def_bool n
>> +
>> +config NEED_PROC_VMCORE_DEVICE_RAM
>> +	def_bool n
>> +
>> +config PROC_VMCORE_DEVICE_RAM
>> +	def_bool y
>> +	depends on PROC_VMCORE
>> +	depends on NEED_PROC_VMCORE_DEVICE_RAM
>> +	depends on PROVIDE_PROC_VMCORE_DEVICE_RAM
> 
> Kconfig item is always a thing I need learn to master.

Yes, it's usually a struggle to get it right. It took me a couple of 
iterations to get to this point :)

> When I checked
> this part, I have to write them down to deliberate. I am wondering if
> below 'simple version' works too and more understandable. Please help
> point out what I have missed.
> 
> ===========simple version======
> config PROC_VMCORE_DEVICE_RAM
>          def_bool y
>          depends on PROC_VMCORE && VIRTIO_MEM
>          depends on NEED_PROC_VMCORE_DEVICE_RAM
> 
> config S390
>          select NEED_PROC_VMCORE_DEVICE_RAM
 > ============

So the three changes you did are

(a) Remove the config option but select/depend on them.

(b) Remove the "depends on PROC_VMCORE" from PROC_VMCORE_DEVICE_RAM,
     and the "if PROC_VMCORE" from s390.

(c) Remove the PROVIDE_PROC_VMCORE_DEVICE_RAM


Regarding (a), that doesn't work. If you select a config option that 
doesn't exist, it is silently dropped. It's always treated as if it 
wouldn't be set.

Regarding (b), I think that's an anti-pattern (having config options 
enabled that are completely ineffective) and I don't see a benefit 
dropping them.

Regarding (c), it would mean that s390x unconditionally includes that 
code even if virtio-mem is not configured in.

So while we could drop PROVIDE_PROC_VMCORE_DEVICE_RAM -- (c), it would 
that we end up including code in configurations that don't possibly need 
it. That's why I included that part.

> 
> 
> ======= config items extracted from this patchset====
> config PROVIDE_PROC_VMCORE_DEVICE_RAM
>          def_bool n
> 
> config NEED_PROC_VMCORE_DEVICE_RAM
>          def_bool n
> 
> config PROC_VMCORE_DEVICE_RAM
>          def_bool y
>          depends on PROC_VMCORE
>          depends on NEED_PROC_VMCORE_DEVICE_RAM
>          depends on PROVIDE_PROC_VMCORE_DEVICE_RAM
> 
> config VIRTIO_MEM
> 	depends on X86_64 || ARM64 || RISCV
>           ~~~~~ I don't get why VIRTIO_MEM dones't depend on S390 if
>                 s390 need PROC_VMCORE_DEVICE_RAM.

This series depends on s390 support for virtio-mem, which just went 
upstream.

See

commit 38968bcdcc1d46f2fdcd3a72599d5193bf8baf84
Author: David Hildenbrand <david@redhat.com>
Date:   Fri Oct 25 16:14:49 2024 +0200

     virtio-mem: s390 support


>          ......
>          select PROVIDE_PROC_VMCORE_DEVICE_RAM if PROC_VMCORE
> 
> config S390
>          select NEED_PROC_VMCORE_DEVICE_RAM if PROC_VMCORE
> =================================================
> 

Thanks for having a look!
Baoquan He Nov. 20, 2024, 2:05 p.m. UTC | #3
On 11/20/24 at 11:48am, David Hildenbrand wrote:
> On 20.11.24 11:13, Baoquan He wrote:
> > On 10/25/24 at 05:11pm, David Hildenbrand wrote:
> > > s390 allocates+prepares the elfcore hdr in the dump (2nd) kernel, not in
> > > the crashed kernel.
> > > 
> > > RAM provided by memory devices such as virtio-mem can only be detected
> > > using the device driver; when vmcore_init() is called, these device
> > > drivers are usually not loaded yet, or the devices did not get probed
> > > yet. Consequently, on s390 these RAM ranges will not be included in
> > > the crash dump, which makes the dump partially corrupt and is
> > > unfortunate.
> > > 
> > > Instead of deferring the vmcore_init() call, to an (unclear?) later point,
> > > let's reuse the vmcore_cb infrastructure to obtain device RAM ranges as
> > > the device drivers probe the device and get access to this information.
> > > 
> > > Then, we'll add these ranges to the vmcore, adding more PT_LOAD
> > > entries and updating the offsets+vmcore size.
> > > 
> > > Use Kconfig tricks to include this code automatically only if (a) there is
> > > a device driver compiled that implements the callback
> > > (PROVIDE_PROC_VMCORE_DEVICE_RAM) and; (b) the architecture actually needs
> > > this information (NEED_PROC_VMCORE_DEVICE_RAM).
> > > 
> > > The current target use case is s390, which only creates an elf64
> > > elfcore, so focusing on elf64 is sufficient.
> > > 
> > > Signed-off-by: David Hildenbrand <david@redhat.com>
> > > ---
> > >   fs/proc/Kconfig            |  25 ++++++
> > >   fs/proc/vmcore.c           | 156 +++++++++++++++++++++++++++++++++++++
> > >   include/linux/crash_dump.h |   9 +++
> > >   3 files changed, 190 insertions(+)
> > > 
> > > diff --git a/fs/proc/Kconfig b/fs/proc/Kconfig
> > > index d80a1431ef7b..1e11de5f9380 100644
> > > --- a/fs/proc/Kconfig
> > > +++ b/fs/proc/Kconfig
> > > @@ -61,6 +61,31 @@ config PROC_VMCORE_DEVICE_DUMP
> > >   	  as ELF notes to /proc/vmcore. You can still disable device
> > >   	  dump using the kernel command line option 'novmcoredd'.
> > > +config PROVIDE_PROC_VMCORE_DEVICE_RAM
> > > +	def_bool n
> > > +
> > > +config NEED_PROC_VMCORE_DEVICE_RAM
> > > +	def_bool n
> > > +
> > > +config PROC_VMCORE_DEVICE_RAM
> > > +	def_bool y
> > > +	depends on PROC_VMCORE
> > > +	depends on NEED_PROC_VMCORE_DEVICE_RAM
> > > +	depends on PROVIDE_PROC_VMCORE_DEVICE_RAM
> > 
> > Kconfig item is always a thing I need learn to master.
> 
> Yes, it's usually a struggle to get it right. It took me a couple of
> iterations to get to this point :)
> 
> > When I checked
> > this part, I have to write them down to deliberate. I am wondering if
> > below 'simple version' works too and more understandable. Please help
> > point out what I have missed.
> > 
> > ===========simple version======
> > config PROC_VMCORE_DEVICE_RAM
> >          def_bool y
> >          depends on PROC_VMCORE && VIRTIO_MEM
> >          depends on NEED_PROC_VMCORE_DEVICE_RAM
> > 
> > config S390
> >          select NEED_PROC_VMCORE_DEVICE_RAM
> > ============

Sorry, things written down didn't correctly reflect them in my mind. 

===========simple version======
fs/proc/Kconfig:
config PROC_VMCORE_DEVICE_RAM
        def_bool y
        depends on PROC_VMCORE && VIRTIO_MEM
        depends on NEED_PROC_VMCORE_DEVICE_RAM

arch/s390/Kconfig:
config NEED_PROC_VMCORE_DEVICE_RAM
        def y
==================================


> 
> So the three changes you did are
> 
> (a) Remove the config option but select/depend on them.
> 
> (b) Remove the "depends on PROC_VMCORE" from PROC_VMCORE_DEVICE_RAM,
>     and the "if PROC_VMCORE" from s390.
> 
> (c) Remove the PROVIDE_PROC_VMCORE_DEVICE_RAM
> 
> 
> Regarding (a), that doesn't work. If you select a config option that doesn't
> exist, it is silently dropped. It's always treated as if it wouldn't be set.
> 
> Regarding (b), I think that's an anti-pattern (having config options enabled
> that are completely ineffective) and I don't see a benefit dropping them.
> 
> Regarding (c), it would mean that s390x unconditionally includes that code
> even if virtio-mem is not configured in.
> 
> So while we could drop PROVIDE_PROC_VMCORE_DEVICE_RAM -- (c), it would that
> we end up including code in configurations that don't possibly need it.
> That's why I included that part.
> 
> > 
> > 
> > ======= config items extracted from this patchset====
> > config PROVIDE_PROC_VMCORE_DEVICE_RAM
> >          def_bool n
> > 
> > config NEED_PROC_VMCORE_DEVICE_RAM
> >          def_bool n
> > 
> > config PROC_VMCORE_DEVICE_RAM
> >          def_bool y
> >          depends on PROC_VMCORE
> >          depends on NEED_PROC_VMCORE_DEVICE_RAM
> >          depends on PROVIDE_PROC_VMCORE_DEVICE_RAM
> > 
> > config VIRTIO_MEM
> > 	depends on X86_64 || ARM64 || RISCV
> >           ~~~~~ I don't get why VIRTIO_MEM dones't depend on S390 if
> >                 s390 need PROC_VMCORE_DEVICE_RAM.
> 
> This series depends on s390 support for virtio-mem, which just went
> upstream.

Got It, I just applied this series on top of the latest mainline's
master branch. Thanks for telling.

> 
> 
> commit 38968bcdcc1d46f2fdcd3a72599d5193bf8baf84
> Author: David Hildenbrand <david@redhat.com>
> Date:   Fri Oct 25 16:14:49 2024 +0200
> 
>     virtio-mem: s390 support
> 
> 
> >          ......
> >          select PROVIDE_PROC_VMCORE_DEVICE_RAM if PROC_VMCORE
> > 
> > config S390
> >          select NEED_PROC_VMCORE_DEVICE_RAM if PROC_VMCORE
> > =================================================
> > 
> 
> Thanks for having a look!
> 
> -- 
> Cheers,
> 
> David / dhildenb
>
David Hildenbrand Nov. 20, 2024, 2:39 p.m. UTC | #4
On 20.11.24 15:05, Baoquan He wrote:
> On 11/20/24 at 11:48am, David Hildenbrand wrote:
>> On 20.11.24 11:13, Baoquan He wrote:
>>> On 10/25/24 at 05:11pm, David Hildenbrand wrote:
>>>> s390 allocates+prepares the elfcore hdr in the dump (2nd) kernel, not in
>>>> the crashed kernel.
>>>>
>>>> RAM provided by memory devices such as virtio-mem can only be detected
>>>> using the device driver; when vmcore_init() is called, these device
>>>> drivers are usually not loaded yet, or the devices did not get probed
>>>> yet. Consequently, on s390 these RAM ranges will not be included in
>>>> the crash dump, which makes the dump partially corrupt and is
>>>> unfortunate.
>>>>
>>>> Instead of deferring the vmcore_init() call, to an (unclear?) later point,
>>>> let's reuse the vmcore_cb infrastructure to obtain device RAM ranges as
>>>> the device drivers probe the device and get access to this information.
>>>>
>>>> Then, we'll add these ranges to the vmcore, adding more PT_LOAD
>>>> entries and updating the offsets+vmcore size.
>>>>
>>>> Use Kconfig tricks to include this code automatically only if (a) there is
>>>> a device driver compiled that implements the callback
>>>> (PROVIDE_PROC_VMCORE_DEVICE_RAM) and; (b) the architecture actually needs
>>>> this information (NEED_PROC_VMCORE_DEVICE_RAM).
>>>>
>>>> The current target use case is s390, which only creates an elf64
>>>> elfcore, so focusing on elf64 is sufficient.
>>>>
>>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>>> ---
>>>>    fs/proc/Kconfig            |  25 ++++++
>>>>    fs/proc/vmcore.c           | 156 +++++++++++++++++++++++++++++++++++++
>>>>    include/linux/crash_dump.h |   9 +++
>>>>    3 files changed, 190 insertions(+)
>>>>
>>>> diff --git a/fs/proc/Kconfig b/fs/proc/Kconfig
>>>> index d80a1431ef7b..1e11de5f9380 100644
>>>> --- a/fs/proc/Kconfig
>>>> +++ b/fs/proc/Kconfig
>>>> @@ -61,6 +61,31 @@ config PROC_VMCORE_DEVICE_DUMP
>>>>    	  as ELF notes to /proc/vmcore. You can still disable device
>>>>    	  dump using the kernel command line option 'novmcoredd'.
>>>> +config PROVIDE_PROC_VMCORE_DEVICE_RAM
>>>> +	def_bool n
>>>> +
>>>> +config NEED_PROC_VMCORE_DEVICE_RAM
>>>> +	def_bool n
>>>> +
>>>> +config PROC_VMCORE_DEVICE_RAM
>>>> +	def_bool y
>>>> +	depends on PROC_VMCORE
>>>> +	depends on NEED_PROC_VMCORE_DEVICE_RAM
>>>> +	depends on PROVIDE_PROC_VMCORE_DEVICE_RAM
>>>
>>> Kconfig item is always a thing I need learn to master.
>>
>> Yes, it's usually a struggle to get it right. It took me a couple of
>> iterations to get to this point :)
>>
>>> When I checked
>>> this part, I have to write them down to deliberate. I am wondering if
>>> below 'simple version' works too and more understandable. Please help
>>> point out what I have missed.
>>>
>>> ===========simple version======
>>> config PROC_VMCORE_DEVICE_RAM
>>>           def_bool y
>>>           depends on PROC_VMCORE && VIRTIO_MEM
>>>           depends on NEED_PROC_VMCORE_DEVICE_RAM
>>>
>>> config S390
>>>           select NEED_PROC_VMCORE_DEVICE_RAM
>>> ============
> 
> Sorry, things written down didn't correctly reflect them in my mind.
> 
> ===========simple version======
> fs/proc/Kconfig:
> config PROC_VMCORE_DEVICE_RAM
>          def_bool y
>          depends on PROC_VMCORE && VIRTIO_MEM
>          depends on NEED_PROC_VMCORE_DEVICE_RAM
> 
> arch/s390/Kconfig:
> config NEED_PROC_VMCORE_DEVICE_RAM
>          def y
> ==================================

That would work, but I don't completely like it.

(a) I want s390x to select NEED_PROC_VMCORE_DEVICE_RAM instead. Staring 
at a bunch of similar cases (git grep "config NEED" | grep Kconfig, git 
grep "config ARCH_WANTS" | grep Kconfig), "select" is the common way to 
do it.

So unless there is a pretty good reason, I'll keep 
NEED_PROC_VMCORE_DEVICE_RAM as is.


(b) In the context of this patch, "depends on VIRTIO_MEM" does not make 
sense. We could have an intermediate:

config PROC_VMCORE_DEVICE_RAM
          def_bool n
          depends on PROC_VMCORE
          depends on NEED_PROC_VMCORE_DEVICE_RAM

And change that with VIRTIO_MEM support in the relevant patch.


I faintly remember that we try avoiding such dependencies and prefer 
selecting Kconfigs instead. Just look at the SPLIT_PTE_PTLOCKS mess we 
still have to clean up. But as we don't expect that many providers for 
now, I don't care.

Thanks!
Baoquan He Nov. 21, 2024, 4:30 a.m. UTC | #5
On 11/20/24 at 03:39pm, David Hildenbrand wrote:
> On 20.11.24 15:05, Baoquan He wrote:
> > On 11/20/24 at 11:48am, David Hildenbrand wrote:
> > > On 20.11.24 11:13, Baoquan He wrote:
> > > > On 10/25/24 at 05:11pm, David Hildenbrand wrote:
> > > > > s390 allocates+prepares the elfcore hdr in the dump (2nd) kernel, not in
> > > > > the crashed kernel.
> > > > > 
> > > > > RAM provided by memory devices such as virtio-mem can only be detected
> > > > > using the device driver; when vmcore_init() is called, these device
> > > > > drivers are usually not loaded yet, or the devices did not get probed
> > > > > yet. Consequently, on s390 these RAM ranges will not be included in
> > > > > the crash dump, which makes the dump partially corrupt and is
> > > > > unfortunate.
> > > > > 
> > > > > Instead of deferring the vmcore_init() call, to an (unclear?) later point,
> > > > > let's reuse the vmcore_cb infrastructure to obtain device RAM ranges as
> > > > > the device drivers probe the device and get access to this information.
> > > > > 
> > > > > Then, we'll add these ranges to the vmcore, adding more PT_LOAD
> > > > > entries and updating the offsets+vmcore size.
> > > > > 
> > > > > Use Kconfig tricks to include this code automatically only if (a) there is
> > > > > a device driver compiled that implements the callback
> > > > > (PROVIDE_PROC_VMCORE_DEVICE_RAM) and; (b) the architecture actually needs
> > > > > this information (NEED_PROC_VMCORE_DEVICE_RAM).
> > > > > 
> > > > > The current target use case is s390, which only creates an elf64
> > > > > elfcore, so focusing on elf64 is sufficient.
> > > > > 
> > > > > Signed-off-by: David Hildenbrand <david@redhat.com>
> > > > > ---
> > > > >    fs/proc/Kconfig            |  25 ++++++
> > > > >    fs/proc/vmcore.c           | 156 +++++++++++++++++++++++++++++++++++++
> > > > >    include/linux/crash_dump.h |   9 +++
> > > > >    3 files changed, 190 insertions(+)
> > > > > 
> > > > > diff --git a/fs/proc/Kconfig b/fs/proc/Kconfig
> > > > > index d80a1431ef7b..1e11de5f9380 100644
> > > > > --- a/fs/proc/Kconfig
> > > > > +++ b/fs/proc/Kconfig
> > > > > @@ -61,6 +61,31 @@ config PROC_VMCORE_DEVICE_DUMP
> > > > >    	  as ELF notes to /proc/vmcore. You can still disable device
> > > > >    	  dump using the kernel command line option 'novmcoredd'.
> > > > > +config PROVIDE_PROC_VMCORE_DEVICE_RAM
> > > > > +	def_bool n
> > > > > +
> > > > > +config NEED_PROC_VMCORE_DEVICE_RAM
> > > > > +	def_bool n
> > > > > +
> > > > > +config PROC_VMCORE_DEVICE_RAM
> > > > > +	def_bool y
> > > > > +	depends on PROC_VMCORE
> > > > > +	depends on NEED_PROC_VMCORE_DEVICE_RAM
> > > > > +	depends on PROVIDE_PROC_VMCORE_DEVICE_RAM
> > > > 
> > > > Kconfig item is always a thing I need learn to master.
> > > 
> > > Yes, it's usually a struggle to get it right. It took me a couple of
> > > iterations to get to this point :)
> > > 
> > > > When I checked
> > > > this part, I have to write them down to deliberate. I am wondering if
> > > > below 'simple version' works too and more understandable. Please help
> > > > point out what I have missed.
> > > > 
> > > > ===========simple version======
> > > > config PROC_VMCORE_DEVICE_RAM
> > > >           def_bool y
> > > >           depends on PROC_VMCORE && VIRTIO_MEM
> > > >           depends on NEED_PROC_VMCORE_DEVICE_RAM
> > > > 
> > > > config S390
> > > >           select NEED_PROC_VMCORE_DEVICE_RAM
> > > > ============
> > 
> > Sorry, things written down didn't correctly reflect them in my mind.
> > 
> > ===========simple version======
> > fs/proc/Kconfig:
> > config PROC_VMCORE_DEVICE_RAM
> >          def_bool y
> >          depends on PROC_VMCORE && VIRTIO_MEM
> >          depends on NEED_PROC_VMCORE_DEVICE_RAM
> > config NEED_PROC_VMCORE_DEVICE_RAM
> >          def y
> > 
> > arch/s390/Kconfig:
> > config NEED_PROC_VMCORE_DEVICE_RAM
> >          def y
> > ==================================
> 
> That would work, but I don't completely like it.
> 
> (a) I want s390x to select NEED_PROC_VMCORE_DEVICE_RAM instead. Staring at a
> bunch of similar cases (git grep "config NEED" | grep Kconfig, git grep
> "config ARCH_WANTS" | grep Kconfig), "select" is the common way to do it.
> 
> So unless there is a pretty good reason, I'll keep
> NEED_PROC_VMCORE_DEVICE_RAM as is.

That's easy to satify, see below:

============simple version=====
fs/proc/Kconfig:
config NEED_PROC_VMCORE_DEVICE_RAM
        def n

config PROC_VMCORE_DEVICE_RAM
        def_bool y
        depends on PROC_VMCORE && VIRTIO_MEM
        depends on NEED_PROC_VMCORE_DEVICE_RAM

arch/s390/Kconfig:
config S390
        select NEED_PROC_VMCORE_DEVICE_RAM
==============================

> 
> (b) In the context of this patch, "depends on VIRTIO_MEM" does not make
> sense. We could have an intermediate:
> 
> config PROC_VMCORE_DEVICE_RAM
>          def_bool n
>          depends on PROC_VMCORE
>          depends on NEED_PROC_VMCORE_DEVICE_RAM
> 
> And change that with VIRTIO_MEM support in the relevant patch.

Oh, it's not comment for this patch, I made the simple version based on
the whole patchset. When I had a glance at this patch, I also took
several iterations to get it after I applied the whole patchset and
tried to understand the whole code.

> 
> 
> I faintly remember that we try avoiding such dependencies and prefer
> selecting Kconfigs instead. Just look at the SPLIT_PTE_PTLOCKS mess we still
> have to clean up. But as we don't expect that many providers for now, I
> don't care.

With the simple version, Kconfig learner as me can easily understand what
they are doing. If it took you a couple of iterations to make them as
you had mentioned earlier, and it took me several iterations to
understand them, I believe there must be room to improve the presented
ones in this patchset. These are only my humble opinion, and I am not
aware of virtio-mem at all, I'll leave this to you and other virtio-mem
dev to decide what should be taken. Thanks for your patience and
provided information, I learned a lot from this discussion.

===================
fs/proc/Kconfig:
config PROVIDE_PROC_VMCORE_DEVICE_RAM
        def_bool n

config NEED_PROC_VMCORE_DEVICE_RAM
        def_bool n

config PROC_VMCORE_DEVICE_RAM
        def_bool y
        depends on PROC_VMCORE
        depends on NEED_PROC_VMCORE_DEVICE_RAM
        depends on PROVIDE_PROC_VMCORE_DEVICE_RAM

drivers/virtio/Kconfig:
config VIRTIO_MEM
        select PROVIDE_PROC_VMCORE_DEVICE_RAM if PROC_VMCORE
                                              ~~~~~~~~~~~~~~

arch/s390/Kconfig:
config S390
        select NEED_PROC_VMCORE_DEVICE_RAM if PROC_VMCORE
                                           ~~~~~~~~~~~~~~
========================

One last thing I haven't got well, If PROC_VMCORE_DEVICE_RAM has had
dependency on PROC_VMCORE, can we take off the ' if PROC_VMCORE' when
select PROVIDE_PROC_VMCORE_DEVICE_RAM and NEED_PROC_VMCORE_DEVICE_RAM?

Thanks
Baoquan
diff mbox series

Patch

diff --git a/fs/proc/Kconfig b/fs/proc/Kconfig
index d80a1431ef7b..1e11de5f9380 100644
--- a/fs/proc/Kconfig
+++ b/fs/proc/Kconfig
@@ -61,6 +61,31 @@  config PROC_VMCORE_DEVICE_DUMP
 	  as ELF notes to /proc/vmcore. You can still disable device
 	  dump using the kernel command line option 'novmcoredd'.
 
+config PROVIDE_PROC_VMCORE_DEVICE_RAM
+	def_bool n
+
+config NEED_PROC_VMCORE_DEVICE_RAM
+	def_bool n
+
+config PROC_VMCORE_DEVICE_RAM
+	def_bool y
+	depends on PROC_VMCORE
+	depends on NEED_PROC_VMCORE_DEVICE_RAM
+	depends on PROVIDE_PROC_VMCORE_DEVICE_RAM
+	help
+	  If the elfcore hdr is allocated and prepared by the dump kernel
+	  ("2nd kernel") instead of the crashed kernel, RAM provided by memory
+	  devices such as virtio-mem will not be included in the dump
+	  image, because only the device driver can properly detect them.
+
+	  With this config enabled, these RAM ranges will be queried from the
+	  device drivers once the device gets probed, so they can be included
+	  in the crash dump.
+
+	  Relevant architectures should select NEED_PROC_VMCORE_DEVICE_RAM
+	  and relevant device drivers should select
+	  PROVIDE_PROC_VMCORE_DEVICE_RAM.
+
 config PROC_SYSCTL
 	bool "Sysctl support (/proc/sys)" if EXPERT
 	depends on PROC_FS
diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
index 3e90416ee54e..c332a9a4920b 100644
--- a/fs/proc/vmcore.c
+++ b/fs/proc/vmcore.c
@@ -69,6 +69,8 @@  static LIST_HEAD(vmcore_cb_list);
 /* Whether the vmcore has been opened once. */
 static bool vmcore_opened;
 
+static void vmcore_process_device_ram(struct vmcore_cb *cb);
+
 void register_vmcore_cb(struct vmcore_cb *cb)
 {
 	INIT_LIST_HEAD(&cb->next);
@@ -80,6 +82,8 @@  void register_vmcore_cb(struct vmcore_cb *cb)
 	 */
 	if (vmcore_opened)
 		pr_warn_once("Unexpected vmcore callback registration\n");
+	else if (cb->get_device_ram)
+		vmcore_process_device_ram(cb);
 	mutex_unlock(&vmcore_mutex);
 }
 EXPORT_SYMBOL_GPL(register_vmcore_cb);
@@ -1511,6 +1515,158 @@  int vmcore_add_device_dump(struct vmcoredd_data *data)
 EXPORT_SYMBOL(vmcore_add_device_dump);
 #endif /* CONFIG_PROC_VMCORE_DEVICE_DUMP */
 
+#ifdef CONFIG_PROC_VMCORE_DEVICE_RAM
+static int vmcore_realloc_elfcore_buffer_elf64(size_t new_size)
+{
+	char *elfcorebuf_new;
+
+	if (WARN_ON_ONCE(new_size < elfcorebuf_sz))
+		return -EINVAL;
+	if (get_order(elfcorebuf_sz_orig) == get_order(new_size)) {
+		elfcorebuf_sz_orig = new_size;
+		return 0;
+	}
+
+	elfcorebuf_new = (void *)__get_free_pages(GFP_KERNEL | __GFP_ZERO,
+						  get_order(new_size));
+	if (!elfcorebuf_new)
+		return -ENOMEM;
+	memcpy(elfcorebuf_new, elfcorebuf, elfcorebuf_sz);
+	free_pages((unsigned long)elfcorebuf, get_order(elfcorebuf_sz_orig));
+	elfcorebuf = elfcorebuf_new;
+	elfcorebuf_sz_orig = new_size;
+	return 0;
+}
+
+static void vmcore_reset_offsets_elf64(void)
+{
+	Elf64_Phdr *phdr_start = (Elf64_Phdr *)(elfcorebuf + sizeof(Elf64_Ehdr));
+	loff_t vmcore_off = elfcorebuf_sz + elfnotes_sz;
+	Elf64_Ehdr *ehdr = (Elf64_Ehdr *)elfcorebuf;
+	Elf64_Phdr *phdr;
+	int i;
+
+	for (i = 0, phdr = phdr_start; i < ehdr->e_phnum; i++, phdr++) {
+		u64 start, end;
+
+		/*
+		 * After merge_note_headers_elf64() we should only have a single
+		 * PT_NOTE entry that starts immediately after elfcorebuf_sz.
+		 */
+		if (phdr->p_type == PT_NOTE) {
+			phdr->p_offset = elfcorebuf_sz;
+			continue;
+		}
+
+		start = rounddown(phdr->p_offset, PAGE_SIZE);
+		end = roundup(phdr->p_offset + phdr->p_memsz, PAGE_SIZE);
+		phdr->p_offset = vmcore_off + (phdr->p_offset - start);
+		vmcore_off = vmcore_off + end - start;
+	}
+	set_vmcore_list_offsets(elfcorebuf_sz, elfnotes_sz, &vmcore_list);
+}
+
+static int vmcore_add_device_ram_elf64(struct list_head *list, size_t count)
+{
+	Elf64_Phdr *phdr_start = (Elf64_Phdr *)(elfcorebuf + sizeof(Elf64_Ehdr));
+	Elf64_Ehdr *ehdr = (Elf64_Ehdr *)elfcorebuf;
+	struct vmcore_mem_node *cur;
+	Elf64_Phdr *phdr;
+	size_t new_size;
+	int rc;
+
+	if ((Elf32_Half)(ehdr->e_phnum + count) != ehdr->e_phnum + count) {
+		pr_err("Kdump: too many device ram ranges\n");
+		return -ENOSPC;
+	}
+
+	/* elfcorebuf_sz must always cover full pages. */
+	new_size = sizeof(Elf64_Ehdr) +
+		   (ehdr->e_phnum + count) * sizeof(Elf64_Phdr);
+	new_size = roundup(new_size, PAGE_SIZE);
+
+	/*
+	 * Make sure we have sufficient space to include the new PT_LOAD
+	 * entries.
+	 */
+	rc = vmcore_realloc_elfcore_buffer_elf64(new_size);
+	if (rc) {
+		pr_err("Kdump: resizing elfcore failed\n");
+		return rc;
+	}
+
+	/* Modify our used elfcore buffer size to cover the new entries. */
+	elfcorebuf_sz = new_size;
+
+	/* Fill the added PT_LOAD entries. */
+	phdr = phdr_start + ehdr->e_phnum;
+	list_for_each_entry(cur, list, list) {
+		WARN_ON_ONCE(!IS_ALIGNED(cur->paddr | cur->size, PAGE_SIZE));
+		elfcorehdr_fill_device_ram_ptload_elf64(phdr, cur->paddr, cur->size);
+
+		/* p_offset will be adjusted later. */
+		phdr++;
+		ehdr->e_phnum++;
+	}
+	list_splice_tail(list, &vmcore_list);
+
+	/* We changed elfcorebuf_sz and added new entries; reset all offsets. */
+	vmcore_reset_offsets_elf64();
+
+	/* Finally, recalculated the total vmcore size. */
+	vmcore_size = get_vmcore_size(elfcorebuf_sz, elfnotes_sz,
+				      &vmcore_list);
+	proc_vmcore->size = vmcore_size;
+	return 0;
+}
+
+static void vmcore_process_device_ram(struct vmcore_cb *cb)
+{
+	unsigned char *e_ident = (unsigned char *)elfcorebuf;
+	struct vmcore_mem_node *first, *m;
+	LIST_HEAD(list);
+	int count;
+
+	if (cb->get_device_ram(cb, &list)) {
+		pr_err("Kdump: obtaining device ram ranges failed\n");
+		return;
+	}
+	count = list_count_nodes(&list);
+	if (!count)
+		return;
+
+	/* We only support Elf64 dumps for now. */
+	if (WARN_ON_ONCE(e_ident[EI_CLASS] != ELFCLASS64)) {
+		pr_err("Kdump: device ram ranges only support Elf64\n");
+		goto out_free;
+	}
+
+	/*
+	 * For some reason these ranges are already know? Might happen
+	 * with unusual register->unregister->register sequences; we'll simply
+	 * sanity check using the first range.
+	 */
+	first = list_first_entry(&list, struct vmcore_mem_node, list);
+	list_for_each_entry(m, &vmcore_list, list) {
+		unsigned long long m_end = m->paddr + m->size;
+		unsigned long long first_end = first->paddr + first->size;
+
+		if (first->paddr < m_end && m->paddr < first_end)
+			goto out_free;
+	}
+
+	/* If adding the mem nodes succeeds, they must not be freed. */
+	if (!vmcore_add_device_ram_elf64(&list, count))
+		return;
+out_free:
+	vmcore_free_mem_nodes(&list);
+}
+#else /* !CONFIG_PROC_VMCORE_DEVICE_RAM */
+static void vmcore_process_device_ram(struct vmcore_cb *cb)
+{
+}
+#endif /* CONFIG_PROC_VMCORE_DEVICE_RAM */
+
 /* Free all dumps in vmcore device dump list */
 static void vmcore_free_device_dumps(void)
 {
diff --git a/include/linux/crash_dump.h b/include/linux/crash_dump.h
index 722dbcff7371..8e581a053d7f 100644
--- a/include/linux/crash_dump.h
+++ b/include/linux/crash_dump.h
@@ -20,6 +20,8 @@  extern int elfcorehdr_alloc(unsigned long long *addr, unsigned long long *size);
 extern void elfcorehdr_free(unsigned long long addr);
 extern ssize_t elfcorehdr_read(char *buf, size_t count, u64 *ppos);
 extern ssize_t elfcorehdr_read_notes(char *buf, size_t count, u64 *ppos);
+void elfcorehdr_fill_device_ram_ptload_elf64(Elf64_Phdr *phdr,
+		unsigned long long paddr, unsigned long long size);
 extern int remap_oldmem_pfn_range(struct vm_area_struct *vma,
 				  unsigned long from, unsigned long pfn,
 				  unsigned long size, pgprot_t prot);
@@ -99,6 +101,12 @@  static inline void vmcore_unusable(void)
  *              indicated in the vmcore instead. For example, a ballooned page
  *              contains no data and reading from such a page will cause high
  *              load in the hypervisor.
+ * @get_device_ram: query RAM ranges that can only be detected by device
+ *   drivers, such as the virtio-mem driver, so they can be included in
+ *   the crash dump on architectures that allocate the elfcore hdr in the dump
+ *   ("2nd") kernel. Indicated RAM ranges may contain holes to reduce the
+ *   total number of ranges; such holes can be detected using the pfn_is_ram
+ *   callback just like for other RAM.
  * @next: List head to manage registered callbacks internally; initialized by
  *        register_vmcore_cb().
  *
@@ -109,6 +117,7 @@  static inline void vmcore_unusable(void)
  */
 struct vmcore_cb {
 	bool (*pfn_is_ram)(struct vmcore_cb *cb, unsigned long pfn);
+	int (*get_device_ram)(struct vmcore_cb *cb, struct list_head *list);
 	struct list_head next;
 };
 extern void register_vmcore_cb(struct vmcore_cb *cb);