diff mbox series

[v4,4/4] RISC-V: Add arch functions to support hibernation/suspend-to-disk

Message ID 20230221023523.1498500-5-jeeheng.sia@starfivetech.com (mailing list archive)
State Superseded
Headers show
Series RISC-V Hibernation Support | expand

Checks

Context Check Description
conchuod/cover_letter success Series has a cover letter
conchuod/tree_selection success Guessed tree name to be for-next
conchuod/fixes_present success Fixes tag not required for -next series
conchuod/maintainers_pattern success MAINTAINERS pattern errors before the patch: 13 and now 13
conchuod/verify_signedoff success Signed-off-by tag matches author and committer
conchuod/kdoc success Errors and warnings before: 0 this patch: 0
conchuod/build_rv64_clang_allmodconfig success Errors and warnings before: 2458 this patch: 0
conchuod/module_param success Was 0 now: 0
conchuod/build_rv64_gcc_allmodconfig success Errors and warnings before: 17297 this patch: 0
conchuod/alphanumeric_selects warning Out of order selects before the patch: 729 and now 736
conchuod/build_rv32_defconfig success Build OK
conchuod/dtb_warn_rv64 success Errors and warnings before: 2 this patch: 2
conchuod/header_inline success No static functions without inline keyword in header files
conchuod/checkpatch warning CHECK: Consider using #include <linux/cacheflush.h> instead of <asm/cacheflush.h> CHECK: Consider using #include <linux/mmu_context.h> instead of <asm/mmu_context.h> CHECK: Consider using #include <linux/pgtable.h> instead of <asm/pgtable.h> CHECK: Consider using #include <linux/smp.h> instead of <asm/smp.h> WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
conchuod/source_inline fail Was 0 now: 1
conchuod/build_rv64_nommu_k210_defconfig success Build OK
conchuod/verify_fixes success No Fixes tag
conchuod/build_rv64_nommu_virt_defconfig success Build OK

Commit Message

Sia Jee Heng Feb. 21, 2023, 2:35 a.m. UTC
Add low-level arch functions to support hibernation.
swsusp_arch_suspend() relies on code from __cpu_suspend_enter() to write
the cpu state onto the stack, then calls swsusp_save() to save the memory
image.

An arch-specific hibernation header is implemented and is used by the
arch_hibernation_header_restore() and arch_hibernation_header_save()
functions. The arch-specific hibernation header consists of the satp,
the hartid, and the cpu_resume address. The kernel's build version is also
saved into the hibernation image header to make sure that only the same
kernel is restored on resume.

swsusp_arch_resume() creates a temporary page table that covers only
the linear map. It copies the restore code to a 'safe' page, then starts
to restore the memory image. Once completed, it restores the original
kernel's page table. It then calls into __hibernate_cpu_resume()
to restore the CPU context. Finally, it follows the normal hibernation
path back to the hibernation core.

To enable hibernation/suspend-to-disk on RISC-V, the below configs
need to be enabled:
- CONFIG_ARCH_HIBERNATION_HEADER
- CONFIG_ARCH_HIBERNATION_POSSIBLE

Signed-off-by: Sia Jee Heng <jeeheng.sia@starfivetech.com>
Reviewed-by: Ley Foon Tan <leyfoon.tan@starfivetech.com>
Reviewed-by: Mason Huo <mason.huo@starfivetech.com>
---
 arch/riscv/Kconfig                 |   7 +
 arch/riscv/include/asm/assembler.h |  20 ++
 arch/riscv/include/asm/suspend.h   |  19 ++
 arch/riscv/kernel/Makefile         |   1 +
 arch/riscv/kernel/asm-offsets.c    |   5 +
 arch/riscv/kernel/hibernate-asm.S  |  77 +++++
 arch/riscv/kernel/hibernate.c      | 447 +++++++++++++++++++++++++++++
 7 files changed, 576 insertions(+)
 create mode 100644 arch/riscv/kernel/hibernate-asm.S
 create mode 100644 arch/riscv/kernel/hibernate.c

Comments

Andrew Jones Feb. 23, 2023, 6:07 p.m. UTC | #1
On Tue, Feb 21, 2023 at 10:35:23AM +0800, Sia Jee Heng wrote:
> Add low-level arch functions to support hibernation.
> swsusp_arch_suspend() relies on code from __cpu_suspend_enter() to write
> the cpu state onto the stack, then calls swsusp_save() to save the memory
> image.
> 
> An arch-specific hibernation header is implemented and is used by the
> arch_hibernation_header_restore() and arch_hibernation_header_save()
> functions. The arch-specific hibernation header consists of the satp,
> the hartid, and the cpu_resume address. The kernel's build version is also
> saved into the hibernation image header to make sure that only the same
> kernel is restored on resume.
> 
> swsusp_arch_resume() creates a temporary page table that covers only
> the linear map. It copies the restore code to a 'safe' page, then starts
> to restore the memory image. Once completed, it restores the original
> kernel's page table. It then calls into __hibernate_cpu_resume()
> to restore the CPU context. Finally, it follows the normal hibernation
> path back to the hibernation core.
> 
> To enable hibernation/suspend-to-disk on RISC-V, the below configs
> need to be enabled:
> - CONFIG_ARCH_HIBERNATION_HEADER
> - CONFIG_ARCH_HIBERNATION_POSSIBLE
> 
> Signed-off-by: Sia Jee Heng <jeeheng.sia@starfivetech.com>
> Reviewed-by: Ley Foon Tan <leyfoon.tan@starfivetech.com>
> Reviewed-by: Mason Huo <mason.huo@starfivetech.com>
> ---
>  arch/riscv/Kconfig                 |   7 +
>  arch/riscv/include/asm/assembler.h |  20 ++
>  arch/riscv/include/asm/suspend.h   |  19 ++
>  arch/riscv/kernel/Makefile         |   1 +
>  arch/riscv/kernel/asm-offsets.c    |   5 +
>  arch/riscv/kernel/hibernate-asm.S  |  77 +++++
>  arch/riscv/kernel/hibernate.c      | 447 +++++++++++++++++++++++++++++
>  7 files changed, 576 insertions(+)
>  create mode 100644 arch/riscv/kernel/hibernate-asm.S
>  create mode 100644 arch/riscv/kernel/hibernate.c
> 
> diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
> index e2b656043abf..4555848a817f 100644
> --- a/arch/riscv/Kconfig
> +++ b/arch/riscv/Kconfig
> @@ -690,6 +690,13 @@ menu "Power management options"
>  
>  source "kernel/power/Kconfig"
>  
> +config ARCH_HIBERNATION_POSSIBLE
> +	def_bool y
> +
> +config ARCH_HIBERNATION_HEADER
> +	def_bool y
> +	depends on HIBERNATION

nit: I think this can be simplified as def_bool HIBERNATION
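
For reference, the simplified entry would presumably read:

config ARCH_HIBERNATION_HEADER
	def_bool HIBERNATION

which, for a symbol with no prompt, should behave the same as the
def_bool y / depends on HIBERNATION pair above.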

> +
>  endmenu # "Power management options"
>  
>  menu "CPU Power Management"
> diff --git a/arch/riscv/include/asm/assembler.h b/arch/riscv/include/asm/assembler.h
> index 727a97735493..68c46c0e0ea8 100644
> --- a/arch/riscv/include/asm/assembler.h
> +++ b/arch/riscv/include/asm/assembler.h
> @@ -59,4 +59,24 @@
>  		REG_L	s11, (SUSPEND_CONTEXT_REGS + PT_S11)(a0)
>  	.endm
>  
> +/*
> + * copy_page - copy 1 page (4KB) of data from source to destination
> + * @a0 - destination
> + * @a1 - source
> + */
> +	.macro	copy_page a0, a1
> +		lui	a2, 0x1
> +		add	a2, a2, a0
> +1 :
    ^ please remove this space

> +		REG_L	t0, 0(a1)
> +		REG_L	t1, SZREG(a1)
> +
> +		REG_S	t0, 0(a0)
> +		REG_S	t1, SZREG(a0)
> +
> +		addi	a0, a0, 2 * SZREG
> +		addi	a1, a1, 2 * SZREG
> +		bne	a2, a0, 1b
> +	.endm
> +
>  #endif	/* __ASM_ASSEMBLER_H */
> diff --git a/arch/riscv/include/asm/suspend.h b/arch/riscv/include/asm/suspend.h
> index 75419c5ca272..3362da56a9d8 100644
> --- a/arch/riscv/include/asm/suspend.h
> +++ b/arch/riscv/include/asm/suspend.h
> @@ -21,6 +21,11 @@ struct suspend_context {
>  #endif
>  };
>  
> +/*
> + * Used by hibernation core and cleared during resume sequence
> + */
> +extern int in_suspend;
> +
>  /* Low-level CPU suspend entry function */
>  int __cpu_suspend_enter(struct suspend_context *context);
>  
> @@ -36,4 +41,18 @@ int __cpu_resume_enter(unsigned long hartid, unsigned long context);
>  /* Used to save and restore the csr */
>  void suspend_save_csrs(struct suspend_context *context);
>  void suspend_restore_csrs(struct suspend_context *context);
> +
> +/* Low-level API to support hibernation */
> +int swsusp_arch_suspend(void);
> +int swsusp_arch_resume(void);
> +int arch_hibernation_header_save(void *addr, unsigned int max_size);
> +int arch_hibernation_header_restore(void *addr);
> +int __hibernate_cpu_resume(void);
> +
> +/* Used to resume on the CPU we hibernated on */
> +int hibernate_resume_nonboot_cpu_disable(void);
> +
> +asmlinkage void hibernate_restore_image(unsigned long resume_satp, unsigned long satp_temp,
> +					unsigned long cpu_resume);
> +asmlinkage int hibernate_core_restore_code(void);
>  #endif
> diff --git a/arch/riscv/kernel/Makefile b/arch/riscv/kernel/Makefile
> index 4cf303a779ab..daab341d55e4 100644
> --- a/arch/riscv/kernel/Makefile
> +++ b/arch/riscv/kernel/Makefile
> @@ -64,6 +64,7 @@ obj-$(CONFIG_MODULES)		+= module.o
>  obj-$(CONFIG_MODULE_SECTIONS)	+= module-sections.o
>  
>  obj-$(CONFIG_CPU_PM)		+= suspend_entry.o suspend.o
> +obj-$(CONFIG_HIBERNATION)	+= hibernate.o hibernate-asm.o
>  
>  obj-$(CONFIG_FUNCTION_TRACER)	+= mcount.o ftrace.o
>  obj-$(CONFIG_DYNAMIC_FTRACE)	+= mcount-dyn.o
> diff --git a/arch/riscv/kernel/asm-offsets.c b/arch/riscv/kernel/asm-offsets.c
> index df9444397908..d6a75aac1d27 100644
> --- a/arch/riscv/kernel/asm-offsets.c
> +++ b/arch/riscv/kernel/asm-offsets.c
> @@ -9,6 +9,7 @@
>  #include <linux/kbuild.h>
>  #include <linux/mm.h>
>  #include <linux/sched.h>
> +#include <linux/suspend.h>
>  #include <asm/kvm_host.h>
>  #include <asm/thread_info.h>
>  #include <asm/ptrace.h>
> @@ -116,6 +117,10 @@ void asm_offsets(void)
>  
>  	OFFSET(SUSPEND_CONTEXT_REGS, suspend_context, regs);
>  
> +	OFFSET(HIBERN_PBE_ADDR, pbe, address);
> +	OFFSET(HIBERN_PBE_ORIG, pbe, orig_address);
> +	OFFSET(HIBERN_PBE_NEXT, pbe, next);
> +
>  	OFFSET(KVM_ARCH_GUEST_ZERO, kvm_vcpu_arch, guest_context.zero);
>  	OFFSET(KVM_ARCH_GUEST_RA, kvm_vcpu_arch, guest_context.ra);
>  	OFFSET(KVM_ARCH_GUEST_SP, kvm_vcpu_arch, guest_context.sp);
> diff --git a/arch/riscv/kernel/hibernate-asm.S b/arch/riscv/kernel/hibernate-asm.S
> new file mode 100644
> index 000000000000..846affe4dced
> --- /dev/null
> +++ b/arch/riscv/kernel/hibernate-asm.S
> @@ -0,0 +1,77 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/*
> + * Hibernation low level support for RISCV.
> + *
> + * Copyright (C) 2023 StarFive Technology Co., Ltd.
> + *
> + * Author: Jee Heng Sia <jeeheng.sia@starfivetech.com>
> + */
> +
> +#include <asm/asm.h>
> +#include <asm/asm-offsets.h>
> +#include <asm/assembler.h>
> +#include <asm/csr.h>
> +
> +#include <linux/linkage.h>
> +
> +/*
> + * int __hibernate_cpu_resume(void)
> + * Switch back to the hibernated image's page table prior to restoring the CPU
> + * context.
> + *
> + * Always returns 0
> + */
> +ENTRY(__hibernate_cpu_resume)
> +	/* switch to hibernated image's page table. */
> +	csrw CSR_SATP, s0
> +	sfence.vma
> +
> +	REG_L	a0, hibernate_cpu_context
> +
> +	restore_csr
> +	restore_reg
> +
> +	/* Return zero value. */
> +	add	a0, zero, zero

nit: mv a0, zero

> +
> +	ret
> +END(__hibernate_cpu_resume)
> +
> +/*
> + * Prepare to restore the image.
> + * a0: satp of saved page tables.
> + * a1: satp of temporary page tables.
> + * a2: cpu_resume.
> + */
> +ENTRY(hibernate_restore_image)
> +	mv	s0, a0
> +	mv	s1, a1
> +	mv	s2, a2
> +	REG_L	s4, restore_pblist
> +	REG_L	a1, relocated_restore_code
> +
> +	jalr	a1
> +END(hibernate_restore_image)
> +
> +/*
> + * The below code will be executed from a 'safe' page.
> + * It first switches to the temporary page table, then starts to copy the pages
> + * back to the original memory location. Finally, it jumps to __hibernate_cpu_resume()
> + * to restore the CPU context.
> + */
> +ENTRY(hibernate_core_restore_code)
> +	/* switch to temp page table. */
> +	csrw satp, s1
> +	sfence.vma
> +.Lcopy:
> +	/* The below code will restore the hibernated image. */
> +	REG_L	a1, HIBERN_PBE_ADDR(s4)
> +	REG_L	a0, HIBERN_PBE_ORIG(s4)

Are we sure restore_pblist will never be NULL?

> +
> +	copy_page a0, a1
> +
> +	REG_L	s4, HIBERN_PBE_NEXT(s4)
> +	bnez	s4, .Lcopy
> +
> +	jalr	s2
> +END(hibernate_core_restore_code)
> diff --git a/arch/riscv/kernel/hibernate.c b/arch/riscv/kernel/hibernate.c
> new file mode 100644
> index 000000000000..46a2f470db6e
> --- /dev/null
> +++ b/arch/riscv/kernel/hibernate.c
> @@ -0,0 +1,447 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Hibernation support for RISCV
> + *
> + * Copyright (C) 2023 StarFive Technology Co., Ltd.
> + *
> + * Author: Jee Heng Sia <jeeheng.sia@starfivetech.com>
> + */
> +
> +#include <asm/barrier.h>
> +#include <asm/cacheflush.h>
> +#include <asm/mmu_context.h>
> +#include <asm/page.h>
> +#include <asm/pgalloc.h>
> +#include <asm/pgtable.h>
> +#include <asm/sections.h>
> +#include <asm/set_memory.h>
> +#include <asm/smp.h>
> +#include <asm/suspend.h>
> +
> +#include <linux/cpu.h>
> +#include <linux/memblock.h>
> +#include <linux/pm.h>
> +#include <linux/sched.h>
> +#include <linux/suspend.h>
> +#include <linux/utsname.h>
> +
> +/* The logical cpu number we should resume on, initialised to a non-cpu number. */
> +static int sleep_cpu = -EINVAL;
> +
> +/* Pointer to the temporary resume page table. */
> +static pgd_t *resume_pg_dir;
> +
> +/* CPU context to be saved. */
> +struct suspend_context *hibernate_cpu_context;
> +EXPORT_SYMBOL_GPL(hibernate_cpu_context);
> +
> +unsigned long relocated_restore_code;
> +EXPORT_SYMBOL_GPL(relocated_restore_code);
> +
> +/**
> + * struct arch_hibernate_hdr_invariants - container to store kernel build version.
> + * @uts_version: to save the build number and date so that the we do not resume with
> + *		a different kernel.
> + */
> +struct arch_hibernate_hdr_invariants {
> +	char		uts_version[__NEW_UTS_LEN + 1];
> +};
> +
> +/**
> + * struct arch_hibernate_hdr - helper parameters that help us to restore the image.
> + * @invariants: container to store kernel build version.
> + * @hartid: to make sure same boot_cpu executes the hibernate/restore code.
> + * @saved_satp: original page table used by the hibernated image.
> + * @restore_cpu_addr: the kernel's image address to restore the CPU context.
> + */
> +static struct arch_hibernate_hdr {
> +	struct arch_hibernate_hdr_invariants invariants;
> +	unsigned long	hartid;
> +	unsigned long	saved_satp;
> +	unsigned long	restore_cpu_addr;
> +} resume_hdr;
> +
> +static inline void arch_hdr_invariants(struct arch_hibernate_hdr_invariants *i)
> +{
> +	memset(i, 0, sizeof(*i));
> +	memcpy(i->uts_version, init_utsname()->version, sizeof(i->uts_version));
> +}
> +
> +/*
> + * Check if the given pfn is in the 'nosave' section.
> + */
> +int pfn_is_nosave(unsigned long pfn)
> +{
> +	unsigned long nosave_begin_pfn = sym_to_pfn(&__nosave_begin);
> +	unsigned long nosave_end_pfn = sym_to_pfn(&__nosave_end - 1);
> +
> +	return ((pfn >= nosave_begin_pfn) && (pfn <= nosave_end_pfn));
> +}
> +
> +void notrace save_processor_state(void)
> +{
> +	WARN_ON(num_online_cpus() != 1);
> +}
> +
> +void notrace restore_processor_state(void)
> +{
> +}
> +
> +/*
> + * Helper parameters need to be saved to the hibernation image header.
> + */
> +int arch_hibernation_header_save(void *addr, unsigned int max_size)
> +{
> +	struct arch_hibernate_hdr *hdr = addr;
> +
> +	if (max_size < sizeof(*hdr))
> +		return -EOVERFLOW;
> +
> +	arch_hdr_invariants(&hdr->invariants);
> +
> +	hdr->hartid = cpuid_to_hartid_map(sleep_cpu);
> +	hdr->saved_satp = csr_read(CSR_SATP);
> +	hdr->restore_cpu_addr = (unsigned long)__hibernate_cpu_resume;
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(arch_hibernation_header_save);
> +
> +/*
> + * Retrieve the helper parameters from the hibernation image header.
> + */
> +int arch_hibernation_header_restore(void *addr)
> +{
> +	struct arch_hibernate_hdr_invariants invariants;
> +	struct arch_hibernate_hdr *hdr = addr;
> +	int ret = 0;
> +
> +	arch_hdr_invariants(&invariants);
> +
> +	if (memcmp(&hdr->invariants, &invariants, sizeof(invariants))) {
> +		pr_crit("Hibernate image not generated by this kernel!\n");
> +		return -EINVAL;
> +	}
> +
> +	sleep_cpu = riscv_hartid_to_cpuid(hdr->hartid);
> +	if (sleep_cpu < 0) {
> +		pr_crit("Hibernated on a CPU not known to this kernel!\n");
> +		sleep_cpu = -EINVAL;
> +		return -EINVAL;
> +	}
> +
> +#ifdef CONFIG_SMP
> +	ret = bringup_hibernate_cpu(sleep_cpu);
> +	if (ret) {
> +		sleep_cpu = -EINVAL;
> +		return ret;
> +	}
> +#endif
> +	resume_hdr = *hdr;
> +
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(arch_hibernation_header_restore);
> +
> +int swsusp_arch_suspend(void)
> +{
> +	int ret = 0;
> +
> +	if (__cpu_suspend_enter(hibernate_cpu_context)) {
> +		sleep_cpu = smp_processor_id();
> +		suspend_save_csrs(hibernate_cpu_context);
> +		ret = swsusp_save();
> +	} else {
> +		suspend_restore_csrs(hibernate_cpu_context);
> +		flush_tlb_all();
> +		flush_icache_all();
> +
> +		/*
> +		 * Tell the hibernation core that we've just restored the memory.
> +		 */
> +		in_suspend = 0;
> +		sleep_cpu = -EINVAL;
> +	}
> +
> +	return ret;
> +}
> +
> +static unsigned long _temp_pgtable_map_pte(pte_t *dst_ptep, pte_t *src_ptep,
> +					   unsigned long addr, pgprot_t prot)
> +{
> +	pte_t pte = READ_ONCE(*src_ptep);
> +
> +	if (pte_present(pte))
> +		set_pte(dst_ptep, __pte(pte_val(pte) | pgprot_val(prot)));
> +
> +	return 0;
> +}
> +
> +static unsigned long temp_pgtable_map_pte(pmd_t *dst_pmdp, pmd_t *src_pmdp,
> +					  unsigned long start, unsigned long end,
> +					  pgprot_t prot)
> +{
> +	unsigned long addr = start;
> +	pte_t *src_ptep;
> +	pte_t *dst_ptep;
> +
> +	if (pmd_none(READ_ONCE(*dst_pmdp))) {
> +		dst_ptep = (pte_t *)get_safe_page(GFP_ATOMIC);
> +		if (!dst_ptep)
> +			return -ENOMEM;
> +
> +		pmd_populate_kernel(NULL, dst_pmdp, dst_ptep);
> +	}
> +
> +	dst_ptep = pte_offset_kernel(dst_pmdp, start);
> +	src_ptep = pte_offset_kernel(src_pmdp, start);
> +
> +	do {
> +		_temp_pgtable_map_pte(dst_ptep, src_ptep, addr, prot);

I think I'd rather have the body of _temp_pgtable_map_pte() here and drop
the helper, because the helper does (pte_val(pte) | pgprot_val(prot))
which looks strange, until seeing here that 'pte' is only the address
bits, so OR'ing in new prot bits without clearing old prot bits makes
sense.
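
As a rough, untested sketch of that suggestion, the open-coded loop could
look something like:

	do {
		pte_t pte = READ_ONCE(*src_ptep);

		/* keep the existing bits, OR in the prot bits needed for restore */
		if (pte_present(pte))
			set_pte(dst_ptep, __pte(pte_val(pte) | pgprot_val(prot)));
	} while (dst_ptep++, src_ptep++, addr += PAGE_SIZE, addr < end);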

> +	} while (dst_ptep++, src_ptep++, addr += PAGE_SIZE, addr < end);
> +
> +	return 0;
> +}
> +
> +static unsigned long temp_pgtable_map_pmd(pud_t *dst_pudp, pud_t *src_pudp,
> +					  unsigned long start, unsigned long end,
> +					  pgprot_t prot)
> +{
> +	unsigned long addr = start;
> +	unsigned long next;
> +	unsigned long ret;
> +	pmd_t *src_pmdp;
> +	pmd_t *dst_pmdp;
> +
> +	if (pud_none(READ_ONCE(*dst_pudp))) {
> +		dst_pmdp = (pmd_t *)get_safe_page(GFP_ATOMIC);
> +		if (!dst_pmdp)
> +			return -ENOMEM;
> +
> +		pud_populate(NULL, dst_pudp, dst_pmdp);
> +	}
> +
> +	dst_pmdp = pmd_offset(dst_pudp, start);
> +	src_pmdp = pmd_offset(src_pudp, start);
> +
> +	do {
> +		pmd_t pmd = READ_ONCE(*src_pmdp);
> +
> +		next = pmd_addr_end(addr, end);
> +
> +		if (pmd_none(pmd))
> +			continue;
> +
> +		if (pmd_leaf(pmd)) {
> +			set_pmd(dst_pmdp, __pmd(pmd_val(pmd) | pgprot_val(prot)));
> +		} else {
> +			ret = temp_pgtable_map_pte(dst_pmdp, src_pmdp, addr, next, prot);
> +			if (ret)
> +				return -ENOMEM;
> +		}
> +	} while (dst_pmdp++, src_pmdp++, addr = next, addr != end);
> +
> +	return 0;
> +}
> +
> +static unsigned long temp_pgtable_map_pud(p4d_t *dst_p4dp, p4d_t *src_p4dp,
> +					  unsigned long start,
> +					  unsigned long end, pgprot_t prot)
> +{
> +	unsigned long addr = start;
> +	unsigned long next;
> +	unsigned long ret;
> +	pud_t *dst_pudp;
> +	pud_t *src_pudp;
> +
> +	if (p4d_none(READ_ONCE(*dst_p4dp))) {
> +		dst_pudp = (pud_t *)get_safe_page(GFP_ATOMIC);
> +		if (!dst_pudp)
> +			return -ENOMEM;
> +
> +		p4d_populate(NULL, dst_p4dp, dst_pudp);
> +	}
> +
> +	dst_pudp = pud_offset(dst_p4dp, start);
> +	src_pudp = pud_offset(src_p4dp, start);
> +
> +	do {
> +		pud_t pud = READ_ONCE(*src_pudp);
> +
> +		next = pud_addr_end(addr, end);
> +
> +		if (pud_none(pud))
> +			continue;
> +
> +		if (pud_leaf(pud)) {
> +			set_pud(dst_pudp, __pud(pud_val(pud) | pgprot_val(prot)));
> +		} else {
> +			ret = temp_pgtable_map_pmd(dst_pudp, src_pudp, addr, next, prot);
> +			if (ret)
> +				return -ENOMEM;
> +		}
> +	} while (dst_pudp++, src_pudp++, addr = next, addr != end);
> +
> +	return 0;
> +}
> +
> +static unsigned long temp_pgtable_map_p4d(pgd_t *dst_pgdp, pgd_t *src_pgdp,
> +					  unsigned long start, unsigned long end,
> +					  pgprot_t prot)
> +{
> +	unsigned long addr = start;
> +	unsigned long next;
> +	unsigned long ret;
> +	p4d_t *dst_p4dp;
> +	p4d_t *src_p4dp;
> +
> +	if (pgd_none(READ_ONCE(*dst_pgdp))) {
> +		dst_p4dp = (p4d_t *)get_safe_page(GFP_ATOMIC);
> +		if (!dst_p4dp)
> +			return -ENOMEM;
> +
> +		pgd_populate(NULL, dst_pgdp, dst_p4dp);
> +	}
> +
> +	dst_p4dp = p4d_offset(dst_pgdp, start);
> +	src_p4dp = p4d_offset(src_pgdp, start);
> +
> +	do {
> +		p4d_t p4d = READ_ONCE(*src_p4dp);
> +
> +		next = p4d_addr_end(addr, end);
> +
> +		if (p4d_none(READ_ONCE(*src_p4dp)))
> +			continue;
> +
> +		if (p4d_leaf(p4d)) {
> +			set_p4d(dst_p4dp, __p4d(p4d_val(p4d) | pgprot_val(prot)));
> +		} else {
> +			ret = temp_pgtable_map_pud(dst_p4dp, src_p4dp, addr, next, prot);
> +			if (ret)
> +				return -ENOMEM;
> +		}
> +	} while (dst_p4dp++, src_p4dp++, addr = next, addr != end);
> +
> +	return 0;
> +}
> +
> +static unsigned long temp_pgtable_mapping(pgd_t *pgdp)
> +{
> +	unsigned long end = (unsigned long)pfn_to_virt(max_low_pfn);
> +	unsigned long addr = PAGE_OFFSET;
> +	pgd_t *dst_pgdp = pgd_offset_pgd(pgdp, addr);
> +	pgd_t *src_pgdp = pgd_offset_k(addr);
> +	unsigned long next;
> +
> +	do {
> +		next = pgd_addr_end(addr, end);
> +		if (pgd_none(READ_ONCE(*src_pgdp)))
> +			continue;
> +
> +		if (temp_pgtable_map_p4d(dst_pgdp, src_pgdp, addr, next, PAGE_KERNEL))
> +			return -ENOMEM;
> +	} while (dst_pgdp++, src_pgdp++, addr = next, addr != end);
> +
> +	return 0;
> +}
> +
> +static unsigned long temp_pgtable_text_mapping(pgd_t *pgdp, unsigned long addr)
> +{
> +	pgd_t *dst_pgdp = pgd_offset_pgd(pgdp, addr);
> +	pgd_t *src_pgdp = pgd_offset_k(addr);
> +
> +	if (pgd_none(READ_ONCE(*src_pgdp)))
> +		return -EFAULT;
> +
> +	if (temp_pgtable_map_p4d(dst_pgdp, src_pgdp, addr, addr, PAGE_KERNEL_EXEC))
> +		return -ENOMEM;
> +
> +	return 0;
> +}
> +
> +static unsigned long relocate_restore_code(void)
> +{
> +	unsigned long ret;
> +	void *page = (void *)get_safe_page(GFP_ATOMIC);
> +
> +	if (!page)
> +		return -ENOMEM;
> +
> +	copy_page(page, hibernate_core_restore_code);
> +
> +	/* Make the page containing the relocated code executable. */
> +	set_memory_x((unsigned long)page, 1);
> +
> +	ret = temp_pgtable_text_mapping(resume_pg_dir, (unsigned long)page);
> +	if (ret)
> +		return ret;
> +
> +	return (unsigned long)page;
> +}
> +
> +int swsusp_arch_resume(void)
> +{
> +	unsigned long ret;
> +
> +	/*
> +	 * Memory allocated by get_safe_page() will be dealt with by the hibernation core,
> +	 * we don't need to free it here.
> +	 */
> +	resume_pg_dir = (pgd_t *)get_safe_page(GFP_ATOMIC);
> +	if (!resume_pg_dir)
> +		return -ENOMEM;
> +
> +	/*
> +	 * The pages need to be writable when restoring the image.
> +	 * Create a second copy of page table just for the linear map.
> +	 * Use this temporary page table to restore the image.
> +	 */
> +	ret = temp_pgtable_mapping(resume_pg_dir);
> +	if (ret)
> +		return (int)ret;
> +
> +	/* Move the restore code to a new page so that it doesn't get overwritten by itself. */
> +	relocated_restore_code = relocate_restore_code();
> +	if (relocated_restore_code == -ENOMEM)
> +		return -ENOMEM;
> +
> +	/*
> +	 * Map the __hibernate_cpu_resume() address to the temporary page table so that the
> +	 * restore code can jumps to it after finished restore the image. The next execution
> +	 * code doesn't find itself in a different address space after switching over to the
> +	 * original page table used by the hibernated image.
> +	 */
> +	ret = temp_pgtable_text_mapping(resume_pg_dir, (unsigned long)resume_hdr.restore_cpu_addr);
> +	if (ret)
> +		return ret;
> +
> +	hibernate_restore_image(resume_hdr.saved_satp, (PFN_DOWN(__pa(resume_pg_dir)) | satp_mode),
> +				resume_hdr.restore_cpu_addr);
> +
> +	return 0;
> +}
> +
> +#ifdef CONFIG_PM_SLEEP_SMP
> +int hibernate_resume_nonboot_cpu_disable(void)
> +{
> +	if (sleep_cpu < 0) {
> +		pr_err("Failing to resume from hibernate on an unknown CPU\n");
> +		return -ENODEV;
> +	}
> +
> +	return freeze_secondary_cpus(sleep_cpu);
> +}
> +#endif
> +
> +static int __init riscv_hibernate_init(void)
> +{
> +	hibernate_cpu_context = kzalloc(sizeof(*hibernate_cpu_context), GFP_KERNEL);
> +
> +	if (WARN_ON(!hibernate_cpu_context))
> +		return -ENOMEM;
> +
> +	return 0;
> +}
> +
> +early_initcall(riscv_hibernate_init);
> -- 
> 2.34.1
>

Thanks,
drew
Sia Jee Heng Feb. 24, 2023, 2:05 a.m. UTC | #2
> -----Original Message-----
> From: Andrew Jones <ajones@ventanamicro.com>
> Sent: Friday, 24 February, 2023 2:07 AM
> To: JeeHeng Sia <jeeheng.sia@starfivetech.com>
> Cc: paul.walmsley@sifive.com; palmer@dabbelt.com; aou@eecs.berkeley.edu; linux-riscv@lists.infradead.org; linux-
> kernel@vger.kernel.org; Leyfoon Tan <leyfoon.tan@starfivetech.com>; Mason Huo <mason.huo@starfivetech.com>
> Subject: Re: [PATCH v4 4/4] RISC-V: Add arch functions to support hibernation/suspend-to-disk
> 
> On Tue, Feb 21, 2023 at 10:35:23AM +0800, Sia Jee Heng wrote:
> > Add low-level arch functions to support hibernation.
> > swsusp_arch_suspend() relies on code from __cpu_suspend_enter() to write
> > the cpu state onto the stack, then calls swsusp_save() to save the memory
> > image.
> >
> > An arch-specific hibernation header is implemented and is used by the
> > arch_hibernation_header_restore() and arch_hibernation_header_save()
> > functions. The arch-specific hibernation header consists of the satp,
> > the hartid, and the cpu_resume address. The kernel's build version is also
> > saved into the hibernation image header to make sure that only the same
> > kernel is restored on resume.
> >
> > swsusp_arch_resume() creates a temporary page table that covers only
> > the linear map. It copies the restore code to a 'safe' page, then starts
> > to restore the memory image. Once completed, it restores the original
> > kernel's page table. It then calls into __hibernate_cpu_resume()
> > to restore the CPU context. Finally, it follows the normal hibernation
> > path back to the hibernation core.
> >
> > To enable hibernation/suspend-to-disk on RISC-V, the below configs
> > need to be enabled:
> > - CONFIG_ARCH_HIBERNATION_HEADER
> > - CONFIG_ARCH_HIBERNATION_POSSIBLE
> >
> > Signed-off-by: Sia Jee Heng <jeeheng.sia@starfivetech.com>
> > Reviewed-by: Ley Foon Tan <leyfoon.tan@starfivetech.com>
> > Reviewed-by: Mason Huo <mason.huo@starfivetech.com>
> > ---
> >  arch/riscv/Kconfig                 |   7 +
> >  arch/riscv/include/asm/assembler.h |  20 ++
> >  arch/riscv/include/asm/suspend.h   |  19 ++
> >  arch/riscv/kernel/Makefile         |   1 +
> >  arch/riscv/kernel/asm-offsets.c    |   5 +
> >  arch/riscv/kernel/hibernate-asm.S  |  77 +++++
> >  arch/riscv/kernel/hibernate.c      | 447 +++++++++++++++++++++++++++++
> >  7 files changed, 576 insertions(+)
> >  create mode 100644 arch/riscv/kernel/hibernate-asm.S
> >  create mode 100644 arch/riscv/kernel/hibernate.c
> >
> > diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
> > index e2b656043abf..4555848a817f 100644
> > --- a/arch/riscv/Kconfig
> > +++ b/arch/riscv/Kconfig
> > @@ -690,6 +690,13 @@ menu "Power management options"
> >
> >  source "kernel/power/Kconfig"
> >
> > +config ARCH_HIBERNATION_POSSIBLE
> > +	def_bool y
> > +
> > +config ARCH_HIBERNATION_HEADER
> > +	def_bool y
> > +	depends on HIBERNATION
> 
> nit: I think this can be simplified as def_bool HIBERNATION
Good suggestion, will change it.
> 
> > +
> >  endmenu # "Power management options"
> >
> >  menu "CPU Power Management"
> > diff --git a/arch/riscv/include/asm/assembler.h b/arch/riscv/include/asm/assembler.h
> > index 727a97735493..68c46c0e0ea8 100644
> > --- a/arch/riscv/include/asm/assembler.h
> > +++ b/arch/riscv/include/asm/assembler.h
> > @@ -59,4 +59,24 @@
> >  		REG_L	s11, (SUSPEND_CONTEXT_REGS + PT_S11)(a0)
> >  	.endm
> >
> > +/*
> > + * copy_page - copy 1 page (4KB) of data from source to destination
> > + * @a0 - destination
> > + * @a1 - source
> > + */
> > +	.macro	copy_page a0, a1
> > +		lui	a2, 0x1
> > +		add	a2, a2, a0
> > +1 :
>     ^ please remove this space
Can't remove it, otherwise checkpatch throws ERROR: spaces required around that ':'
> 
> > +		REG_L	t0, 0(a1)
> > +		REG_L	t1, SZREG(a1)
> > +
> > +		REG_S	t0, 0(a0)
> > +		REG_S	t1, SZREG(a0)
> > +
> > +		addi	a0, a0, 2 * SZREG
> > +		addi	a1, a1, 2 * SZREG
> > +		bne	a2, a0, 1b
> > +	.endm
> > +
> >  #endif	/* __ASM_ASSEMBLER_H */
> > diff --git a/arch/riscv/include/asm/suspend.h b/arch/riscv/include/asm/suspend.h
> > index 75419c5ca272..3362da56a9d8 100644
> > --- a/arch/riscv/include/asm/suspend.h
> > +++ b/arch/riscv/include/asm/suspend.h
> > @@ -21,6 +21,11 @@ struct suspend_context {
> >  #endif
> >  };
> >
> > +/*
> > + * Used by hibernation core and cleared during resume sequence
> > + */
> > +extern int in_suspend;
> > +
> >  /* Low-level CPU suspend entry function */
> >  int __cpu_suspend_enter(struct suspend_context *context);
> >
> > @@ -36,4 +41,18 @@ int __cpu_resume_enter(unsigned long hartid, unsigned long context);
> >  /* Used to save and restore the csr */
> >  void suspend_save_csrs(struct suspend_context *context);
> >  void suspend_restore_csrs(struct suspend_context *context);
> > +
> > +/* Low-level API to support hibernation */
> > +int swsusp_arch_suspend(void);
> > +int swsusp_arch_resume(void);
> > +int arch_hibernation_header_save(void *addr, unsigned int max_size);
> > +int arch_hibernation_header_restore(void *addr);
> > +int __hibernate_cpu_resume(void);
> > +
> > +/* Used to resume on the CPU we hibernated on */
> > +int hibernate_resume_nonboot_cpu_disable(void);
> > +
> > +asmlinkage void hibernate_restore_image(unsigned long resume_satp, unsigned long satp_temp,
> > +					unsigned long cpu_resume);
> > +asmlinkage int hibernate_core_restore_code(void);
> >  #endif
> > diff --git a/arch/riscv/kernel/Makefile b/arch/riscv/kernel/Makefile
> > index 4cf303a779ab..daab341d55e4 100644
> > --- a/arch/riscv/kernel/Makefile
> > +++ b/arch/riscv/kernel/Makefile
> > @@ -64,6 +64,7 @@ obj-$(CONFIG_MODULES)		+= module.o
> >  obj-$(CONFIG_MODULE_SECTIONS)	+= module-sections.o
> >
> >  obj-$(CONFIG_CPU_PM)		+= suspend_entry.o suspend.o
> > +obj-$(CONFIG_HIBERNATION)	+= hibernate.o hibernate-asm.o
> >
> >  obj-$(CONFIG_FUNCTION_TRACER)	+= mcount.o ftrace.o
> >  obj-$(CONFIG_DYNAMIC_FTRACE)	+= mcount-dyn.o
> > diff --git a/arch/riscv/kernel/asm-offsets.c b/arch/riscv/kernel/asm-offsets.c
> > index df9444397908..d6a75aac1d27 100644
> > --- a/arch/riscv/kernel/asm-offsets.c
> > +++ b/arch/riscv/kernel/asm-offsets.c
> > @@ -9,6 +9,7 @@
> >  #include <linux/kbuild.h>
> >  #include <linux/mm.h>
> >  #include <linux/sched.h>
> > +#include <linux/suspend.h>
> >  #include <asm/kvm_host.h>
> >  #include <asm/thread_info.h>
> >  #include <asm/ptrace.h>
> > @@ -116,6 +117,10 @@ void asm_offsets(void)
> >
> >  	OFFSET(SUSPEND_CONTEXT_REGS, suspend_context, regs);
> >
> > +	OFFSET(HIBERN_PBE_ADDR, pbe, address);
> > +	OFFSET(HIBERN_PBE_ORIG, pbe, orig_address);
> > +	OFFSET(HIBERN_PBE_NEXT, pbe, next);
> > +
> >  	OFFSET(KVM_ARCH_GUEST_ZERO, kvm_vcpu_arch, guest_context.zero);
> >  	OFFSET(KVM_ARCH_GUEST_RA, kvm_vcpu_arch, guest_context.ra);
> >  	OFFSET(KVM_ARCH_GUEST_SP, kvm_vcpu_arch, guest_context.sp);
> > diff --git a/arch/riscv/kernel/hibernate-asm.S b/arch/riscv/kernel/hibernate-asm.S
> > new file mode 100644
> > index 000000000000..846affe4dced
> > --- /dev/null
> > +++ b/arch/riscv/kernel/hibernate-asm.S
> > @@ -0,0 +1,77 @@
> > +/* SPDX-License-Identifier: GPL-2.0-only */
> > +/*
> > + * Hibernation low level support for RISCV.
> > + *
> > + * Copyright (C) 2023 StarFive Technology Co., Ltd.
> > + *
> > + * Author: Jee Heng Sia <jeeheng.sia@starfivetech.com>
> > + */
> > +
> > +#include <asm/asm.h>
> > +#include <asm/asm-offsets.h>
> > +#include <asm/assembler.h>
> > +#include <asm/csr.h>
> > +
> > +#include <linux/linkage.h>
> > +
> > +/*
> > + * int __hibernate_cpu_resume(void)
> > + * Switch back to the hibernated image's page table prior to restoring the CPU
> > + * context.
> > + *
> > + * Always returns 0
> > + */
> > +ENTRY(__hibernate_cpu_resume)
> > +	/* switch to hibernated image's page table. */
> > +	csrw CSR_SATP, s0
> > +	sfence.vma
> > +
> > +	REG_L	a0, hibernate_cpu_context
> > +
> > +	restore_csr
> > +	restore_reg
> > +
> > +	/* Return zero value. */
> > +	add	a0, zero, zero
> 
> nit: mv a0, zero
sure
> 
> > +
> > +	ret
> > +END(__hibernate_cpu_resume)
> > +
> > +/*
> > + * Prepare to restore the image.
> > + * a0: satp of saved page tables.
> > + * a1: satp of temporary page tables.
> > + * a2: cpu_resume.
> > + */
> > +ENTRY(hibernate_restore_image)
> > +	mv	s0, a0
> > +	mv	s1, a1
> > +	mv	s2, a2
> > +	REG_L	s4, restore_pblist
> > +	REG_L	a1, relocated_restore_code
> > +
> > +	jalr	a1
> > +END(hibernate_restore_image)
> > +
> > +/*
> > + * The below code will be executed from a 'safe' page.
> > + * It first switches to the temporary page table, then starts to copy the pages
> > + * back to the original memory location. Finally, it jumps to __hibernate_cpu_resume()
> > + * to restore the CPU context.
> > + */
> > +ENTRY(hibernate_core_restore_code)
> > +	/* switch to temp page table. */
> > +	csrw satp, s1
> > +	sfence.vma
> > +.Lcopy:
> > +	/* The below code will restore the hibernated image. */
> > +	REG_L	a1, HIBERN_PBE_ADDR(s4)
> > +	REG_L	a0, HIBERN_PBE_ORIG(s4)
> 
> Are we sure restore_pblist will never be NULL?
restore_pblist is a linked list; it will be NULL during initialization or during page cleanup by the hibernation core. During the initial resume process, the hibernation core will check the header and load the pages. If everything works correctly, the pages will be linked into restore_pblist before swsusp_arch_resume() is invoked; otherwise the hibernation core will throw an error and fail to resume from the hibernated image.
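
In C terms, the assembly loop above is roughly equivalent to the following
sketch (illustration only):

	struct pbe *pbe;

	for (pbe = restore_pblist; pbe; pbe = pbe->next)
		copy_page(pbe->orig_address, pbe->address);

Note that the C form naturally handles an empty list, whereas the assembly
assumes at least one entry is present, which, per the explanation above, the
hibernation core guarantees before swsusp_arch_resume() is called.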
> 
> > +
> > +	copy_page a0, a1
> > +
> > +	REG_L	s4, HIBERN_PBE_NEXT(s4)
> > +	bnez	s4, .Lcopy
> > +
> > +	jalr	s2
> > +END(hibernate_core_restore_code)
> > diff --git a/arch/riscv/kernel/hibernate.c b/arch/riscv/kernel/hibernate.c
> > new file mode 100644
> > index 000000000000..46a2f470db6e
> > --- /dev/null
> > +++ b/arch/riscv/kernel/hibernate.c
> > @@ -0,0 +1,447 @@
> > +// SPDX-License-Identifier: GPL-2.0-only
> > +/*
> > + * Hibernation support for RISCV
> > + *
> > + * Copyright (C) 2023 StarFive Technology Co., Ltd.
> > + *
> > + * Author: Jee Heng Sia <jeeheng.sia@starfivetech.com>
> > + */
> > +
> > +#include <asm/barrier.h>
> > +#include <asm/cacheflush.h>
> > +#include <asm/mmu_context.h>
> > +#include <asm/page.h>
> > +#include <asm/pgalloc.h>
> > +#include <asm/pgtable.h>
> > +#include <asm/sections.h>
> > +#include <asm/set_memory.h>
> > +#include <asm/smp.h>
> > +#include <asm/suspend.h>
> > +
> > +#include <linux/cpu.h>
> > +#include <linux/memblock.h>
> > +#include <linux/pm.h>
> > +#include <linux/sched.h>
> > +#include <linux/suspend.h>
> > +#include <linux/utsname.h>
> > +
> > +/* The logical cpu number we should resume on, initialised to a non-cpu number. */
> > +static int sleep_cpu = -EINVAL;
> > +
> > +/* Pointer to the temporary resume page table. */
> > +static pgd_t *resume_pg_dir;
> > +
> > +/* CPU context to be saved. */
> > +struct suspend_context *hibernate_cpu_context;
> > +EXPORT_SYMBOL_GPL(hibernate_cpu_context);
> > +
> > +unsigned long relocated_restore_code;
> > +EXPORT_SYMBOL_GPL(relocated_restore_code);
> > +
> > +/**
> > + * struct arch_hibernate_hdr_invariants - container to store kernel build version.
> > + * @uts_version: to save the build number and date so that the we do not resume with
> > + *		a different kernel.
> > + */
> > +struct arch_hibernate_hdr_invariants {
> > +	char		uts_version[__NEW_UTS_LEN + 1];
> > +};
> > +
> > +/**
> > + * struct arch_hibernate_hdr - helper parameters that help us to restore the image.
> > + * @invariants: container to store kernel build version.
> > + * @hartid: to make sure same boot_cpu executes the hibernate/restore code.
> > + * @saved_satp: original page table used by the hibernated image.
> > + * @restore_cpu_addr: the kernel's image address to restore the CPU context.
> > + */
> > +static struct arch_hibernate_hdr {
> > +	struct arch_hibernate_hdr_invariants invariants;
> > +	unsigned long	hartid;
> > +	unsigned long	saved_satp;
> > +	unsigned long	restore_cpu_addr;
> > +} resume_hdr;
> > +
> > +static inline void arch_hdr_invariants(struct arch_hibernate_hdr_invariants *i)
> > +{
> > +	memset(i, 0, sizeof(*i));
> > +	memcpy(i->uts_version, init_utsname()->version, sizeof(i->uts_version));
> > +}
> > +
> > +/*
> > + * Check if the given pfn is in the 'nosave' section.
> > + */
> > +int pfn_is_nosave(unsigned long pfn)
> > +{
> > +	unsigned long nosave_begin_pfn = sym_to_pfn(&__nosave_begin);
> > +	unsigned long nosave_end_pfn = sym_to_pfn(&__nosave_end - 1);
> > +
> > +	return ((pfn >= nosave_begin_pfn) && (pfn <= nosave_end_pfn));
> > +}
> > +
> > +void notrace save_processor_state(void)
> > +{
> > +	WARN_ON(num_online_cpus() != 1);
> > +}
> > +
> > +void notrace restore_processor_state(void)
> > +{
> > +}
> > +
> > +/*
> > + * Helper parameters need to be saved to the hibernation image header.
> > + */
> > +int arch_hibernation_header_save(void *addr, unsigned int max_size)
> > +{
> > +	struct arch_hibernate_hdr *hdr = addr;
> > +
> > +	if (max_size < sizeof(*hdr))
> > +		return -EOVERFLOW;
> > +
> > +	arch_hdr_invariants(&hdr->invariants);
> > +
> > +	hdr->hartid = cpuid_to_hartid_map(sleep_cpu);
> > +	hdr->saved_satp = csr_read(CSR_SATP);
> > +	hdr->restore_cpu_addr = (unsigned long)__hibernate_cpu_resume;
> > +
> > +	return 0;
> > +}
> > +EXPORT_SYMBOL_GPL(arch_hibernation_header_save);
> > +
> > +/*
> > + * Retrieve the helper parameters from the hibernation image header.
> > + */
> > +int arch_hibernation_header_restore(void *addr)
> > +{
> > +	struct arch_hibernate_hdr_invariants invariants;
> > +	struct arch_hibernate_hdr *hdr = addr;
> > +	int ret = 0;
> > +
> > +	arch_hdr_invariants(&invariants);
> > +
> > +	if (memcmp(&hdr->invariants, &invariants, sizeof(invariants))) {
> > +		pr_crit("Hibernate image not generated by this kernel!\n");
> > +		return -EINVAL;
> > +	}
> > +
> > +	sleep_cpu = riscv_hartid_to_cpuid(hdr->hartid);
> > +	if (sleep_cpu < 0) {
> > +		pr_crit("Hibernated on a CPU not known to this kernel!\n");
> > +		sleep_cpu = -EINVAL;
> > +		return -EINVAL;
> > +	}
> > +
> > +#ifdef CONFIG_SMP
> > +	ret = bringup_hibernate_cpu(sleep_cpu);
> > +	if (ret) {
> > +		sleep_cpu = -EINVAL;
> > +		return ret;
> > +	}
> > +#endif
> > +	resume_hdr = *hdr;
> > +
> > +	return ret;
> > +}
> > +EXPORT_SYMBOL_GPL(arch_hibernation_header_restore);
> > +
> > +int swsusp_arch_suspend(void)
> > +{
> > +	int ret = 0;
> > +
> > +	if (__cpu_suspend_enter(hibernate_cpu_context)) {
> > +		sleep_cpu = smp_processor_id();
> > +		suspend_save_csrs(hibernate_cpu_context);
> > +		ret = swsusp_save();
> > +	} else {
> > +		suspend_restore_csrs(hibernate_cpu_context);
> > +		flush_tlb_all();
> > +		flush_icache_all();
> > +
> > +		/*
> > +		 * Tell the hibernation core that we've just restored the memory.
> > +		 */
> > +		in_suspend = 0;
> > +		sleep_cpu = -EINVAL;
> > +	}
> > +
> > +	return ret;
> > +}
> > +
> > +static unsigned long _temp_pgtable_map_pte(pte_t *dst_ptep, pte_t *src_ptep,
> > +					   unsigned long addr, pgprot_t prot)
> > +{
> > +	pte_t pte = READ_ONCE(*src_ptep);
> > +
> > +	if (pte_present(pte))
> > +		set_pte(dst_ptep, __pte(pte_val(pte) | pgprot_val(prot)));
> > +
> > +	return 0;
> > +}
> > +
> > +static unsigned long temp_pgtable_map_pte(pmd_t *dst_pmdp, pmd_t *src_pmdp,
> > +					  unsigned long start, unsigned long end,
> > +					  pgprot_t prot)
> > +{
> > +	unsigned long addr = start;
> > +	pte_t *src_ptep;
> > +	pte_t *dst_ptep;
> > +
> > +	if (pmd_none(READ_ONCE(*dst_pmdp))) {
> > +		dst_ptep = (pte_t *)get_safe_page(GFP_ATOMIC);
> > +		if (!dst_ptep)
> > +			return -ENOMEM;
> > +
> > +		pmd_populate_kernel(NULL, dst_pmdp, dst_ptep);
> > +	}
> > +
> > +	dst_ptep = pte_offset_kernel(dst_pmdp, start);
> > +	src_ptep = pte_offset_kernel(src_pmdp, start);
> > +
> > +	do {
> > +		_temp_pgtable_map_pte(dst_ptep, src_ptep, addr, prot);
> 
> I think I'd rather have the body of _temp_pgtable_map_pte() here and drop
> the helper, because the helper does (pte_val(pte) | pgprot_val(prot))
> which looks strange, until seeing here that 'pte' is only the address
> bits, so OR'ing in new prot bits without clearing old prot bits makes
> sense.
We do not need to clear the old bits since we are going to keep those bits and only add the new bits which are required for resume. Let's hold your question here; I would like to see how Alex views it.
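
A hypothetical example of the point, using the RISC-V pte bits (illustration
only, values not taken from the patch):

	/*
	 * src entry   : PPN | _PAGE_PRESENT | _PAGE_READ | ...
	 * PAGE_KERNEL :       _PAGE_PRESENT | _PAGE_READ | _PAGE_WRITE | ...
	 * ORed result : PPN | _PAGE_PRESENT | _PAGE_READ | _PAGE_WRITE | ...
	 *
	 * Nothing already set in the source entry is cleared; the OR only adds
	 * the bits needed to make the temporary linear map writable for the
	 * restore.
	 */
	set_pte(dst_ptep, __pte(pte_val(pte) | pgprot_val(prot)));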
> 
> > +	} while (dst_ptep++, src_ptep++, addr += PAGE_SIZE, addr < end);
> > +
> > +	return 0;
> > +}
> > +
> > +static unsigned long temp_pgtable_map_pmd(pud_t *dst_pudp, pud_t *src_pudp,
> > +					  unsigned long start, unsigned long end,
> > +					  pgprot_t prot)
> > +{
> > +	unsigned long addr = start;
> > +	unsigned long next;
> > +	unsigned long ret;
> > +	pmd_t *src_pmdp;
> > +	pmd_t *dst_pmdp;
> > +
> > +	if (pud_none(READ_ONCE(*dst_pudp))) {
> > +		dst_pmdp = (pmd_t *)get_safe_page(GFP_ATOMIC);
> > +		if (!dst_pmdp)
> > +			return -ENOMEM;
> > +
> > +		pud_populate(NULL, dst_pudp, dst_pmdp);
> > +	}
> > +
> > +	dst_pmdp = pmd_offset(dst_pudp, start);
> > +	src_pmdp = pmd_offset(src_pudp, start);
> > +
> > +	do {
> > +		pmd_t pmd = READ_ONCE(*src_pmdp);
> > +
> > +		next = pmd_addr_end(addr, end);
> > +
> > +		if (pmd_none(pmd))
> > +			continue;
> > +
> > +		if (pmd_leaf(pmd)) {
> > +			set_pmd(dst_pmdp, __pmd(pmd_val(pmd) | pgprot_val(prot)));
> > +		} else {
> > +			ret = temp_pgtable_map_pte(dst_pmdp, src_pmdp, addr, next, prot);
> > +			if (ret)
> > +				return -ENOMEM;
> > +		}
> > +	} while (dst_pmdp++, src_pmdp++, addr = next, addr != end);
> > +
> > +	return 0;
> > +}
> > +
> > +static unsigned long temp_pgtable_map_pud(p4d_t *dst_p4dp, p4d_t *src_p4dp,
> > +					  unsigned long start,
> > +					  unsigned long end, pgprot_t prot)
> > +{
> > +	unsigned long addr = start;
> > +	unsigned long next;
> > +	unsigned long ret;
> > +	pud_t *dst_pudp;
> > +	pud_t *src_pudp;
> > +
> > +	if (p4d_none(READ_ONCE(*dst_p4dp))) {
> > +		dst_pudp = (pud_t *)get_safe_page(GFP_ATOMIC);
> > +		if (!dst_pudp)
> > +			return -ENOMEM;
> > +
> > +		p4d_populate(NULL, dst_p4dp, dst_pudp);
> > +	}
> > +
> > +	dst_pudp = pud_offset(dst_p4dp, start);
> > +	src_pudp = pud_offset(src_p4dp, start);
> > +
> > +	do {
> > +		pud_t pud = READ_ONCE(*src_pudp);
> > +
> > +		next = pud_addr_end(addr, end);
> > +
> > +		if (pud_none(pud))
> > +			continue;
> > +
> > +		if (pud_leaf(pud)) {
> > +			set_pud(dst_pudp, __pud(pud_val(pud) | pgprot_val(prot)));
> > +		} else {
> > +			ret = temp_pgtable_map_pmd(dst_pudp, src_pudp, addr, next, prot);
> > +			if (ret)
> > +				return -ENOMEM;
> > +		}
> > +	} while (dst_pudp++, src_pudp++, addr = next, addr != end);
> > +
> > +	return 0;
> > +}
> > +
> > +static unsigned long temp_pgtable_map_p4d(pgd_t *dst_pgdp, pgd_t *src_pgdp,
> > +					  unsigned long start, unsigned long end,
> > +					  pgprot_t prot)
> > +{
> > +	unsigned long addr = start;
> > +	unsigned long next;
> > +	unsigned long ret;
> > +	p4d_t *dst_p4dp;
> > +	p4d_t *src_p4dp;
> > +
> > +	if (pgd_none(READ_ONCE(*dst_pgdp))) {
> > +		dst_p4dp = (p4d_t *)get_safe_page(GFP_ATOMIC);
> > +		if (!dst_p4dp)
> > +			return -ENOMEM;
> > +
> > +		pgd_populate(NULL, dst_pgdp, dst_p4dp);
> > +	}
> > +
> > +	dst_p4dp = p4d_offset(dst_pgdp, start);
> > +	src_p4dp = p4d_offset(src_pgdp, start);
> > +
> > +	do {
> > +		p4d_t p4d = READ_ONCE(*src_p4dp);
> > +
> > +		next = p4d_addr_end(addr, end);
> > +
> > +		if (p4d_none(READ_ONCE(*src_p4dp)))
> > +			continue;
> > +
> > +		if (p4d_leaf(p4d)) {
> > +			set_p4d(dst_p4dp, __p4d(p4d_val(p4d) | pgprot_val(prot)));
> > +		} else {
> > +			ret = temp_pgtable_map_pud(dst_p4dp, src_p4dp, addr, next, prot);
> > +			if (ret)
> > +				return -ENOMEM;
> > +		}
> > +	} while (dst_p4dp++, src_p4dp++, addr = next, addr != end);
> > +
> > +	return 0;
> > +}
> > +
> > +static unsigned long temp_pgtable_mapping(pgd_t *pgdp)
> > +{
> > +	unsigned long end = (unsigned long)pfn_to_virt(max_low_pfn);
> > +	unsigned long addr = PAGE_OFFSET;
> > +	pgd_t *dst_pgdp = pgd_offset_pgd(pgdp, addr);
> > +	pgd_t *src_pgdp = pgd_offset_k(addr);
> > +	unsigned long next;
> > +
> > +	do {
> > +		next = pgd_addr_end(addr, end);
> > +		if (pgd_none(READ_ONCE(*src_pgdp)))
> > +			continue;
> > +
> > +		if (temp_pgtable_map_p4d(dst_pgdp, src_pgdp, addr, next, PAGE_KERNEL))
> > +			return -ENOMEM;
> > +	} while (dst_pgdp++, src_pgdp++, addr = next, addr != end);
> > +
> > +	return 0;
> > +}
> > +
> > +static unsigned long temp_pgtable_text_mapping(pgd_t *pgdp, unsigned long addr)
> > +{
> > +	pgd_t *dst_pgdp = pgd_offset_pgd(pgdp, addr);
> > +	pgd_t *src_pgdp = pgd_offset_k(addr);
> > +
> > +	if (pgd_none(READ_ONCE(*src_pgdp)))
> > +		return -EFAULT;
> > +
> > +	if (temp_pgtable_map_p4d(dst_pgdp, src_pgdp, addr, addr, PAGE_KERNEL_EXEC))
> > +		return -ENOMEM;
> > +
> > +	return 0;
> > +}
> > +
> > +static unsigned long relocate_restore_code(void)
> > +{
> > +	unsigned long ret;
> > +	void *page = (void *)get_safe_page(GFP_ATOMIC);
> > +
> > +	if (!page)
> > +		return -ENOMEM;
> > +
> > +	copy_page(page, hibernate_core_restore_code);
> > +
> > +	/* Make the page containing the relocated code executable. */
> > +	set_memory_x((unsigned long)page, 1);
> > +
> > +	ret = temp_pgtable_text_mapping(resume_pg_dir, (unsigned long)page);
> > +	if (ret)
> > +		return ret;
> > +
> > +	return (unsigned long)page;
> > +}
> > +
> > +int swsusp_arch_resume(void)
> > +{
> > +	unsigned long ret;
> > +
> > +	/*
> > +	 * Memory allocated by get_safe_page() will be dealt with by the hibernation core,
> > +	 * we don't need to free it here.
> > +	 */
> > +	resume_pg_dir = (pgd_t *)get_safe_page(GFP_ATOMIC);
> > +	if (!resume_pg_dir)
> > +		return -ENOMEM;
> > +
> > +	/*
> > +	 * The pages need to be writable when restoring the image.
> > +	 * Create a second copy of page table just for the linear map.
> > +	 * Use this temporary page table to restore the image.
> > +	 */
> > +	ret = temp_pgtable_mapping(resume_pg_dir);
> > +	if (ret)
> > +		return (int)ret;
> > +
> > +	/* Move the restore code to a new page so that it doesn't get overwritten by itself. */
> > +	relocated_restore_code = relocate_restore_code();
> > +	if (relocated_restore_code == -ENOMEM)
> > +		return -ENOMEM;
> > +
> > +	/*
> > +	 * Map the __hibernate_cpu_resume() address to the temporary page table so that the
> > +	 * restore code can jumps to it after finished restore the image. The next execution
> > +	 * code doesn't find itself in a different address space after switching over to the
> > +	 * original page table used by the hibernated image.
> > +	 */
> > +	ret = temp_pgtable_text_mapping(resume_pg_dir, (unsigned long)resume_hdr.restore_cpu_addr);
> > +	if (ret)
> > +		return ret;
> > +
> > +	hibernate_restore_image(resume_hdr.saved_satp, (PFN_DOWN(__pa(resume_pg_dir)) | satp_mode),
> > +				resume_hdr.restore_cpu_addr);
> > +
> > +	return 0;
> > +}
> > +
> > +#ifdef CONFIG_PM_SLEEP_SMP
> > +int hibernate_resume_nonboot_cpu_disable(void)
> > +{
> > +	if (sleep_cpu < 0) {
> > +		pr_err("Failing to resume from hibernate on an unknown CPU\n");
> > +		return -ENODEV;
> > +	}
> > +
> > +	return freeze_secondary_cpus(sleep_cpu);
> > +}
> > +#endif
> > +
> > +static int __init riscv_hibernate_init(void)
> > +{
> > +	hibernate_cpu_context = kzalloc(sizeof(*hibernate_cpu_context), GFP_KERNEL);
> > +
> > +	if (WARN_ON(!hibernate_cpu_context))
> > +		return -ENOMEM;
> > +
> > +	return 0;
> > +}
> > +
> > +early_initcall(riscv_hibernate_init);
> > --
> > 2.34.1
> >
> 
> Thanks,
> drew
Sia Jee Heng Feb. 24, 2023, 2:12 a.m. UTC | #3
Hi Alex,

Wondering if you have any comment on the v4 series?

Thanks
Regards
Jee Heng

> -----Original Message-----
> From: JeeHeng Sia <jeeheng.sia@starfivetech.com>
> Sent: Tuesday, 21 February, 2023 10:35 AM
> To: paul.walmsley@sifive.com; palmer@dabbelt.com; aou@eecs.berkeley.edu
> Cc: linux-riscv@lists.infradead.org; linux-kernel@vger.kernel.org; JeeHeng Sia <jeeheng.sia@starfivetech.com>; Leyfoon Tan
> <leyfoon.tan@starfivetech.com>; Mason Huo <mason.huo@starfivetech.com>
> Subject: [PATCH v4 4/4] RISC-V: Add arch functions to support hibernation/suspend-to-disk
> 
> Add low-level arch functions to support hibernation.
> swsusp_arch_suspend() relies on code from __cpu_suspend_enter() to write
> the cpu state onto the stack, then calls swsusp_save() to save the memory
> image.
> 
> An arch-specific hibernation header is implemented and is used by the
> arch_hibernation_header_restore() and arch_hibernation_header_save()
> functions. The arch-specific hibernation header consists of the satp,
> the hartid, and the cpu_resume address. The kernel's build version is also
> saved into the hibernation image header to make sure that only the same
> kernel is restored on resume.
> 
> swsusp_arch_resume() creates a temporary page table that covers only
> the linear map. It copies the restore code to a 'safe' page, then starts
> to restore the memory image. Once completed, it restores the original
> kernel's page table. It then calls into __hibernate_cpu_resume()
> to restore the CPU context. Finally, it follows the normal hibernation
> path back to the hibernation core.
> 
> To enable hibernation/suspend-to-disk on RISC-V, the below configs
> need to be enabled:
> - CONFIG_ARCH_HIBERNATION_HEADER
> - CONFIG_ARCH_HIBERNATION_POSSIBLE
> 
> Signed-off-by: Sia Jee Heng <jeeheng.sia@starfivetech.com>
> Reviewed-by: Ley Foon Tan <leyfoon.tan@starfivetech.com>
> Reviewed-by: Mason Huo <mason.huo@starfivetech.com>
> ---
>  arch/riscv/Kconfig                 |   7 +
>  arch/riscv/include/asm/assembler.h |  20 ++
>  arch/riscv/include/asm/suspend.h   |  19 ++
>  arch/riscv/kernel/Makefile         |   1 +
>  arch/riscv/kernel/asm-offsets.c    |   5 +
>  arch/riscv/kernel/hibernate-asm.S  |  77 +++++
>  arch/riscv/kernel/hibernate.c      | 447 +++++++++++++++++++++++++++++
>  7 files changed, 576 insertions(+)
>  create mode 100644 arch/riscv/kernel/hibernate-asm.S
>  create mode 100644 arch/riscv/kernel/hibernate.c
> 
> diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
> index e2b656043abf..4555848a817f 100644
> --- a/arch/riscv/Kconfig
> +++ b/arch/riscv/Kconfig
> @@ -690,6 +690,13 @@ menu "Power management options"
> 
>  source "kernel/power/Kconfig"
> 
> +config ARCH_HIBERNATION_POSSIBLE
> +	def_bool y
> +
> +config ARCH_HIBERNATION_HEADER
> +	def_bool y
> +	depends on HIBERNATION
> +
>  endmenu # "Power management options"
> 
>  menu "CPU Power Management"
> diff --git a/arch/riscv/include/asm/assembler.h b/arch/riscv/include/asm/assembler.h
> index 727a97735493..68c46c0e0ea8 100644
> --- a/arch/riscv/include/asm/assembler.h
> +++ b/arch/riscv/include/asm/assembler.h
> @@ -59,4 +59,24 @@
>  		REG_L	s11, (SUSPEND_CONTEXT_REGS + PT_S11)(a0)
>  	.endm
> 
> +/*
> + * copy_page - copy 1 page (4KB) of data from source to destination
> + * @a0 - destination
> + * @a1 - source
> + */
> +	.macro	copy_page a0, a1
> +		lui	a2, 0x1
> +		add	a2, a2, a0
> +1 :
> +		REG_L	t0, 0(a1)
> +		REG_L	t1, SZREG(a1)
> +
> +		REG_S	t0, 0(a0)
> +		REG_S	t1, SZREG(a0)
> +
> +		addi	a0, a0, 2 * SZREG
> +		addi	a1, a1, 2 * SZREG
> +		bne	a2, a0, 1b
> +	.endm
> +
>  #endif	/* __ASM_ASSEMBLER_H */
> diff --git a/arch/riscv/include/asm/suspend.h b/arch/riscv/include/asm/suspend.h
> index 75419c5ca272..3362da56a9d8 100644
> --- a/arch/riscv/include/asm/suspend.h
> +++ b/arch/riscv/include/asm/suspend.h
> @@ -21,6 +21,11 @@ struct suspend_context {
>  #endif
>  };
> 
> +/*
> + * Used by hibernation core and cleared during resume sequence
> + */
> +extern int in_suspend;
> +
>  /* Low-level CPU suspend entry function */
>  int __cpu_suspend_enter(struct suspend_context *context);
> 
> @@ -36,4 +41,18 @@ int __cpu_resume_enter(unsigned long hartid, unsigned long context);
>  /* Used to save and restore the csr */
>  void suspend_save_csrs(struct suspend_context *context);
>  void suspend_restore_csrs(struct suspend_context *context);
> +
> +/* Low-level API to support hibernation */
> +int swsusp_arch_suspend(void);
> +int swsusp_arch_resume(void);
> +int arch_hibernation_header_save(void *addr, unsigned int max_size);
> +int arch_hibernation_header_restore(void *addr);
> +int __hibernate_cpu_resume(void);
> +
> +/* Used to resume on the CPU we hibernated on */
> +int hibernate_resume_nonboot_cpu_disable(void);
> +
> +asmlinkage void hibernate_restore_image(unsigned long resume_satp, unsigned long satp_temp,
> +					unsigned long cpu_resume);
> +asmlinkage int hibernate_core_restore_code(void);
>  #endif
> diff --git a/arch/riscv/kernel/Makefile b/arch/riscv/kernel/Makefile
> index 4cf303a779ab..daab341d55e4 100644
> --- a/arch/riscv/kernel/Makefile
> +++ b/arch/riscv/kernel/Makefile
> @@ -64,6 +64,7 @@ obj-$(CONFIG_MODULES)		+= module.o
>  obj-$(CONFIG_MODULE_SECTIONS)	+= module-sections.o
> 
>  obj-$(CONFIG_CPU_PM)		+= suspend_entry.o suspend.o
> +obj-$(CONFIG_HIBERNATION)	+= hibernate.o hibernate-asm.o
> 
>  obj-$(CONFIG_FUNCTION_TRACER)	+= mcount.o ftrace.o
>  obj-$(CONFIG_DYNAMIC_FTRACE)	+= mcount-dyn.o
> diff --git a/arch/riscv/kernel/asm-offsets.c b/arch/riscv/kernel/asm-offsets.c
> index df9444397908..d6a75aac1d27 100644
> --- a/arch/riscv/kernel/asm-offsets.c
> +++ b/arch/riscv/kernel/asm-offsets.c
> @@ -9,6 +9,7 @@
>  #include <linux/kbuild.h>
>  #include <linux/mm.h>
>  #include <linux/sched.h>
> +#include <linux/suspend.h>
>  #include <asm/kvm_host.h>
>  #include <asm/thread_info.h>
>  #include <asm/ptrace.h>
> @@ -116,6 +117,10 @@ void asm_offsets(void)
> 
>  	OFFSET(SUSPEND_CONTEXT_REGS, suspend_context, regs);
> 
> +	OFFSET(HIBERN_PBE_ADDR, pbe, address);
> +	OFFSET(HIBERN_PBE_ORIG, pbe, orig_address);
> +	OFFSET(HIBERN_PBE_NEXT, pbe, next);
> +
>  	OFFSET(KVM_ARCH_GUEST_ZERO, kvm_vcpu_arch, guest_context.zero);
>  	OFFSET(KVM_ARCH_GUEST_RA, kvm_vcpu_arch, guest_context.ra);
>  	OFFSET(KVM_ARCH_GUEST_SP, kvm_vcpu_arch, guest_context.sp);
> diff --git a/arch/riscv/kernel/hibernate-asm.S b/arch/riscv/kernel/hibernate-asm.S
> new file mode 100644
> index 000000000000..846affe4dced
> --- /dev/null
> +++ b/arch/riscv/kernel/hibernate-asm.S
> @@ -0,0 +1,77 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/*
> + * Hibernation low level support for RISCV.
> + *
> + * Copyright (C) 2023 StarFive Technology Co., Ltd.
> + *
> + * Author: Jee Heng Sia <jeeheng.sia@starfivetech.com>
> + */
> +
> +#include <asm/asm.h>
> +#include <asm/asm-offsets.h>
> +#include <asm/assembler.h>
> +#include <asm/csr.h>
> +
> +#include <linux/linkage.h>
> +
> +/*
> + * int __hibernate_cpu_resume(void)
> + * Switch back to the hibernated image's page table prior to restoring the CPU
> + * context.
> + *
> + * Always returns 0
> + */
> +ENTRY(__hibernate_cpu_resume)
> +	/* switch to hibernated image's page table. */
> +	csrw CSR_SATP, s0
> +	sfence.vma
> +
> +	REG_L	a0, hibernate_cpu_context
> +
> +	restore_csr
> +	restore_reg
> +
> +	/* Return zero value. */
> +	add	a0, zero, zero
> +
> +	ret
> +END(__hibernate_cpu_resume)
> +
> +/*
> + * Prepare to restore the image.
> + * a0: satp of saved page tables.
> + * a1: satp of temporary page tables.
> + * a2: cpu_resume.
> + */
> +ENTRY(hibernate_restore_image)
> +	mv	s0, a0
> +	mv	s1, a1
> +	mv	s2, a2
> +	REG_L	s4, restore_pblist
> +	REG_L	a1, relocated_restore_code
> +
> +	jalr	a1
> +END(hibernate_restore_image)
> +
> +/*
> + * The below code will be executed from a 'safe' page.
> + * It first switches to the temporary page table, then starts to copy the pages
> + * back to the original memory location. Finally, it jumps to __hibernate_cpu_resume()
> + * to restore the CPU context.
> + */
> +ENTRY(hibernate_core_restore_code)
> +	/* switch to temp page table. */
> +	csrw satp, s1
> +	sfence.vma
> +.Lcopy:
> +	/* The below code will restore the hibernated image. */
> +	REG_L	a1, HIBERN_PBE_ADDR(s4)
> +	REG_L	a0, HIBERN_PBE_ORIG(s4)
> +
> +	copy_page a0, a1
> +
> +	REG_L	s4, HIBERN_PBE_NEXT(s4)
> +	bnez	s4, .Lcopy
> +
> +	jalr	s2
> +END(hibernate_core_restore_code)
> diff --git a/arch/riscv/kernel/hibernate.c b/arch/riscv/kernel/hibernate.c
> new file mode 100644
> index 000000000000..46a2f470db6e
> --- /dev/null
> +++ b/arch/riscv/kernel/hibernate.c
> @@ -0,0 +1,447 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Hibernation support for RISCV
> + *
> + * Copyright (C) 2023 StarFive Technology Co., Ltd.
> + *
> + * Author: Jee Heng Sia <jeeheng.sia@starfivetech.com>
> + */
> +
> +#include <asm/barrier.h>
> +#include <asm/cacheflush.h>
> +#include <asm/mmu_context.h>
> +#include <asm/page.h>
> +#include <asm/pgalloc.h>
> +#include <asm/pgtable.h>
> +#include <asm/sections.h>
> +#include <asm/set_memory.h>
> +#include <asm/smp.h>
> +#include <asm/suspend.h>
> +
> +#include <linux/cpu.h>
> +#include <linux/memblock.h>
> +#include <linux/pm.h>
> +#include <linux/sched.h>
> +#include <linux/suspend.h>
> +#include <linux/utsname.h>
> +
> +/* The logical cpu number we should resume on, initialised to a non-cpu number. */
> +static int sleep_cpu = -EINVAL;
> +
> +/* Pointer to the temporary resume page table. */
> +static pgd_t *resume_pg_dir;
> +
> +/* CPU context to be saved. */
> +struct suspend_context *hibernate_cpu_context;
> +EXPORT_SYMBOL_GPL(hibernate_cpu_context);
> +
> +unsigned long relocated_restore_code;
> +EXPORT_SYMBOL_GPL(relocated_restore_code);
> +
> +/**
> + * struct arch_hibernate_hdr_invariants - container to store kernel build version.
> + * @uts_version: to save the build number and date so that we do not resume with
> + *		a different kernel.
> + */
> +struct arch_hibernate_hdr_invariants {
> +	char		uts_version[__NEW_UTS_LEN + 1];
> +};
> +
> +/**
> + * struct arch_hibernate_hdr - helper parameters that help us to restore the image.
> + * @invariants: container to store kernel build version.
> + * @hartid: to make sure same boot_cpu executes the hibernate/restore code.
> + * @saved_satp: original page table used by the hibernated image.
> + * @restore_cpu_addr: the kernel's image address to restore the CPU context.
> + */
> +static struct arch_hibernate_hdr {
> +	struct arch_hibernate_hdr_invariants invariants;
> +	unsigned long	hartid;
> +	unsigned long	saved_satp;
> +	unsigned long	restore_cpu_addr;
> +} resume_hdr;
> +
> +static inline void arch_hdr_invariants(struct arch_hibernate_hdr_invariants *i)
> +{
> +	memset(i, 0, sizeof(*i));
> +	memcpy(i->uts_version, init_utsname()->version, sizeof(i->uts_version));
> +}
> +
> +/*
> + * Check if the given pfn is in the 'nosave' section.
> + */
> +int pfn_is_nosave(unsigned long pfn)
> +{
> +	unsigned long nosave_begin_pfn = sym_to_pfn(&__nosave_begin);
> +	unsigned long nosave_end_pfn = sym_to_pfn(&__nosave_end - 1);
> +
> +	return ((pfn >= nosave_begin_pfn) && (pfn <= nosave_end_pfn));
> +}
> +
> +void notrace save_processor_state(void)
> +{
> +	WARN_ON(num_online_cpus() != 1);
> +}
> +
> +void notrace restore_processor_state(void)
> +{
> +}
> +
> +/*
> + * Helper parameters need to be saved to the hibernation image header.
> + */
> +int arch_hibernation_header_save(void *addr, unsigned int max_size)
> +{
> +	struct arch_hibernate_hdr *hdr = addr;
> +
> +	if (max_size < sizeof(*hdr))
> +		return -EOVERFLOW;
> +
> +	arch_hdr_invariants(&hdr->invariants);
> +
> +	hdr->hartid = cpuid_to_hartid_map(sleep_cpu);
> +	hdr->saved_satp = csr_read(CSR_SATP);
> +	hdr->restore_cpu_addr = (unsigned long)__hibernate_cpu_resume;
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(arch_hibernation_header_save);
> +
> +/*
> + * Retrieve the helper parameters from the hibernation image header.
> + */
> +int arch_hibernation_header_restore(void *addr)
> +{
> +	struct arch_hibernate_hdr_invariants invariants;
> +	struct arch_hibernate_hdr *hdr = addr;
> +	int ret = 0;
> +
> +	arch_hdr_invariants(&invariants);
> +
> +	if (memcmp(&hdr->invariants, &invariants, sizeof(invariants))) {
> +		pr_crit("Hibernate image not generated by this kernel!\n");
> +		return -EINVAL;
> +	}
> +
> +	sleep_cpu = riscv_hartid_to_cpuid(hdr->hartid);
> +	if (sleep_cpu < 0) {
> +		pr_crit("Hibernated on a CPU not known to this kernel!\n");
> +		sleep_cpu = -EINVAL;
> +		return -EINVAL;
> +	}
> +
> +#ifdef CONFIG_SMP
> +	ret = bringup_hibernate_cpu(sleep_cpu);
> +	if (ret) {
> +		sleep_cpu = -EINVAL;
> +		return ret;
> +	}
> +#endif
> +	resume_hdr = *hdr;
> +
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(arch_hibernation_header_restore);
> +
> +int swsusp_arch_suspend(void)
> +{
> +	int ret = 0;
> +
> +	if (__cpu_suspend_enter(hibernate_cpu_context)) {
> +		sleep_cpu = smp_processor_id();
> +		suspend_save_csrs(hibernate_cpu_context);
> +		ret = swsusp_save();
> +	} else {
> +		suspend_restore_csrs(hibernate_cpu_context);
> +		flush_tlb_all();
> +		flush_icache_all();
> +
> +		/*
> +		 * Tell the hibernation core that we've just restored the memory.
> +		 */
> +		in_suspend = 0;
> +		sleep_cpu = -EINVAL;
> +	}
> +
> +	return ret;
> +}
> +
> +static unsigned long _temp_pgtable_map_pte(pte_t *dst_ptep, pte_t *src_ptep,
> +					   unsigned long addr, pgprot_t prot)
> +{
> +	pte_t pte = READ_ONCE(*src_ptep);
> +
> +	if (pte_present(pte))
> +		set_pte(dst_ptep, __pte(pte_val(pte) | pgprot_val(prot)));
> +
> +	return 0;
> +}
> +
> +static unsigned long temp_pgtable_map_pte(pmd_t *dst_pmdp, pmd_t *src_pmdp,
> +					  unsigned long start, unsigned long end,
> +					  pgprot_t prot)
> +{
> +	unsigned long addr = start;
> +	pte_t *src_ptep;
> +	pte_t *dst_ptep;
> +
> +	if (pmd_none(READ_ONCE(*dst_pmdp))) {
> +		dst_ptep = (pte_t *)get_safe_page(GFP_ATOMIC);
> +		if (!dst_ptep)
> +			return -ENOMEM;
> +
> +		pmd_populate_kernel(NULL, dst_pmdp, dst_ptep);
> +	}
> +
> +	dst_ptep = pte_offset_kernel(dst_pmdp, start);
> +	src_ptep = pte_offset_kernel(src_pmdp, start);
> +
> +	do {
> +		_temp_pgtable_map_pte(dst_ptep, src_ptep, addr, prot);
> +	} while (dst_ptep++, src_ptep++, addr += PAGE_SIZE, addr < end);
> +
> +	return 0;
> +}
> +
> +static unsigned long temp_pgtable_map_pmd(pud_t *dst_pudp, pud_t *src_pudp,
> +					  unsigned long start, unsigned long end,
> +					  pgprot_t prot)
> +{
> +	unsigned long addr = start;
> +	unsigned long next;
> +	unsigned long ret;
> +	pmd_t *src_pmdp;
> +	pmd_t *dst_pmdp;
> +
> +	if (pud_none(READ_ONCE(*dst_pudp))) {
> +		dst_pmdp = (pmd_t *)get_safe_page(GFP_ATOMIC);
> +		if (!dst_pmdp)
> +			return -ENOMEM;
> +
> +		pud_populate(NULL, dst_pudp, dst_pmdp);
> +	}
> +
> +	dst_pmdp = pmd_offset(dst_pudp, start);
> +	src_pmdp = pmd_offset(src_pudp, start);
> +
> +	do {
> +		pmd_t pmd = READ_ONCE(*src_pmdp);
> +
> +		next = pmd_addr_end(addr, end);
> +
> +		if (pmd_none(pmd))
> +			continue;
> +
> +		if (pmd_leaf(pmd)) {
> +			set_pmd(dst_pmdp, __pmd(pmd_val(pmd) | pgprot_val(prot)));
> +		} else {
> +			ret = temp_pgtable_map_pte(dst_pmdp, src_pmdp, addr, next, prot);
> +			if (ret)
> +				return -ENOMEM;
> +		}
> +	} while (dst_pmdp++, src_pmdp++, addr = next, addr != end);
> +
> +	return 0;
> +}
> +
> +static unsigned long temp_pgtable_map_pud(p4d_t *dst_p4dp, p4d_t *src_p4dp,
> +					  unsigned long start,
> +					  unsigned long end, pgprot_t prot)
> +{
> +	unsigned long addr = start;
> +	unsigned long next;
> +	unsigned long ret;
> +	pud_t *dst_pudp;
> +	pud_t *src_pudp;
> +
> +	if (p4d_none(READ_ONCE(*dst_p4dp))) {
> +		dst_pudp = (pud_t *)get_safe_page(GFP_ATOMIC);
> +		if (!dst_pudp)
> +			return -ENOMEM;
> +
> +		p4d_populate(NULL, dst_p4dp, dst_pudp);
> +	}
> +
> +	dst_pudp = pud_offset(dst_p4dp, start);
> +	src_pudp = pud_offset(src_p4dp, start);
> +
> +	do {
> +		pud_t pud = READ_ONCE(*src_pudp);
> +
> +		next = pud_addr_end(addr, end);
> +
> +		if (pud_none(pud))
> +			continue;
> +
> +		if (pud_leaf(pud)) {
> +			set_pud(dst_pudp, __pud(pud_val(pud) | pgprot_val(prot)));
> +		} else {
> +			ret = temp_pgtable_map_pmd(dst_pudp, src_pudp, addr, next, prot);
> +			if (ret)
> +				return -ENOMEM;
> +		}
> +	} while (dst_pudp++, src_pudp++, addr = next, addr != end);
> +
> +	return 0;
> +}
> +
> +static unsigned long temp_pgtable_map_p4d(pgd_t *dst_pgdp, pgd_t *src_pgdp,
> +					  unsigned long start, unsigned long end,
> +					  pgprot_t prot)
> +{
> +	unsigned long addr = start;
> +	unsigned long next;
> +	unsigned long ret;
> +	p4d_t *dst_p4dp;
> +	p4d_t *src_p4dp;
> +
> +	if (pgd_none(READ_ONCE(*dst_pgdp))) {
> +		dst_p4dp = (p4d_t *)get_safe_page(GFP_ATOMIC);
> +		if (!dst_p4dp)
> +			return -ENOMEM;
> +
> +		pgd_populate(NULL, dst_pgdp, dst_p4dp);
> +	}
> +
> +	dst_p4dp = p4d_offset(dst_pgdp, start);
> +	src_p4dp = p4d_offset(src_pgdp, start);
> +
> +	do {
> +		p4d_t p4d = READ_ONCE(*src_p4dp);
> +
> +		next = p4d_addr_end(addr, end);
> +
> +		if (p4d_none(READ_ONCE(*src_p4dp)))
> +			continue;
> +
> +		if (p4d_leaf(p4d)) {
> +			set_p4d(dst_p4dp, __p4d(p4d_val(p4d) | pgprot_val(prot)));
> +		} else {
> +			ret = temp_pgtable_map_pud(dst_p4dp, src_p4dp, addr, next, prot);
> +			if (ret)
> +				return -ENOMEM;
> +		}
> +	} while (dst_p4dp++, src_p4dp++, addr = next, addr != end);
> +
> +	return 0;
> +}
> +
> +static unsigned long temp_pgtable_mapping(pgd_t *pgdp)
> +{
> +	unsigned long end = (unsigned long)pfn_to_virt(max_low_pfn);
> +	unsigned long addr = PAGE_OFFSET;
> +	pgd_t *dst_pgdp = pgd_offset_pgd(pgdp, addr);
> +	pgd_t *src_pgdp = pgd_offset_k(addr);
> +	unsigned long next;
> +
> +	do {
> +		next = pgd_addr_end(addr, end);
> +		if (pgd_none(READ_ONCE(*src_pgdp)))
> +			continue;
> +
> +		if (temp_pgtable_map_p4d(dst_pgdp, src_pgdp, addr, next, PAGE_KERNEL))
> +			return -ENOMEM;
> +	} while (dst_pgdp++, src_pgdp++, addr = next, addr != end);
> +
> +	return 0;
> +}
> +
> +static unsigned long temp_pgtable_text_mapping(pgd_t *pgdp, unsigned long addr)
> +{
> +	pgd_t *dst_pgdp = pgd_offset_pgd(pgdp, addr);
> +	pgd_t *src_pgdp = pgd_offset_k(addr);
> +
> +	if (pgd_none(READ_ONCE(*src_pgdp)))
> +		return -EFAULT;
> +
> +	if (temp_pgtable_map_p4d(dst_pgdp, src_pgdp, addr, addr, PAGE_KERNEL_EXEC))
> +		return -ENOMEM;
> +
> +	return 0;
> +}
> +
> +static unsigned long relocate_restore_code(void)
> +{
> +	unsigned long ret;
> +	void *page = (void *)get_safe_page(GFP_ATOMIC);
> +
> +	if (!page)
> +		return -ENOMEM;
> +
> +	copy_page(page, hibernate_core_restore_code);
> +
> +	/* Make the page containing the relocated code executable. */
> +	set_memory_x((unsigned long)page, 1);
> +
> +	ret = temp_pgtable_text_mapping(resume_pg_dir, (unsigned long)page);
> +	if (ret)
> +		return ret;
> +
> +	return (unsigned long)page;
> +}
> +
> +int swsusp_arch_resume(void)
> +{
> +	unsigned long ret;
> +
> +	/*
> +	 * Memory allocated by get_safe_page() will be dealt with by the hibernation core,
> +	 * so we don't need to free it here.
> +	 */
> +	resume_pg_dir = (pgd_t *)get_safe_page(GFP_ATOMIC);
> +	if (!resume_pg_dir)
> +		return -ENOMEM;
> +
> +	/*
> +	 * The pages need to be writable when restoring the image.
> +	 * Create a second copy of the page table just for the linear map.
> +	 * Use this temporary page table to restore the image.
> +	 */
> +	ret = temp_pgtable_mapping(resume_pg_dir);
> +	if (ret)
> +		return (int)ret;
> +
> +	/* Move the restore code to a new page so that it won't be overwritten when the image is restored. */
> +	relocated_restore_code = relocate_restore_code();
> +	if (relocated_restore_code == -ENOMEM)
> +		return -ENOMEM;
> +
> +	/*
> +	 * Map the __hibernate_cpu_resume() address into the temporary page table so that the
> +	 * restore code can jump to it once it has finished restoring the image. This ensures
> +	 * that the code does not find itself in a different address space after switching over
> +	 * to the original page table used by the hibernated image.
> +	 */
> +	ret = temp_pgtable_text_mapping(resume_pg_dir, (unsigned long)resume_hdr.restore_cpu_addr);
> +	if (ret)
> +		return ret;
> +
> +	hibernate_restore_image(resume_hdr.saved_satp, (PFN_DOWN(__pa(resume_pg_dir)) | satp_mode),
> +				resume_hdr.restore_cpu_addr);
> +
> +	return 0;
> +}
> +
> +#ifdef CONFIG_PM_SLEEP_SMP
> +int hibernate_resume_nonboot_cpu_disable(void)
> +{
> +	if (sleep_cpu < 0) {
> +		pr_err("Failing to resume from hibernate on an unknown CPU\n");
> +		return -ENODEV;
> +	}
> +
> +	return freeze_secondary_cpus(sleep_cpu);
> +}
> +#endif
> +
> +static int __init riscv_hibernate_init(void)
> +{
> +	hibernate_cpu_context = kzalloc(sizeof(*hibernate_cpu_context), GFP_KERNEL);
> +
> +	if (WARN_ON(!hibernate_cpu_context))
> +		return -ENOMEM;
> +
> +	return 0;
> +}
> +
> +early_initcall(riscv_hibernate_init);
> --
> 2.34.1
Andrew Jones Feb. 24, 2023, 9 a.m. UTC | #4
On Fri, Feb 24, 2023 at 02:05:43AM +0000, JeeHeng Sia wrote:
> 
> 
> > -----Original Message-----
> > From: Andrew Jones <ajones@ventanamicro.com>
> > Sent: Friday, 24 February, 2023 2:07 AM
> > To: JeeHeng Sia <jeeheng.sia@starfivetech.com>
> > Cc: paul.walmsley@sifive.com; palmer@dabbelt.com; aou@eecs.berkeley.edu; linux-riscv@lists.infradead.org; linux-
> > kernel@vger.kernel.org; Leyfoon Tan <leyfoon.tan@starfivetech.com>; Mason Huo <mason.huo@starfivetech.com>
> > Subject: Re: [PATCH v4 4/4] RISC-V: Add arch functions to support hibernation/suspend-to-disk
> > 
> > On Tue, Feb 21, 2023 at 10:35:23AM +0800, Sia Jee Heng wrote:
> > > Low level Arch functions were created to support hibernation.
> > > swsusp_arch_suspend() relies code from __cpu_suspend_enter() to write
> > > cpu state onto the stack, then calling swsusp_save() to save the memory
> > > image.
> > >
> > > Arch specific hibernation header is implemented and is utilized by the
> > > arch_hibernation_header_restore() and arch_hibernation_header_save()
> > > functions. The arch specific hibernation header consists of satp, hartid,
> > > and the cpu_resume address. The kernel built version is also need to be
> > > saved into the hibernation image header to making sure only the same
> > > kernel is restore when resume.
> > >
> > > swsusp_arch_resume() creates a temporary page table that covering only
> > > the linear map. It copies the restore code to a 'safe' page, then start
> > > to restore the memory image. Once completed, it restores the original
> > > kernel's page table. It then calls into __hibernate_cpu_resume()
> > > to restore the CPU context. Finally, it follows the normal hibernation
> > > path back to the hibernation core.
> > >
> > > To enable hibernation/suspend to disk into RISCV, the below config
> > > need to be enabled:
> > > - CONFIG_ARCH_HIBERNATION_HEADER
> > > - CONFIG_ARCH_HIBERNATION_POSSIBLE
> > >
> > > Signed-off-by: Sia Jee Heng <jeeheng.sia@starfivetech.com>
> > > Reviewed-by: Ley Foon Tan <leyfoon.tan@starfivetech.com>
> > > Reviewed-by: Mason Huo <mason.huo@starfivetech.com>
> > > ---
> > >  arch/riscv/Kconfig                 |   7 +
> > >  arch/riscv/include/asm/assembler.h |  20 ++
> > >  arch/riscv/include/asm/suspend.h   |  19 ++
> > >  arch/riscv/kernel/Makefile         |   1 +
> > >  arch/riscv/kernel/asm-offsets.c    |   5 +
> > >  arch/riscv/kernel/hibernate-asm.S  |  77 +++++
> > >  arch/riscv/kernel/hibernate.c      | 447 +++++++++++++++++++++++++++++
> > >  7 files changed, 576 insertions(+)
> > >  create mode 100644 arch/riscv/kernel/hibernate-asm.S
> > >  create mode 100644 arch/riscv/kernel/hibernate.c
> > >
> > > diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
> > > index e2b656043abf..4555848a817f 100644
> > > --- a/arch/riscv/Kconfig
> > > +++ b/arch/riscv/Kconfig
> > > @@ -690,6 +690,13 @@ menu "Power management options"
> > >
> > >  source "kernel/power/Kconfig"
> > >
> > > +config ARCH_HIBERNATION_POSSIBLE
> > > +	def_bool y
> > > +
> > > +config ARCH_HIBERNATION_HEADER
> > > +	def_bool y
> > > +	depends on HIBERNATION
> > 
> > nit: I think this can be simplified as def_bool HIBERNATION
> good suggestion. will change it.
> > 
> > > +
> > >  endmenu # "Power management options"
> > >
> > >  menu "CPU Power Management"
> > > diff --git a/arch/riscv/include/asm/assembler.h b/arch/riscv/include/asm/assembler.h
> > > index 727a97735493..68c46c0e0ea8 100644
> > > --- a/arch/riscv/include/asm/assembler.h
> > > +++ b/arch/riscv/include/asm/assembler.h
> > > @@ -59,4 +59,24 @@
> > >  		REG_L	s11, (SUSPEND_CONTEXT_REGS + PT_S11)(a0)
> > >  	.endm
> > >
> > > +/*
> > > + * copy_page - copy 1 page (4KB) of data from source to destination
> > > + * @a0 - destination
> > > + * @a1 - source
> > > + */
> > > +	.macro	copy_page a0, a1
> > > +		lui	a2, 0x1
> > > +		add	a2, a2, a0
> > > +1 :
> >     ^ please remove this space
> can't remove it otherwise checkpatch will throws ERROR: spaces required around that ':'

Oh, right, labels in macros have this requirement.

> > 
> > > +		REG_L	t0, 0(a1)
> > > +		REG_L	t1, SZREG(a1)
> > > +
> > > +		REG_S	t0, 0(a0)
> > > +		REG_S	t1, SZREG(a0)
> > > +
> > > +		addi	a0, a0, 2 * SZREG
> > > +		addi	a1, a1, 2 * SZREG
> > > +		bne	a2, a0, 1b
> > > +	.endm
> > > +
> > >  #endif	/* __ASM_ASSEMBLER_H */
> > > diff --git a/arch/riscv/include/asm/suspend.h b/arch/riscv/include/asm/suspend.h
> > > index 75419c5ca272..3362da56a9d8 100644
> > > --- a/arch/riscv/include/asm/suspend.h
> > > +++ b/arch/riscv/include/asm/suspend.h
> > > @@ -21,6 +21,11 @@ struct suspend_context {
> > >  #endif
> > >  };
> > >
> > > +/*
> > > + * Used by hibernation core and cleared during resume sequence
> > > + */
> > > +extern int in_suspend;
> > > +
> > >  /* Low-level CPU suspend entry function */
> > >  int __cpu_suspend_enter(struct suspend_context *context);
> > >
> > > @@ -36,4 +41,18 @@ int __cpu_resume_enter(unsigned long hartid, unsigned long context);
> > >  /* Used to save and restore the csr */
> > >  void suspend_save_csrs(struct suspend_context *context);
> > >  void suspend_restore_csrs(struct suspend_context *context);
> > > +
> > > +/* Low-level API to support hibernation */
> > > +int swsusp_arch_suspend(void);
> > > +int swsusp_arch_resume(void);
> > > +int arch_hibernation_header_save(void *addr, unsigned int max_size);
> > > +int arch_hibernation_header_restore(void *addr);
> > > +int __hibernate_cpu_resume(void);
> > > +
> > > +/* Used to resume on the CPU we hibernated on */
> > > +int hibernate_resume_nonboot_cpu_disable(void);
> > > +
> > > +asmlinkage void hibernate_restore_image(unsigned long resume_satp, unsigned long satp_temp,
> > > +					unsigned long cpu_resume);
> > > +asmlinkage int hibernate_core_restore_code(void);
> > >  #endif
> > > diff --git a/arch/riscv/kernel/Makefile b/arch/riscv/kernel/Makefile
> > > index 4cf303a779ab..daab341d55e4 100644
> > > --- a/arch/riscv/kernel/Makefile
> > > +++ b/arch/riscv/kernel/Makefile
> > > @@ -64,6 +64,7 @@ obj-$(CONFIG_MODULES)		+= module.o
> > >  obj-$(CONFIG_MODULE_SECTIONS)	+= module-sections.o
> > >
> > >  obj-$(CONFIG_CPU_PM)		+= suspend_entry.o suspend.o
> > > +obj-$(CONFIG_HIBERNATION)	+= hibernate.o hibernate-asm.o
> > >
> > >  obj-$(CONFIG_FUNCTION_TRACER)	+= mcount.o ftrace.o
> > >  obj-$(CONFIG_DYNAMIC_FTRACE)	+= mcount-dyn.o
> > > diff --git a/arch/riscv/kernel/asm-offsets.c b/arch/riscv/kernel/asm-offsets.c
> > > index df9444397908..d6a75aac1d27 100644
> > > --- a/arch/riscv/kernel/asm-offsets.c
> > > +++ b/arch/riscv/kernel/asm-offsets.c
> > > @@ -9,6 +9,7 @@
> > >  #include <linux/kbuild.h>
> > >  #include <linux/mm.h>
> > >  #include <linux/sched.h>
> > > +#include <linux/suspend.h>
> > >  #include <asm/kvm_host.h>
> > >  #include <asm/thread_info.h>
> > >  #include <asm/ptrace.h>
> > > @@ -116,6 +117,10 @@ void asm_offsets(void)
> > >
> > >  	OFFSET(SUSPEND_CONTEXT_REGS, suspend_context, regs);
> > >
> > > +	OFFSET(HIBERN_PBE_ADDR, pbe, address);
> > > +	OFFSET(HIBERN_PBE_ORIG, pbe, orig_address);
> > > +	OFFSET(HIBERN_PBE_NEXT, pbe, next);
> > > +
> > >  	OFFSET(KVM_ARCH_GUEST_ZERO, kvm_vcpu_arch, guest_context.zero);
> > >  	OFFSET(KVM_ARCH_GUEST_RA, kvm_vcpu_arch, guest_context.ra);
> > >  	OFFSET(KVM_ARCH_GUEST_SP, kvm_vcpu_arch, guest_context.sp);
> > > diff --git a/arch/riscv/kernel/hibernate-asm.S b/arch/riscv/kernel/hibernate-asm.S
> > > new file mode 100644
> > > index 000000000000..846affe4dced
> > > --- /dev/null
> > > +++ b/arch/riscv/kernel/hibernate-asm.S
> > > @@ -0,0 +1,77 @@
> > > +/* SPDX-License-Identifier: GPL-2.0-only */
> > > +/*
> > > + * Hibernation low level support for RISCV.
> > > + *
> > > + * Copyright (C) 2023 StarFive Technology Co., Ltd.
> > > + *
> > > + * Author: Jee Heng Sia <jeeheng.sia@starfivetech.com>
> > > + */
> > > +
> > > +#include <asm/asm.h>
> > > +#include <asm/asm-offsets.h>
> > > +#include <asm/assembler.h>
> > > +#include <asm/csr.h>
> > > +
> > > +#include <linux/linkage.h>
> > > +
> > > +/*
> > > + * int __hibernate_cpu_resume(void)
> > > + * Switch back to the hibernated image's page table prior to restoring the CPU
> > > + * context.
> > > + *
> > > + * Always returns 0
> > > + */
> > > +ENTRY(__hibernate_cpu_resume)
> > > +	/* switch to hibernated image's page table. */
> > > +	csrw CSR_SATP, s0
> > > +	sfence.vma
> > > +
> > > +	REG_L	a0, hibernate_cpu_context
> > > +
> > > +	restore_csr
> > > +	restore_reg
> > > +
> > > +	/* Return zero value. */
> > > +	add	a0, zero, zero
> > 
> > nit: mv a0, zero
> sure
> > 
> > > +
> > > +	ret
> > > +END(__hibernate_cpu_resume)
> > > +
> > > +/*
> > > + * Prepare to restore the image.
> > > + * a0: satp of saved page tables.
> > > + * a1: satp of temporary page tables.
> > > + * a2: cpu_resume.
> > > + */
> > > +ENTRY(hibernate_restore_image)
> > > +	mv	s0, a0
> > > +	mv	s1, a1
> > > +	mv	s2, a2
> > > +	REG_L	s4, restore_pblist
> > > +	REG_L	a1, relocated_restore_code
> > > +
> > > +	jalr	a1
> > > +END(hibernate_restore_image)
> > > +
> > > +/*
> > > + * The below code will be executed from a 'safe' page.
> > > + * It first switches to the temporary page table, then starts to copy the pages
> > > + * back to the original memory location. Finally, it jumps to __hibernate_cpu_resume()
> > > + * to restore the CPU context.
> > > + */
> > > +ENTRY(hibernate_core_restore_code)
> > > +	/* switch to temp page table. */
> > > +	csrw satp, s1
> > > +	sfence.vma
> > > +.Lcopy:
> > > +	/* The below code will restore the hibernated image. */
> > > +	REG_L	a1, HIBERN_PBE_ADDR(s4)
> > > +	REG_L	a0, HIBERN_PBE_ORIG(s4)
> > 
> > Are we sure restore_pblist will never be NULL?
> restore_pblist is a link-list, it will be null during initialization or during page clean up by hibernation core. During the initial resume process, the hibernation core will check the header and load the pages. If everything works correctly, the page will be linked to the restore_pblist and then invoke swsusp_arch_resume() else hibernation core will throws error and failed to resume from the hibernated image.

I know restore_pblist is a linked-list and this doesn't answer the
question. The comment above restore_pblist says

/*
 * List of PBEs needed for restoring the pages that were allocated before
 * the suspend and included in the suspend image, but have also been
 * allocated by the "resume" kernel, so their contents cannot be written
 * directly to their "original" page frames.
 */

which implies the pages that end up on this list are "special". My
question is whether or not we're guaranteed to have at least one
of these special pages. If not, we shouldn't assume s4 is non-null.
If so, then a comment stating why that's guaranteed would be nice.
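If it's not guaranteed, then an (untested) sketch of a guard on the C
side, before we ever jump to the relocated code, could be as simple as

	/* hypothetical guard before calling hibernate_restore_image() */
	if (!restore_pblist) {
		pr_err("hibernate: restore_pblist is empty, nothing to restore\n");
		return -EINVAL;
	}

(error code and message are just placeholders; restore_pblist should
already be visible here through linux/suspend.h).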

> > 
> > > +
> > > +	copy_page a0, a1
> > > +
> > > +	REG_L	s4, HIBERN_PBE_NEXT(s4)
> > > +	bnez	s4, .Lcopy
> > > +
> > > +	jalr	s2
> > > +END(hibernate_core_restore_code)
> > > diff --git a/arch/riscv/kernel/hibernate.c b/arch/riscv/kernel/hibernate.c
> > > new file mode 100644
> > > index 000000000000..46a2f470db6e
> > > --- /dev/null
> > > +++ b/arch/riscv/kernel/hibernate.c
> > > @@ -0,0 +1,447 @@
> > > +// SPDX-License-Identifier: GPL-2.0-only
> > > +/*
> > > + * Hibernation support for RISCV
> > > + *
> > > + * Copyright (C) 2023 StarFive Technology Co., Ltd.
> > > + *
> > > + * Author: Jee Heng Sia <jeeheng.sia@starfivetech.com>
> > > + */
> > > +
> > > +#include <asm/barrier.h>
> > > +#include <asm/cacheflush.h>
> > > +#include <asm/mmu_context.h>
> > > +#include <asm/page.h>
> > > +#include <asm/pgalloc.h>
> > > +#include <asm/pgtable.h>
> > > +#include <asm/sections.h>
> > > +#include <asm/set_memory.h>
> > > +#include <asm/smp.h>
> > > +#include <asm/suspend.h>
> > > +
> > > +#include <linux/cpu.h>
> > > +#include <linux/memblock.h>
> > > +#include <linux/pm.h>
> > > +#include <linux/sched.h>
> > > +#include <linux/suspend.h>
> > > +#include <linux/utsname.h>
> > > +
> > > +/* The logical cpu number we should resume on, initialised to a non-cpu number. */
> > > +static int sleep_cpu = -EINVAL;
> > > +
> > > +/* Pointer to the temporary resume page table. */
> > > +static pgd_t *resume_pg_dir;
> > > +
> > > +/* CPU context to be saved. */
> > > +struct suspend_context *hibernate_cpu_context;
> > > +EXPORT_SYMBOL_GPL(hibernate_cpu_context);
> > > +
> > > +unsigned long relocated_restore_code;
> > > +EXPORT_SYMBOL_GPL(relocated_restore_code);
> > > +
> > > +/**
> > > + * struct arch_hibernate_hdr_invariants - container to store kernel build version.
> > > + * @uts_version: to save the build number and date so that the we do not resume with
> > > + *		a different kernel.
> > > + */
> > > +struct arch_hibernate_hdr_invariants {
> > > +	char		uts_version[__NEW_UTS_LEN + 1];
> > > +};
> > > +
> > > +/**
> > > + * struct arch_hibernate_hdr - helper parameters that help us to restore the image.
> > > + * @invariants: container to store kernel build version.
> > > + * @hartid: to make sure same boot_cpu executes the hibernate/restore code.
> > > + * @saved_satp: original page table used by the hibernated image.
> > > + * @restore_cpu_addr: the kernel's image address to restore the CPU context.
> > > + */
> > > +static struct arch_hibernate_hdr {
> > > +	struct arch_hibernate_hdr_invariants invariants;
> > > +	unsigned long	hartid;
> > > +	unsigned long	saved_satp;
> > > +	unsigned long	restore_cpu_addr;
> > > +} resume_hdr;
> > > +
> > > +static inline void arch_hdr_invariants(struct arch_hibernate_hdr_invariants *i)
> > > +{
> > > +	memset(i, 0, sizeof(*i));
> > > +	memcpy(i->uts_version, init_utsname()->version, sizeof(i->uts_version));
> > > +}
> > > +
> > > +/*
> > > + * Check if the given pfn is in the 'nosave' section.
> > > + */
> > > +int pfn_is_nosave(unsigned long pfn)
> > > +{
> > > +	unsigned long nosave_begin_pfn = sym_to_pfn(&__nosave_begin);
> > > +	unsigned long nosave_end_pfn = sym_to_pfn(&__nosave_end - 1);
> > > +
> > > +	return ((pfn >= nosave_begin_pfn) && (pfn <= nosave_end_pfn));
> > > +}
> > > +
> > > +void notrace save_processor_state(void)
> > > +{
> > > +	WARN_ON(num_online_cpus() != 1);
> > > +}
> > > +
> > > +void notrace restore_processor_state(void)
> > > +{
> > > +}
> > > +
> > > +/*
> > > + * Helper parameters need to be saved to the hibernation image header.
> > > + */
> > > +int arch_hibernation_header_save(void *addr, unsigned int max_size)
> > > +{
> > > +	struct arch_hibernate_hdr *hdr = addr;
> > > +
> > > +	if (max_size < sizeof(*hdr))
> > > +		return -EOVERFLOW;
> > > +
> > > +	arch_hdr_invariants(&hdr->invariants);
> > > +
> > > +	hdr->hartid = cpuid_to_hartid_map(sleep_cpu);
> > > +	hdr->saved_satp = csr_read(CSR_SATP);
> > > +	hdr->restore_cpu_addr = (unsigned long)__hibernate_cpu_resume;
> > > +
> > > +	return 0;
> > > +}
> > > +EXPORT_SYMBOL_GPL(arch_hibernation_header_save);
> > > +
> > > +/*
> > > + * Retrieve the helper parameters from the hibernation image header.
> > > + */
> > > +int arch_hibernation_header_restore(void *addr)
> > > +{
> > > +	struct arch_hibernate_hdr_invariants invariants;
> > > +	struct arch_hibernate_hdr *hdr = addr;
> > > +	int ret = 0;
> > > +
> > > +	arch_hdr_invariants(&invariants);
> > > +
> > > +	if (memcmp(&hdr->invariants, &invariants, sizeof(invariants))) {
> > > +		pr_crit("Hibernate image not generated by this kernel!\n");
> > > +		return -EINVAL;
> > > +	}
> > > +
> > > +	sleep_cpu = riscv_hartid_to_cpuid(hdr->hartid);
> > > +	if (sleep_cpu < 0) {
> > > +		pr_crit("Hibernated on a CPU not known to this kernel!\n");
> > > +		sleep_cpu = -EINVAL;
> > > +		return -EINVAL;
> > > +	}
> > > +
> > > +#ifdef CONFIG_SMP
> > > +	ret = bringup_hibernate_cpu(sleep_cpu);
> > > +	if (ret) {
> > > +		sleep_cpu = -EINVAL;
> > > +		return ret;
> > > +	}
> > > +#endif
> > > +	resume_hdr = *hdr;
> > > +
> > > +	return ret;
> > > +}
> > > +EXPORT_SYMBOL_GPL(arch_hibernation_header_restore);
> > > +
> > > +int swsusp_arch_suspend(void)
> > > +{
> > > +	int ret = 0;
> > > +
> > > +	if (__cpu_suspend_enter(hibernate_cpu_context)) {
> > > +		sleep_cpu = smp_processor_id();
> > > +		suspend_save_csrs(hibernate_cpu_context);
> > > +		ret = swsusp_save();
> > > +	} else {
> > > +		suspend_restore_csrs(hibernate_cpu_context);
> > > +		flush_tlb_all();
> > > +		flush_icache_all();
> > > +
> > > +		/*
> > > +		 * Tell the hibernation core that we've just restored the memory.
> > > +		 */
> > > +		in_suspend = 0;
> > > +		sleep_cpu = -EINVAL;
> > > +	}
> > > +
> > > +	return ret;
> > > +}
> > > +
> > > +static unsigned long _temp_pgtable_map_pte(pte_t *dst_ptep, pte_t *src_ptep,
> > > +					   unsigned long addr, pgprot_t prot)
> > > +{
> > > +	pte_t pte = READ_ONCE(*src_ptep);
> > > +
> > > +	if (pte_present(pte))
> > > +		set_pte(dst_ptep, __pte(pte_val(pte) | pgprot_val(prot)));
> > > +
> > > +	return 0;
> > > +}
> > > +
> > > +static unsigned long temp_pgtable_map_pte(pmd_t *dst_pmdp, pmd_t *src_pmdp,
> > > +					  unsigned long start, unsigned long end,
> > > +					  pgprot_t prot)
> > > +{
> > > +	unsigned long addr = start;
> > > +	pte_t *src_ptep;
> > > +	pte_t *dst_ptep;
> > > +
> > > +	if (pmd_none(READ_ONCE(*dst_pmdp))) {
> > > +		dst_ptep = (pte_t *)get_safe_page(GFP_ATOMIC);
> > > +		if (!dst_ptep)
> > > +			return -ENOMEM;
> > > +
> > > +		pmd_populate_kernel(NULL, dst_pmdp, dst_ptep);
> > > +	}
> > > +
> > > +	dst_ptep = pte_offset_kernel(dst_pmdp, start);
> > > +	src_ptep = pte_offset_kernel(src_pmdp, start);
> > > +
> > > +	do {
> > > +		_temp_pgtable_map_pte(dst_ptep, src_ptep, addr, prot);
> > 
> > I think I'd rather have the body of _temp_pgtable_map_pte() here and drop
> > the helper, because the helper does (pte_val(pte) | pgprot_val(prot))
> > which looks strange, until seeing here that 'pte' is only the address
> > bits, so OR'ing in new prot bits without clearing old prot bits makes
> > sense.
> we do not need to clear the old bits since we going to keep those bits but add new bits which are required for resume. Let's hold your question here but I will would like to see how Alex view it.

I confused myself a bit in my first read, so some of what I said isn't
relevant, but I still wonder why we don't want to be more explicit about
what prot bits are present in the end, and I still wonder why we need such
a simple helper function which is used in exactly one place. Indeed, the
pattern of all the other pgtable functions below is to put the set_p*
calls directly in the loop.
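Just to illustrate (untested, only reusing names already in this
patch), dropping the helper would leave the walk looking roughly like

	do {
		pte_t pte = READ_ONCE(*src_ptep);

		if (pte_present(pte))
			set_pte(dst_ptep, __pte(pte_val(pte) | pgprot_val(prot)));
	} while (dst_ptep++, src_ptep++, addr += PAGE_SIZE, addr < end);

and then the prot composition sits right next to the walk instead of
behind a helper.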

Thanks,
drew
Sia Jee Heng Feb. 24, 2023, 9:33 a.m. UTC | #5
> -----Original Message-----
> From: Andrew Jones <ajones@ventanamicro.com>
> Sent: Friday, 24 February, 2023 5:00 PM
> To: JeeHeng Sia <jeeheng.sia@starfivetech.com>
> Cc: paul.walmsley@sifive.com; palmer@dabbelt.com; aou@eecs.berkeley.edu; linux-riscv@lists.infradead.org; linux-
> kernel@vger.kernel.org; Leyfoon Tan <leyfoon.tan@starfivetech.com>; Mason Huo <mason.huo@starfivetech.com>
> Subject: Re: [PATCH v4 4/4] RISC-V: Add arch functions to support hibernation/suspend-to-disk
> 
> On Fri, Feb 24, 2023 at 02:05:43AM +0000, JeeHeng Sia wrote:
> >
> >
> > > -----Original Message-----
> > > From: Andrew Jones <ajones@ventanamicro.com>
> > > Sent: Friday, 24 February, 2023 2:07 AM
> > > To: JeeHeng Sia <jeeheng.sia@starfivetech.com>
> > > Cc: paul.walmsley@sifive.com; palmer@dabbelt.com; aou@eecs.berkeley.edu; linux-riscv@lists.infradead.org; linux-
> > > kernel@vger.kernel.org; Leyfoon Tan <leyfoon.tan@starfivetech.com>; Mason Huo <mason.huo@starfivetech.com>
> > > Subject: Re: [PATCH v4 4/4] RISC-V: Add arch functions to support hibernation/suspend-to-disk
> > >
> > > On Tue, Feb 21, 2023 at 10:35:23AM +0800, Sia Jee Heng wrote:
> > > > Low level Arch functions were created to support hibernation.
> > > > swsusp_arch_suspend() relies code from __cpu_suspend_enter() to write
> > > > cpu state onto the stack, then calling swsusp_save() to save the memory
> > > > image.
> > > >
> > > > Arch specific hibernation header is implemented and is utilized by the
> > > > arch_hibernation_header_restore() and arch_hibernation_header_save()
> > > > functions. The arch specific hibernation header consists of satp, hartid,
> > > > and the cpu_resume address. The kernel built version is also need to be
> > > > saved into the hibernation image header to making sure only the same
> > > > kernel is restore when resume.
> > > >
> > > > swsusp_arch_resume() creates a temporary page table that covering only
> > > > the linear map. It copies the restore code to a 'safe' page, then start
> > > > to restore the memory image. Once completed, it restores the original
> > > > kernel's page table. It then calls into __hibernate_cpu_resume()
> > > > to restore the CPU context. Finally, it follows the normal hibernation
> > > > path back to the hibernation core.
> > > >
> > > > To enable hibernation/suspend to disk into RISCV, the below config
> > > > need to be enabled:
> > > > - CONFIG_ARCH_HIBERNATION_HEADER
> > > > - CONFIG_ARCH_HIBERNATION_POSSIBLE
> > > >
> > > > Signed-off-by: Sia Jee Heng <jeeheng.sia@starfivetech.com>
> > > > Reviewed-by: Ley Foon Tan <leyfoon.tan@starfivetech.com>
> > > > Reviewed-by: Mason Huo <mason.huo@starfivetech.com>
> > > > ---
> > > >  arch/riscv/Kconfig                 |   7 +
> > > >  arch/riscv/include/asm/assembler.h |  20 ++
> > > >  arch/riscv/include/asm/suspend.h   |  19 ++
> > > >  arch/riscv/kernel/Makefile         |   1 +
> > > >  arch/riscv/kernel/asm-offsets.c    |   5 +
> > > >  arch/riscv/kernel/hibernate-asm.S  |  77 +++++
> > > >  arch/riscv/kernel/hibernate.c      | 447 +++++++++++++++++++++++++++++
> > > >  7 files changed, 576 insertions(+)
> > > >  create mode 100644 arch/riscv/kernel/hibernate-asm.S
> > > >  create mode 100644 arch/riscv/kernel/hibernate.c
> > > >
> > > > diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
> > > > index e2b656043abf..4555848a817f 100644
> > > > --- a/arch/riscv/Kconfig
> > > > +++ b/arch/riscv/Kconfig
> > > > @@ -690,6 +690,13 @@ menu "Power management options"
> > > >
> > > >  source "kernel/power/Kconfig"
> > > >
> > > > +config ARCH_HIBERNATION_POSSIBLE
> > > > +	def_bool y
> > > > +
> > > > +config ARCH_HIBERNATION_HEADER
> > > > +	def_bool y
> > > > +	depends on HIBERNATION
> > >
> > > nit: I think this can be simplified as def_bool HIBERNATION
> > good suggestion. will change it.
> > >
> > > > +
> > > >  endmenu # "Power management options"
> > > >
> > > >  menu "CPU Power Management"
> > > > diff --git a/arch/riscv/include/asm/assembler.h b/arch/riscv/include/asm/assembler.h
> > > > index 727a97735493..68c46c0e0ea8 100644
> > > > --- a/arch/riscv/include/asm/assembler.h
> > > > +++ b/arch/riscv/include/asm/assembler.h
> > > > @@ -59,4 +59,24 @@
> > > >  		REG_L	s11, (SUSPEND_CONTEXT_REGS + PT_S11)(a0)
> > > >  	.endm
> > > >
> > > > +/*
> > > > + * copy_page - copy 1 page (4KB) of data from source to destination
> > > > + * @a0 - destination
> > > > + * @a1 - source
> > > > + */
> > > > +	.macro	copy_page a0, a1
> > > > +		lui	a2, 0x1
> > > > +		add	a2, a2, a0
> > > > +1 :
> > >     ^ please remove this space
> > can't remove it otherwise checkpatch will throws ERROR: spaces required around that ':'
> 
> Oh, right, labels in macros have this requirement.
> 
> > >
> > > > +		REG_L	t0, 0(a1)
> > > > +		REG_L	t1, SZREG(a1)
> > > > +
> > > > +		REG_S	t0, 0(a0)
> > > > +		REG_S	t1, SZREG(a0)
> > > > +
> > > > +		addi	a0, a0, 2 * SZREG
> > > > +		addi	a1, a1, 2 * SZREG
> > > > +		bne	a2, a0, 1b
> > > > +	.endm
> > > > +
> > > >  #endif	/* __ASM_ASSEMBLER_H */
> > > > diff --git a/arch/riscv/include/asm/suspend.h b/arch/riscv/include/asm/suspend.h
> > > > index 75419c5ca272..3362da56a9d8 100644
> > > > --- a/arch/riscv/include/asm/suspend.h
> > > > +++ b/arch/riscv/include/asm/suspend.h
> > > > @@ -21,6 +21,11 @@ struct suspend_context {
> > > >  #endif
> > > >  };
> > > >
> > > > +/*
> > > > + * Used by hibernation core and cleared during resume sequence
> > > > + */
> > > > +extern int in_suspend;
> > > > +
> > > >  /* Low-level CPU suspend entry function */
> > > >  int __cpu_suspend_enter(struct suspend_context *context);
> > > >
> > > > @@ -36,4 +41,18 @@ int __cpu_resume_enter(unsigned long hartid, unsigned long context);
> > > >  /* Used to save and restore the csr */
> > > >  void suspend_save_csrs(struct suspend_context *context);
> > > >  void suspend_restore_csrs(struct suspend_context *context);
> > > > +
> > > > +/* Low-level API to support hibernation */
> > > > +int swsusp_arch_suspend(void);
> > > > +int swsusp_arch_resume(void);
> > > > +int arch_hibernation_header_save(void *addr, unsigned int max_size);
> > > > +int arch_hibernation_header_restore(void *addr);
> > > > +int __hibernate_cpu_resume(void);
> > > > +
> > > > +/* Used to resume on the CPU we hibernated on */
> > > > +int hibernate_resume_nonboot_cpu_disable(void);
> > > > +
> > > > +asmlinkage void hibernate_restore_image(unsigned long resume_satp, unsigned long satp_temp,
> > > > +					unsigned long cpu_resume);
> > > > +asmlinkage int hibernate_core_restore_code(void);
> > > >  #endif
> > > > diff --git a/arch/riscv/kernel/Makefile b/arch/riscv/kernel/Makefile
> > > > index 4cf303a779ab..daab341d55e4 100644
> > > > --- a/arch/riscv/kernel/Makefile
> > > > +++ b/arch/riscv/kernel/Makefile
> > > > @@ -64,6 +64,7 @@ obj-$(CONFIG_MODULES)		+= module.o
> > > >  obj-$(CONFIG_MODULE_SECTIONS)	+= module-sections.o
> > > >
> > > >  obj-$(CONFIG_CPU_PM)		+= suspend_entry.o suspend.o
> > > > +obj-$(CONFIG_HIBERNATION)	+= hibernate.o hibernate-asm.o
> > > >
> > > >  obj-$(CONFIG_FUNCTION_TRACER)	+= mcount.o ftrace.o
> > > >  obj-$(CONFIG_DYNAMIC_FTRACE)	+= mcount-dyn.o
> > > > diff --git a/arch/riscv/kernel/asm-offsets.c b/arch/riscv/kernel/asm-offsets.c
> > > > index df9444397908..d6a75aac1d27 100644
> > > > --- a/arch/riscv/kernel/asm-offsets.c
> > > > +++ b/arch/riscv/kernel/asm-offsets.c
> > > > @@ -9,6 +9,7 @@
> > > >  #include <linux/kbuild.h>
> > > >  #include <linux/mm.h>
> > > >  #include <linux/sched.h>
> > > > +#include <linux/suspend.h>
> > > >  #include <asm/kvm_host.h>
> > > >  #include <asm/thread_info.h>
> > > >  #include <asm/ptrace.h>
> > > > @@ -116,6 +117,10 @@ void asm_offsets(void)
> > > >
> > > >  	OFFSET(SUSPEND_CONTEXT_REGS, suspend_context, regs);
> > > >
> > > > +	OFFSET(HIBERN_PBE_ADDR, pbe, address);
> > > > +	OFFSET(HIBERN_PBE_ORIG, pbe, orig_address);
> > > > +	OFFSET(HIBERN_PBE_NEXT, pbe, next);
> > > > +
> > > >  	OFFSET(KVM_ARCH_GUEST_ZERO, kvm_vcpu_arch, guest_context.zero);
> > > >  	OFFSET(KVM_ARCH_GUEST_RA, kvm_vcpu_arch, guest_context.ra);
> > > >  	OFFSET(KVM_ARCH_GUEST_SP, kvm_vcpu_arch, guest_context.sp);
> > > > diff --git a/arch/riscv/kernel/hibernate-asm.S b/arch/riscv/kernel/hibernate-asm.S
> > > > new file mode 100644
> > > > index 000000000000..846affe4dced
> > > > --- /dev/null
> > > > +++ b/arch/riscv/kernel/hibernate-asm.S
> > > > @@ -0,0 +1,77 @@
> > > > +/* SPDX-License-Identifier: GPL-2.0-only */
> > > > +/*
> > > > + * Hibernation low level support for RISCV.
> > > > + *
> > > > + * Copyright (C) 2023 StarFive Technology Co., Ltd.
> > > > + *
> > > > + * Author: Jee Heng Sia <jeeheng.sia@starfivetech.com>
> > > > + */
> > > > +
> > > > +#include <asm/asm.h>
> > > > +#include <asm/asm-offsets.h>
> > > > +#include <asm/assembler.h>
> > > > +#include <asm/csr.h>
> > > > +
> > > > +#include <linux/linkage.h>
> > > > +
> > > > +/*
> > > > + * int __hibernate_cpu_resume(void)
> > > > + * Switch back to the hibernated image's page table prior to restoring the CPU
> > > > + * context.
> > > > + *
> > > > + * Always returns 0
> > > > + */
> > > > +ENTRY(__hibernate_cpu_resume)
> > > > +	/* switch to hibernated image's page table. */
> > > > +	csrw CSR_SATP, s0
> > > > +	sfence.vma
> > > > +
> > > > +	REG_L	a0, hibernate_cpu_context
> > > > +
> > > > +	restore_csr
> > > > +	restore_reg
> > > > +
> > > > +	/* Return zero value. */
> > > > +	add	a0, zero, zero
> > >
> > > nit: mv a0, zero
> > sure
> > >
> > > > +
> > > > +	ret
> > > > +END(__hibernate_cpu_resume)
> > > > +
> > > > +/*
> > > > + * Prepare to restore the image.
> > > > + * a0: satp of saved page tables.
> > > > + * a1: satp of temporary page tables.
> > > > + * a2: cpu_resume.
> > > > + */
> > > > +ENTRY(hibernate_restore_image)
> > > > +	mv	s0, a0
> > > > +	mv	s1, a1
> > > > +	mv	s2, a2
> > > > +	REG_L	s4, restore_pblist
> > > > +	REG_L	a1, relocated_restore_code
> > > > +
> > > > +	jalr	a1
> > > > +END(hibernate_restore_image)
> > > > +
> > > > +/*
> > > > + * The below code will be executed from a 'safe' page.
> > > > + * It first switches to the temporary page table, then starts to copy the pages
> > > > + * back to the original memory location. Finally, it jumps to __hibernate_cpu_resume()
> > > > + * to restore the CPU context.
> > > > + */
> > > > +ENTRY(hibernate_core_restore_code)
> > > > +	/* switch to temp page table. */
> > > > +	csrw satp, s1
> > > > +	sfence.vma
> > > > +.Lcopy:
> > > > +	/* The below code will restore the hibernated image. */
> > > > +	REG_L	a1, HIBERN_PBE_ADDR(s4)
> > > > +	REG_L	a0, HIBERN_PBE_ORIG(s4)
> > >
> > > Are we sure restore_pblist will never be NULL?
> > restore_pblist is a link-list, it will be null during initialization or during page clean up by hibernation core. During the initial resume
> process, the hibernation core will check the header and load the pages. If everything works correctly, the page will be linked to the
> restore_pblist and then invoke swsusp_arch_resume() else hibernation core will throws error and failed to resume from the
> hibernated image.
> 
> I know restore_pblist is a linked-list and this doesn't answer the
> question. The comment above restore_pblist says
> 
> /*
>  * List of PBEs needed for restoring the pages that were allocated before
>  * the suspend and included in the suspend image, but have also been
>  * allocated by the "resume" kernel, so their contents cannot be written
>  * directly to their "original" page frames.
>  */
> 
> which implies the pages that end up on this list are "special". My
> question is whether or not we're guaranteed to have at least one
> of these special pages. If not, we shouldn't assume s4 is non-null.
> If so, then a comment stating why that's guaranteed would be nice.
The restore_pblist will not be NULL; otherwise swsusp_arch_resume() wouldn't get invoked. You can find how the linked list is built and how the pages are checked for validity at https://elixir.bootlin.com/linux/v6.2-rc8/source/kernel/power/snapshot.c. As for "a comment stating why that's guaranteed would be nice": hmm, perhaps this is out of my scope, but I do trust the page validity checking in the code I linked.
> 
> > >
> > > > +
> > > > +	copy_page a0, a1
> > > > +
> > > > +	REG_L	s4, HIBERN_PBE_NEXT(s4)
> > > > +	bnez	s4, .Lcopy
> > > > +
> > > > +	jalr	s2
> > > > +END(hibernate_core_restore_code)
> > > > diff --git a/arch/riscv/kernel/hibernate.c b/arch/riscv/kernel/hibernate.c
> > > > new file mode 100644
> > > > index 000000000000..46a2f470db6e
> > > > --- /dev/null
> > > > +++ b/arch/riscv/kernel/hibernate.c
> > > > @@ -0,0 +1,447 @@
> > > > +// SPDX-License-Identifier: GPL-2.0-only
> > > > +/*
> > > > + * Hibernation support for RISCV
> > > > + *
> > > > + * Copyright (C) 2023 StarFive Technology Co., Ltd.
> > > > + *
> > > > + * Author: Jee Heng Sia <jeeheng.sia@starfivetech.com>
> > > > + */
> > > > +
> > > > +#include <asm/barrier.h>
> > > > +#include <asm/cacheflush.h>
> > > > +#include <asm/mmu_context.h>
> > > > +#include <asm/page.h>
> > > > +#include <asm/pgalloc.h>
> > > > +#include <asm/pgtable.h>
> > > > +#include <asm/sections.h>
> > > > +#include <asm/set_memory.h>
> > > > +#include <asm/smp.h>
> > > > +#include <asm/suspend.h>
> > > > +
> > > > +#include <linux/cpu.h>
> > > > +#include <linux/memblock.h>
> > > > +#include <linux/pm.h>
> > > > +#include <linux/sched.h>
> > > > +#include <linux/suspend.h>
> > > > +#include <linux/utsname.h>
> > > > +
> > > > +/* The logical cpu number we should resume on, initialised to a non-cpu number. */
> > > > +static int sleep_cpu = -EINVAL;
> > > > +
> > > > +/* Pointer to the temporary resume page table. */
> > > > +static pgd_t *resume_pg_dir;
> > > > +
> > > > +/* CPU context to be saved. */
> > > > +struct suspend_context *hibernate_cpu_context;
> > > > +EXPORT_SYMBOL_GPL(hibernate_cpu_context);
> > > > +
> > > > +unsigned long relocated_restore_code;
> > > > +EXPORT_SYMBOL_GPL(relocated_restore_code);
> > > > +
> > > > +/**
> > > > + * struct arch_hibernate_hdr_invariants - container to store kernel build version.
> > > > + * @uts_version: to save the build number and date so that the we do not resume with
> > > > + *		a different kernel.
> > > > + */
> > > > +struct arch_hibernate_hdr_invariants {
> > > > +	char		uts_version[__NEW_UTS_LEN + 1];
> > > > +};
> > > > +
> > > > +/**
> > > > + * struct arch_hibernate_hdr - helper parameters that help us to restore the image.
> > > > + * @invariants: container to store kernel build version.
> > > > + * @hartid: to make sure same boot_cpu executes the hibernate/restore code.
> > > > + * @saved_satp: original page table used by the hibernated image.
> > > > + * @restore_cpu_addr: the kernel's image address to restore the CPU context.
> > > > + */
> > > > +static struct arch_hibernate_hdr {
> > > > +	struct arch_hibernate_hdr_invariants invariants;
> > > > +	unsigned long	hartid;
> > > > +	unsigned long	saved_satp;
> > > > +	unsigned long	restore_cpu_addr;
> > > > +} resume_hdr;
> > > > +
> > > > +static inline void arch_hdr_invariants(struct arch_hibernate_hdr_invariants *i)
> > > > +{
> > > > +	memset(i, 0, sizeof(*i));
> > > > +	memcpy(i->uts_version, init_utsname()->version, sizeof(i->uts_version));
> > > > +}
> > > > +
> > > > +/*
> > > > + * Check if the given pfn is in the 'nosave' section.
> > > > + */
> > > > +int pfn_is_nosave(unsigned long pfn)
> > > > +{
> > > > +	unsigned long nosave_begin_pfn = sym_to_pfn(&__nosave_begin);
> > > > +	unsigned long nosave_end_pfn = sym_to_pfn(&__nosave_end - 1);
> > > > +
> > > > +	return ((pfn >= nosave_begin_pfn) && (pfn <= nosave_end_pfn));
> > > > +}
> > > > +
> > > > +void notrace save_processor_state(void)
> > > > +{
> > > > +	WARN_ON(num_online_cpus() != 1);
> > > > +}
> > > > +
> > > > +void notrace restore_processor_state(void)
> > > > +{
> > > > +}
> > > > +
> > > > +/*
> > > > + * Helper parameters need to be saved to the hibernation image header.
> > > > + */
> > > > +int arch_hibernation_header_save(void *addr, unsigned int max_size)
> > > > +{
> > > > +	struct arch_hibernate_hdr *hdr = addr;
> > > > +
> > > > +	if (max_size < sizeof(*hdr))
> > > > +		return -EOVERFLOW;
> > > > +
> > > > +	arch_hdr_invariants(&hdr->invariants);
> > > > +
> > > > +	hdr->hartid = cpuid_to_hartid_map(sleep_cpu);
> > > > +	hdr->saved_satp = csr_read(CSR_SATP);
> > > > +	hdr->restore_cpu_addr = (unsigned long)__hibernate_cpu_resume;
> > > > +
> > > > +	return 0;
> > > > +}
> > > > +EXPORT_SYMBOL_GPL(arch_hibernation_header_save);
> > > > +
> > > > +/*
> > > > + * Retrieve the helper parameters from the hibernation image header.
> > > > + */
> > > > +int arch_hibernation_header_restore(void *addr)
> > > > +{
> > > > +	struct arch_hibernate_hdr_invariants invariants;
> > > > +	struct arch_hibernate_hdr *hdr = addr;
> > > > +	int ret = 0;
> > > > +
> > > > +	arch_hdr_invariants(&invariants);
> > > > +
> > > > +	if (memcmp(&hdr->invariants, &invariants, sizeof(invariants))) {
> > > > +		pr_crit("Hibernate image not generated by this kernel!\n");
> > > > +		return -EINVAL;
> > > > +	}
> > > > +
> > > > +	sleep_cpu = riscv_hartid_to_cpuid(hdr->hartid);
> > > > +	if (sleep_cpu < 0) {
> > > > +		pr_crit("Hibernated on a CPU not known to this kernel!\n");
> > > > +		sleep_cpu = -EINVAL;
> > > > +		return -EINVAL;
> > > > +	}
> > > > +
> > > > +#ifdef CONFIG_SMP
> > > > +	ret = bringup_hibernate_cpu(sleep_cpu);
> > > > +	if (ret) {
> > > > +		sleep_cpu = -EINVAL;
> > > > +		return ret;
> > > > +	}
> > > > +#endif
> > > > +	resume_hdr = *hdr;
> > > > +
> > > > +	return ret;
> > > > +}
> > > > +EXPORT_SYMBOL_GPL(arch_hibernation_header_restore);
> > > > +
> > > > +int swsusp_arch_suspend(void)
> > > > +{
> > > > +	int ret = 0;
> > > > +
> > > > +	if (__cpu_suspend_enter(hibernate_cpu_context)) {
> > > > +		sleep_cpu = smp_processor_id();
> > > > +		suspend_save_csrs(hibernate_cpu_context);
> > > > +		ret = swsusp_save();
> > > > +	} else {
> > > > +		suspend_restore_csrs(hibernate_cpu_context);
> > > > +		flush_tlb_all();
> > > > +		flush_icache_all();
> > > > +
> > > > +		/*
> > > > +		 * Tell the hibernation core that we've just restored the memory.
> > > > +		 */
> > > > +		in_suspend = 0;
> > > > +		sleep_cpu = -EINVAL;
> > > > +	}
> > > > +
> > > > +	return ret;
> > > > +}
> > > > +
> > > > +static unsigned long _temp_pgtable_map_pte(pte_t *dst_ptep, pte_t *src_ptep,
> > > > +					   unsigned long addr, pgprot_t prot)
> > > > +{
> > > > +	pte_t pte = READ_ONCE(*src_ptep);
> > > > +
> > > > +	if (pte_present(pte))
> > > > +		set_pte(dst_ptep, __pte(pte_val(pte) | pgprot_val(prot)));
> > > > +
> > > > +	return 0;
> > > > +}
> > > > +
> > > > +static unsigned long temp_pgtable_map_pte(pmd_t *dst_pmdp, pmd_t *src_pmdp,
> > > > +					  unsigned long start, unsigned long end,
> > > > +					  pgprot_t prot)
> > > > +{
> > > > +	unsigned long addr = start;
> > > > +	pte_t *src_ptep;
> > > > +	pte_t *dst_ptep;
> > > > +
> > > > +	if (pmd_none(READ_ONCE(*dst_pmdp))) {
> > > > +		dst_ptep = (pte_t *)get_safe_page(GFP_ATOMIC);
> > > > +		if (!dst_ptep)
> > > > +			return -ENOMEM;
> > > > +
> > > > +		pmd_populate_kernel(NULL, dst_pmdp, dst_ptep);
> > > > +	}
> > > > +
> > > > +	dst_ptep = pte_offset_kernel(dst_pmdp, start);
> > > > +	src_ptep = pte_offset_kernel(src_pmdp, start);
> > > > +
> > > > +	do {
> > > > +		_temp_pgtable_map_pte(dst_ptep, src_ptep, addr, prot);
> > >
> > > I think I'd rather have the body of _temp_pgtable_map_pte() here and drop
> > > the helper, because the helper does (pte_val(pte) | pgprot_val(prot))
> > > which looks strange, until seeing here that 'pte' is only the address
> > > bits, so OR'ing in new prot bits without clearing old prot bits makes
> > > sense.
> > we do not need to clear the old bits since we going to keep those bits but add new bits which are required for resume. Let's hold
> your question here but I will would like to see how Alex view it.
> 
> I confused myself a bit in my first read, so some of what I said isn't
> relevant, but I still wonder why we don't want to be more explicit about
> what prot bits are present in the end, and I still wonder why we need such
> a simple helper function which is used in exactly one place. Indeed, the
> pattern of all the other pgtable functions below is to put the set_p*
> calls directly in the loop.
I am sorry if I confused you, but what I meant is that I would like to consolidate all comments from the other reviewers before providing the best solution. There is no doubt that your comment is valid.
> 
> Thanks,
> drew
Andrew Jones Feb. 24, 2023, 9:55 a.m. UTC | #6
On Fri, Feb 24, 2023 at 09:33:31AM +0000, JeeHeng Sia wrote:
> 
> 
> > -----Original Message-----
> > From: Andrew Jones <ajones@ventanamicro.com>
> > Sent: Friday, 24 February, 2023 5:00 PM
> > To: JeeHeng Sia <jeeheng.sia@starfivetech.com>
> > Cc: paul.walmsley@sifive.com; palmer@dabbelt.com; aou@eecs.berkeley.edu; linux-riscv@lists.infradead.org; linux-
> > kernel@vger.kernel.org; Leyfoon Tan <leyfoon.tan@starfivetech.com>; Mason Huo <mason.huo@starfivetech.com>
> > Subject: Re: [PATCH v4 4/4] RISC-V: Add arch functions to support hibernation/suspend-to-disk
> > 
> > On Fri, Feb 24, 2023 at 02:05:43AM +0000, JeeHeng Sia wrote:
> > >
> > >
> > > > -----Original Message-----
> > > > From: Andrew Jones <ajones@ventanamicro.com>
> > > > Sent: Friday, 24 February, 2023 2:07 AM
> > > > To: JeeHeng Sia <jeeheng.sia@starfivetech.com>
> > > > Cc: paul.walmsley@sifive.com; palmer@dabbelt.com; aou@eecs.berkeley.edu; linux-riscv@lists.infradead.org; linux-
> > > > kernel@vger.kernel.org; Leyfoon Tan <leyfoon.tan@starfivetech.com>; Mason Huo <mason.huo@starfivetech.com>
> > > > Subject: Re: [PATCH v4 4/4] RISC-V: Add arch functions to support hibernation/suspend-to-disk
> > > >
> > > > On Tue, Feb 21, 2023 at 10:35:23AM +0800, Sia Jee Heng wrote:
> > > > > Low level Arch functions were created to support hibernation.
> > > > > swsusp_arch_suspend() relies code from __cpu_suspend_enter() to write
> > > > > cpu state onto the stack, then calling swsusp_save() to save the memory
> > > > > image.
> > > > >
> > > > > Arch specific hibernation header is implemented and is utilized by the
> > > > > arch_hibernation_header_restore() and arch_hibernation_header_save()
> > > > > functions. The arch specific hibernation header consists of satp, hartid,
> > > > > and the cpu_resume address. The kernel built version is also need to be
> > > > > saved into the hibernation image header to making sure only the same
> > > > > kernel is restore when resume.
> > > > >
> > > > > swsusp_arch_resume() creates a temporary page table that covering only
> > > > > the linear map. It copies the restore code to a 'safe' page, then start
> > > > > to restore the memory image. Once completed, it restores the original
> > > > > kernel's page table. It then calls into __hibernate_cpu_resume()
> > > > > to restore the CPU context. Finally, it follows the normal hibernation
> > > > > path back to the hibernation core.
> > > > >
> > > > > To enable hibernation/suspend to disk into RISCV, the below config
> > > > > need to be enabled:
> > > > > - CONFIG_ARCH_HIBERNATION_HEADER
> > > > > - CONFIG_ARCH_HIBERNATION_POSSIBLE
> > > > >
> > > > > Signed-off-by: Sia Jee Heng <jeeheng.sia@starfivetech.com>
> > > > > Reviewed-by: Ley Foon Tan <leyfoon.tan@starfivetech.com>
> > > > > Reviewed-by: Mason Huo <mason.huo@starfivetech.com>
> > > > > ---
> > > > >  arch/riscv/Kconfig                 |   7 +
> > > > >  arch/riscv/include/asm/assembler.h |  20 ++
> > > > >  arch/riscv/include/asm/suspend.h   |  19 ++
> > > > >  arch/riscv/kernel/Makefile         |   1 +
> > > > >  arch/riscv/kernel/asm-offsets.c    |   5 +
> > > > >  arch/riscv/kernel/hibernate-asm.S  |  77 +++++
> > > > >  arch/riscv/kernel/hibernate.c      | 447 +++++++++++++++++++++++++++++
> > > > >  7 files changed, 576 insertions(+)
> > > > >  create mode 100644 arch/riscv/kernel/hibernate-asm.S
> > > > >  create mode 100644 arch/riscv/kernel/hibernate.c
> > > > >
> > > > > diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
> > > > > index e2b656043abf..4555848a817f 100644
> > > > > --- a/arch/riscv/Kconfig
> > > > > +++ b/arch/riscv/Kconfig
> > > > > @@ -690,6 +690,13 @@ menu "Power management options"
> > > > >
> > > > >  source "kernel/power/Kconfig"
> > > > >
> > > > > +config ARCH_HIBERNATION_POSSIBLE
> > > > > +	def_bool y
> > > > > +
> > > > > +config ARCH_HIBERNATION_HEADER
> > > > > +	def_bool y
> > > > > +	depends on HIBERNATION
> > > >
> > > > nit: I think this can be simplified as def_bool HIBERNATION
> > > good suggestion. will change it.
> > > >
> > > > > +
> > > > >  endmenu # "Power management options"
> > > > >
> > > > >  menu "CPU Power Management"
> > > > > diff --git a/arch/riscv/include/asm/assembler.h b/arch/riscv/include/asm/assembler.h
> > > > > index 727a97735493..68c46c0e0ea8 100644
> > > > > --- a/arch/riscv/include/asm/assembler.h
> > > > > +++ b/arch/riscv/include/asm/assembler.h
> > > > > @@ -59,4 +59,24 @@
> > > > >  		REG_L	s11, (SUSPEND_CONTEXT_REGS + PT_S11)(a0)
> > > > >  	.endm
> > > > >
> > > > > +/*
> > > > > + * copy_page - copy 1 page (4KB) of data from source to destination
> > > > > + * @a0 - destination
> > > > > + * @a1 - source
> > > > > + */
> > > > > +	.macro	copy_page a0, a1
> > > > > +		lui	a2, 0x1
> > > > > +		add	a2, a2, a0
> > > > > +1 :
> > > >     ^ please remove this space
> > > can't remove it otherwise checkpatch will throws ERROR: spaces required around that ':'
> > 
> > Oh, right, labels in macros have this requirement.
> > 
> > > >
> > > > > +		REG_L	t0, 0(a1)
> > > > > +		REG_L	t1, SZREG(a1)
> > > > > +
> > > > > +		REG_S	t0, 0(a0)
> > > > > +		REG_S	t1, SZREG(a0)
> > > > > +
> > > > > +		addi	a0, a0, 2 * SZREG
> > > > > +		addi	a1, a1, 2 * SZREG
> > > > > +		bne	a2, a0, 1b
> > > > > +	.endm
> > > > > +
> > > > >  #endif	/* __ASM_ASSEMBLER_H */
> > > > > diff --git a/arch/riscv/include/asm/suspend.h b/arch/riscv/include/asm/suspend.h
> > > > > index 75419c5ca272..3362da56a9d8 100644
> > > > > --- a/arch/riscv/include/asm/suspend.h
> > > > > +++ b/arch/riscv/include/asm/suspend.h
> > > > > @@ -21,6 +21,11 @@ struct suspend_context {
> > > > >  #endif
> > > > >  };
> > > > >
> > > > > +/*
> > > > > + * Used by hibernation core and cleared during resume sequence
> > > > > + */
> > > > > +extern int in_suspend;
> > > > > +
> > > > >  /* Low-level CPU suspend entry function */
> > > > >  int __cpu_suspend_enter(struct suspend_context *context);
> > > > >
> > > > > @@ -36,4 +41,18 @@ int __cpu_resume_enter(unsigned long hartid, unsigned long context);
> > > > >  /* Used to save and restore the csr */
> > > > >  void suspend_save_csrs(struct suspend_context *context);
> > > > >  void suspend_restore_csrs(struct suspend_context *context);
> > > > > +
> > > > > +/* Low-level API to support hibernation */
> > > > > +int swsusp_arch_suspend(void);
> > > > > +int swsusp_arch_resume(void);
> > > > > +int arch_hibernation_header_save(void *addr, unsigned int max_size);
> > > > > +int arch_hibernation_header_restore(void *addr);
> > > > > +int __hibernate_cpu_resume(void);
> > > > > +
> > > > > +/* Used to resume on the CPU we hibernated on */
> > > > > +int hibernate_resume_nonboot_cpu_disable(void);
> > > > > +
> > > > > +asmlinkage void hibernate_restore_image(unsigned long resume_satp, unsigned long satp_temp,
> > > > > +					unsigned long cpu_resume);
> > > > > +asmlinkage int hibernate_core_restore_code(void);
> > > > >  #endif
> > > > > diff --git a/arch/riscv/kernel/Makefile b/arch/riscv/kernel/Makefile
> > > > > index 4cf303a779ab..daab341d55e4 100644
> > > > > --- a/arch/riscv/kernel/Makefile
> > > > > +++ b/arch/riscv/kernel/Makefile
> > > > > @@ -64,6 +64,7 @@ obj-$(CONFIG_MODULES)		+= module.o
> > > > >  obj-$(CONFIG_MODULE_SECTIONS)	+= module-sections.o
> > > > >
> > > > >  obj-$(CONFIG_CPU_PM)		+= suspend_entry.o suspend.o
> > > > > +obj-$(CONFIG_HIBERNATION)	+= hibernate.o hibernate-asm.o
> > > > >
> > > > >  obj-$(CONFIG_FUNCTION_TRACER)	+= mcount.o ftrace.o
> > > > >  obj-$(CONFIG_DYNAMIC_FTRACE)	+= mcount-dyn.o
> > > > > diff --git a/arch/riscv/kernel/asm-offsets.c b/arch/riscv/kernel/asm-offsets.c
> > > > > index df9444397908..d6a75aac1d27 100644
> > > > > --- a/arch/riscv/kernel/asm-offsets.c
> > > > > +++ b/arch/riscv/kernel/asm-offsets.c
> > > > > @@ -9,6 +9,7 @@
> > > > >  #include <linux/kbuild.h>
> > > > >  #include <linux/mm.h>
> > > > >  #include <linux/sched.h>
> > > > > +#include <linux/suspend.h>
> > > > >  #include <asm/kvm_host.h>
> > > > >  #include <asm/thread_info.h>
> > > > >  #include <asm/ptrace.h>
> > > > > @@ -116,6 +117,10 @@ void asm_offsets(void)
> > > > >
> > > > >  	OFFSET(SUSPEND_CONTEXT_REGS, suspend_context, regs);
> > > > >
> > > > > +	OFFSET(HIBERN_PBE_ADDR, pbe, address);
> > > > > +	OFFSET(HIBERN_PBE_ORIG, pbe, orig_address);
> > > > > +	OFFSET(HIBERN_PBE_NEXT, pbe, next);
> > > > > +
> > > > >  	OFFSET(KVM_ARCH_GUEST_ZERO, kvm_vcpu_arch, guest_context.zero);
> > > > >  	OFFSET(KVM_ARCH_GUEST_RA, kvm_vcpu_arch, guest_context.ra);
> > > > >  	OFFSET(KVM_ARCH_GUEST_SP, kvm_vcpu_arch, guest_context.sp);
> > > > > diff --git a/arch/riscv/kernel/hibernate-asm.S b/arch/riscv/kernel/hibernate-asm.S
> > > > > new file mode 100644
> > > > > index 000000000000..846affe4dced
> > > > > --- /dev/null
> > > > > +++ b/arch/riscv/kernel/hibernate-asm.S
> > > > > @@ -0,0 +1,77 @@
> > > > > +/* SPDX-License-Identifier: GPL-2.0-only */
> > > > > +/*
> > > > > + * Hibernation low level support for RISCV.
> > > > > + *
> > > > > + * Copyright (C) 2023 StarFive Technology Co., Ltd.
> > > > > + *
> > > > > + * Author: Jee Heng Sia <jeeheng.sia@starfivetech.com>
> > > > > + */
> > > > > +
> > > > > +#include <asm/asm.h>
> > > > > +#include <asm/asm-offsets.h>
> > > > > +#include <asm/assembler.h>
> > > > > +#include <asm/csr.h>
> > > > > +
> > > > > +#include <linux/linkage.h>
> > > > > +
> > > > > +/*
> > > > > + * int __hibernate_cpu_resume(void)
> > > > > + * Switch back to the hibernated image's page table prior to restoring the CPU
> > > > > + * context.
> > > > > + *
> > > > > + * Always returns 0
> > > > > + */
> > > > > +ENTRY(__hibernate_cpu_resume)
> > > > > +	/* switch to hibernated image's page table. */
> > > > > +	csrw CSR_SATP, s0
> > > > > +	sfence.vma
> > > > > +
> > > > > +	REG_L	a0, hibernate_cpu_context
> > > > > +
> > > > > +	restore_csr
> > > > > +	restore_reg
> > > > > +
> > > > > +	/* Return zero value. */
> > > > > +	add	a0, zero, zero
> > > >
> > > > nit: mv a0, zero
> > > sure
> > > >
> > > > > +
> > > > > +	ret
> > > > > +END(__hibernate_cpu_resume)
> > > > > +
> > > > > +/*
> > > > > + * Prepare to restore the image.
> > > > > + * a0: satp of saved page tables.
> > > > > + * a1: satp of temporary page tables.
> > > > > + * a2: cpu_resume.
> > > > > + */
> > > > > +ENTRY(hibernate_restore_image)
> > > > > +	mv	s0, a0
> > > > > +	mv	s1, a1
> > > > > +	mv	s2, a2
> > > > > +	REG_L	s4, restore_pblist
> > > > > +	REG_L	a1, relocated_restore_code
> > > > > +
> > > > > +	jalr	a1
> > > > > +END(hibernate_restore_image)
> > > > > +
> > > > > +/*
> > > > > + * The below code will be executed from a 'safe' page.
> > > > > + * It first switches to the temporary page table, then starts to copy the pages
> > > > > + * back to the original memory location. Finally, it jumps to __hibernate_cpu_resume()
> > > > > + * to restore the CPU context.
> > > > > + */
> > > > > +ENTRY(hibernate_core_restore_code)
> > > > > +	/* switch to temp page table. */
> > > > > +	csrw satp, s1
> > > > > +	sfence.vma
> > > > > +.Lcopy:
> > > > > +	/* The below code will restore the hibernated image. */
> > > > > +	REG_L	a1, HIBERN_PBE_ADDR(s4)
> > > > > +	REG_L	a0, HIBERN_PBE_ORIG(s4)
> > > >
> > > > Are we sure restore_pblist will never be NULL?
> > > restore_pblist is a link-list, it will be null during initialization or during page clean up by hibernation core. During the initial resume
> > process, the hibernation core will check the header and load the pages. If everything works correctly, the page will be linked to the
> > restore_pblist and then invoke swsusp_arch_resume() else hibernation core will throws error and failed to resume from the
> > hibernated image.
> > 
> > I know restore_pblist is a linked-list and this doesn't answer the
> > question. The comment above restore_pblist says
> > 
> > /*
> >  * List of PBEs needed for restoring the pages that were allocated before
> >  * the suspend and included in the suspend image, but have also been
> >  * allocated by the "resume" kernel, so their contents cannot be written
> >  * directly to their "original" page frames.
> >  */
> > 
> > which implies the pages that end up on this list are "special". My
> > question is whether or not we're guaranteed to have at least one
> > of these special pages. If not, we shouldn't assume s4 is non-null.
> > If so, then a comment stating why that's guaranteed would be nice.
> The restore_pblist will not be null otherwise swsusp_arch_resume wouldn't get invoked. you can find how the link-list are link and how it checks against validity at https://elixir.bootlin.com/linux/v6.2-rc8/source/kernel/power/snapshot.c . " A comment stating why that's guaranteed would be nice" ? Hmm, perhaps this is out of my scope but I do believe in the page validity checking in the link I shared.

Sorry, but pointing to an entire source file (one that I've obviously
already looked at, since I quoted a comment from it...) is not helpful.
I don't see where restore_pblist is being checked before
swsusp_arch_resume() is issued (from its callsite in hibernate.c).

Thanks,
drew
Sia Jee Heng Feb. 24, 2023, 10:30 a.m. UTC | #7
> -----Original Message-----
> From: Andrew Jones <ajones@ventanamicro.com>
> Sent: Friday, 24 February, 2023 5:55 PM
> To: JeeHeng Sia <jeeheng.sia@starfivetech.com>
> Cc: paul.walmsley@sifive.com; palmer@dabbelt.com; aou@eecs.berkeley.edu; linux-riscv@lists.infradead.org; linux-
> kernel@vger.kernel.org; Leyfoon Tan <leyfoon.tan@starfivetech.com>; Mason Huo <mason.huo@starfivetech.com>
> Subject: Re: [PATCH v4 4/4] RISC-V: Add arch functions to support hibernation/suspend-to-disk
> 
> On Fri, Feb 24, 2023 at 09:33:31AM +0000, JeeHeng Sia wrote:
> >
> >
> > > -----Original Message-----
> > > From: Andrew Jones <ajones@ventanamicro.com>
> > > Sent: Friday, 24 February, 2023 5:00 PM
> > > To: JeeHeng Sia <jeeheng.sia@starfivetech.com>
> > > Cc: paul.walmsley@sifive.com; palmer@dabbelt.com; aou@eecs.berkeley.edu; linux-riscv@lists.infradead.org; linux-
> > > kernel@vger.kernel.org; Leyfoon Tan <leyfoon.tan@starfivetech.com>; Mason Huo <mason.huo@starfivetech.com>
> > > Subject: Re: [PATCH v4 4/4] RISC-V: Add arch functions to support hibernation/suspend-to-disk
> > >
> > > On Fri, Feb 24, 2023 at 02:05:43AM +0000, JeeHeng Sia wrote:
> > > >
> > > >
> > > > > -----Original Message-----
> > > > > From: Andrew Jones <ajones@ventanamicro.com>
> > > > > Sent: Friday, 24 February, 2023 2:07 AM
> > > > > To: JeeHeng Sia <jeeheng.sia@starfivetech.com>
> > > > > Cc: paul.walmsley@sifive.com; palmer@dabbelt.com; aou@eecs.berkeley.edu; linux-riscv@lists.infradead.org; linux-
> > > > > kernel@vger.kernel.org; Leyfoon Tan <leyfoon.tan@starfivetech.com>; Mason Huo <mason.huo@starfivetech.com>
> > > > > Subject: Re: [PATCH v4 4/4] RISC-V: Add arch functions to support hibernation/suspend-to-disk
> > > > >
> > > > > On Tue, Feb 21, 2023 at 10:35:23AM +0800, Sia Jee Heng wrote:
> > > > > > Low level Arch functions were created to support hibernation.
> > > > > > swsusp_arch_suspend() relies code from __cpu_suspend_enter() to write
> > > > > > cpu state onto the stack, then calling swsusp_save() to save the memory
> > > > > > image.
> > > > > >
> > > > > > Arch specific hibernation header is implemented and is utilized by the
> > > > > > arch_hibernation_header_restore() and arch_hibernation_header_save()
> > > > > > functions. The arch specific hibernation header consists of satp, hartid,
> > > > > > and the cpu_resume address. The kernel built version is also need to be
> > > > > > saved into the hibernation image header to making sure only the same
> > > > > > kernel is restore when resume.
> > > > > >
> > > > > > swsusp_arch_resume() creates a temporary page table that covering only
> > > > > > the linear map. It copies the restore code to a 'safe' page, then start
> > > > > > to restore the memory image. Once completed, it restores the original
> > > > > > kernel's page table. It then calls into __hibernate_cpu_resume()
> > > > > > to restore the CPU context. Finally, it follows the normal hibernation
> > > > > > path back to the hibernation core.
> > > > > >
> > > > > > To enable hibernation/suspend to disk into RISCV, the below config
> > > > > > need to be enabled:
> > > > > > - CONFIG_ARCH_HIBERNATION_HEADER
> > > > > > - CONFIG_ARCH_HIBERNATION_POSSIBLE
> > > > > >
> > > > > > Signed-off-by: Sia Jee Heng <jeeheng.sia@starfivetech.com>
> > > > > > Reviewed-by: Ley Foon Tan <leyfoon.tan@starfivetech.com>
> > > > > > Reviewed-by: Mason Huo <mason.huo@starfivetech.com>
> > > > > > ---
> > > > > >  arch/riscv/Kconfig                 |   7 +
> > > > > >  arch/riscv/include/asm/assembler.h |  20 ++
> > > > > >  arch/riscv/include/asm/suspend.h   |  19 ++
> > > > > >  arch/riscv/kernel/Makefile         |   1 +
> > > > > >  arch/riscv/kernel/asm-offsets.c    |   5 +
> > > > > >  arch/riscv/kernel/hibernate-asm.S  |  77 +++++
> > > > > >  arch/riscv/kernel/hibernate.c      | 447 +++++++++++++++++++++++++++++
> > > > > >  7 files changed, 576 insertions(+)
> > > > > >  create mode 100644 arch/riscv/kernel/hibernate-asm.S
> > > > > >  create mode 100644 arch/riscv/kernel/hibernate.c
> > > > > >
> > > > > > diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
> > > > > > index e2b656043abf..4555848a817f 100644
> > > > > > --- a/arch/riscv/Kconfig
> > > > > > +++ b/arch/riscv/Kconfig
> > > > > > @@ -690,6 +690,13 @@ menu "Power management options"
> > > > > >
> > > > > >  source "kernel/power/Kconfig"
> > > > > >
> > > > > > +config ARCH_HIBERNATION_POSSIBLE
> > > > > > +	def_bool y
> > > > > > +
> > > > > > +config ARCH_HIBERNATION_HEADER
> > > > > > +	def_bool y
> > > > > > +	depends on HIBERNATION
> > > > >
> > > > > nit: I think this can be simplified as def_bool HIBERNATION
> > > > good suggestion. will change it.
> > > > >
> > > > > > +
> > > > > >  endmenu # "Power management options"
> > > > > >
> > > > > >  menu "CPU Power Management"
> > > > > > diff --git a/arch/riscv/include/asm/assembler.h b/arch/riscv/include/asm/assembler.h
> > > > > > index 727a97735493..68c46c0e0ea8 100644
> > > > > > --- a/arch/riscv/include/asm/assembler.h
> > > > > > +++ b/arch/riscv/include/asm/assembler.h
> > > > > > @@ -59,4 +59,24 @@
> > > > > >  		REG_L	s11, (SUSPEND_CONTEXT_REGS + PT_S11)(a0)
> > > > > >  	.endm
> > > > > >
> > > > > > +/*
> > > > > > + * copy_page - copy 1 page (4KB) of data from source to destination
> > > > > > + * @a0 - destination
> > > > > > + * @a1 - source
> > > > > > + */
> > > > > > +	.macro	copy_page a0, a1
> > > > > > +		lui	a2, 0x1
> > > > > > +		add	a2, a2, a0
> > > > > > +1 :
> > > > >     ^ please remove this space
> > > > can't remove it otherwise checkpatch will throws ERROR: spaces required around that ':'
> > >
> > > Oh, right, labels in macros have this requirement.
> > >
> > > > >
> > > > > > +		REG_L	t0, 0(a1)
> > > > > > +		REG_L	t1, SZREG(a1)
> > > > > > +
> > > > > > +		REG_S	t0, 0(a0)
> > > > > > +		REG_S	t1, SZREG(a0)
> > > > > > +
> > > > > > +		addi	a0, a0, 2 * SZREG
> > > > > > +		addi	a1, a1, 2 * SZREG
> > > > > > +		bne	a2, a0, 1b
> > > > > > +	.endm
> > > > > > +
> > > > > >  #endif	/* __ASM_ASSEMBLER_H */
> > > > > > diff --git a/arch/riscv/include/asm/suspend.h b/arch/riscv/include/asm/suspend.h
> > > > > > index 75419c5ca272..3362da56a9d8 100644
> > > > > > --- a/arch/riscv/include/asm/suspend.h
> > > > > > +++ b/arch/riscv/include/asm/suspend.h
> > > > > > @@ -21,6 +21,11 @@ struct suspend_context {
> > > > > >  #endif
> > > > > >  };
> > > > > >
> > > > > > +/*
> > > > > > + * Used by hibernation core and cleared during resume sequence
> > > > > > + */
> > > > > > +extern int in_suspend;
> > > > > > +
> > > > > >  /* Low-level CPU suspend entry function */
> > > > > >  int __cpu_suspend_enter(struct suspend_context *context);
> > > > > >
> > > > > > @@ -36,4 +41,18 @@ int __cpu_resume_enter(unsigned long hartid, unsigned long context);
> > > > > >  /* Used to save and restore the csr */
> > > > > >  void suspend_save_csrs(struct suspend_context *context);
> > > > > >  void suspend_restore_csrs(struct suspend_context *context);
> > > > > > +
> > > > > > +/* Low-level API to support hibernation */
> > > > > > +int swsusp_arch_suspend(void);
> > > > > > +int swsusp_arch_resume(void);
> > > > > > +int arch_hibernation_header_save(void *addr, unsigned int max_size);
> > > > > > +int arch_hibernation_header_restore(void *addr);
> > > > > > +int __hibernate_cpu_resume(void);
> > > > > > +
> > > > > > +/* Used to resume on the CPU we hibernated on */
> > > > > > +int hibernate_resume_nonboot_cpu_disable(void);
> > > > > > +
> > > > > > +asmlinkage void hibernate_restore_image(unsigned long resume_satp, unsigned long satp_temp,
> > > > > > +					unsigned long cpu_resume);
> > > > > > +asmlinkage int hibernate_core_restore_code(void);
> > > > > >  #endif
> > > > > > diff --git a/arch/riscv/kernel/Makefile b/arch/riscv/kernel/Makefile
> > > > > > index 4cf303a779ab..daab341d55e4 100644
> > > > > > --- a/arch/riscv/kernel/Makefile
> > > > > > +++ b/arch/riscv/kernel/Makefile
> > > > > > @@ -64,6 +64,7 @@ obj-$(CONFIG_MODULES)		+= module.o
> > > > > >  obj-$(CONFIG_MODULE_SECTIONS)	+= module-sections.o
> > > > > >
> > > > > >  obj-$(CONFIG_CPU_PM)		+= suspend_entry.o suspend.o
> > > > > > +obj-$(CONFIG_HIBERNATION)	+= hibernate.o hibernate-asm.o
> > > > > >
> > > > > >  obj-$(CONFIG_FUNCTION_TRACER)	+= mcount.o ftrace.o
> > > > > >  obj-$(CONFIG_DYNAMIC_FTRACE)	+= mcount-dyn.o
> > > > > > diff --git a/arch/riscv/kernel/asm-offsets.c b/arch/riscv/kernel/asm-offsets.c
> > > > > > index df9444397908..d6a75aac1d27 100644
> > > > > > --- a/arch/riscv/kernel/asm-offsets.c
> > > > > > +++ b/arch/riscv/kernel/asm-offsets.c
> > > > > > @@ -9,6 +9,7 @@
> > > > > >  #include <linux/kbuild.h>
> > > > > >  #include <linux/mm.h>
> > > > > >  #include <linux/sched.h>
> > > > > > +#include <linux/suspend.h>
> > > > > >  #include <asm/kvm_host.h>
> > > > > >  #include <asm/thread_info.h>
> > > > > >  #include <asm/ptrace.h>
> > > > > > @@ -116,6 +117,10 @@ void asm_offsets(void)
> > > > > >
> > > > > >  	OFFSET(SUSPEND_CONTEXT_REGS, suspend_context, regs);
> > > > > >
> > > > > > +	OFFSET(HIBERN_PBE_ADDR, pbe, address);
> > > > > > +	OFFSET(HIBERN_PBE_ORIG, pbe, orig_address);
> > > > > > +	OFFSET(HIBERN_PBE_NEXT, pbe, next);
> > > > > > +
> > > > > >  	OFFSET(KVM_ARCH_GUEST_ZERO, kvm_vcpu_arch, guest_context.zero);
> > > > > >  	OFFSET(KVM_ARCH_GUEST_RA, kvm_vcpu_arch, guest_context.ra);
> > > > > >  	OFFSET(KVM_ARCH_GUEST_SP, kvm_vcpu_arch, guest_context.sp);
> > > > > > diff --git a/arch/riscv/kernel/hibernate-asm.S b/arch/riscv/kernel/hibernate-asm.S
> > > > > > new file mode 100644
> > > > > > index 000000000000..846affe4dced
> > > > > > --- /dev/null
> > > > > > +++ b/arch/riscv/kernel/hibernate-asm.S
> > > > > > @@ -0,0 +1,77 @@
> > > > > > +/* SPDX-License-Identifier: GPL-2.0-only */
> > > > > > +/*
> > > > > > + * Hibernation low level support for RISCV.
> > > > > > + *
> > > > > > + * Copyright (C) 2023 StarFive Technology Co., Ltd.
> > > > > > + *
> > > > > > + * Author: Jee Heng Sia <jeeheng.sia@starfivetech.com>
> > > > > > + */
> > > > > > +
> > > > > > +#include <asm/asm.h>
> > > > > > +#include <asm/asm-offsets.h>
> > > > > > +#include <asm/assembler.h>
> > > > > > +#include <asm/csr.h>
> > > > > > +
> > > > > > +#include <linux/linkage.h>
> > > > > > +
> > > > > > +/*
> > > > > > + * int __hibernate_cpu_resume(void)
> > > > > > + * Switch back to the hibernated image's page table prior to restoring the CPU
> > > > > > + * context.
> > > > > > + *
> > > > > > + * Always returns 0
> > > > > > + */
> > > > > > +ENTRY(__hibernate_cpu_resume)
> > > > > > +	/* switch to hibernated image's page table. */
> > > > > > +	csrw CSR_SATP, s0
> > > > > > +	sfence.vma
> > > > > > +
> > > > > > +	REG_L	a0, hibernate_cpu_context
> > > > > > +
> > > > > > +	restore_csr
> > > > > > +	restore_reg
> > > > > > +
> > > > > > +	/* Return zero value. */
> > > > > > +	add	a0, zero, zero
> > > > >
> > > > > nit: mv a0, zero
> > > > sure
> > > > >
> > > > > > +
> > > > > > +	ret
> > > > > > +END(__hibernate_cpu_resume)
> > > > > > +
> > > > > > +/*
> > > > > > + * Prepare to restore the image.
> > > > > > + * a0: satp of saved page tables.
> > > > > > + * a1: satp of temporary page tables.
> > > > > > + * a2: cpu_resume.
> > > > > > + */
> > > > > > +ENTRY(hibernate_restore_image)
> > > > > > +	mv	s0, a0
> > > > > > +	mv	s1, a1
> > > > > > +	mv	s2, a2
> > > > > > +	REG_L	s4, restore_pblist
> > > > > > +	REG_L	a1, relocated_restore_code
> > > > > > +
> > > > > > +	jalr	a1
> > > > > > +END(hibernate_restore_image)
> > > > > > +
> > > > > > +/*
> > > > > > + * The below code will be executed from a 'safe' page.
> > > > > > + * It first switches to the temporary page table, then starts to copy the pages
> > > > > > + * back to the original memory location. Finally, it jumps to __hibernate_cpu_resume()
> > > > > > + * to restore the CPU context.
> > > > > > + */
> > > > > > +ENTRY(hibernate_core_restore_code)
> > > > > > +	/* switch to temp page table. */
> > > > > > +	csrw satp, s1
> > > > > > +	sfence.vma
> > > > > > +.Lcopy:
> > > > > > +	/* The below code will restore the hibernated image. */
> > > > > > +	REG_L	a1, HIBERN_PBE_ADDR(s4)
> > > > > > +	REG_L	a0, HIBERN_PBE_ORIG(s4)
> > > > >
> > > > > Are we sure restore_pblist will never be NULL?
> > > > restore_pblist is a link-list, it will be null during initialization or during page clean up by hibernation core. During the initial
> resume
> > > process, the hibernation core will check the header and load the pages. If everything works correctly, the page will be linked to the
> > > restore_pblist and then invoke swsusp_arch_resume() else hibernation core will throws error and failed to resume from the
> > > hibernated image.
> > >
> > > I know restore_pblist is a linked-list and this doesn't answer the
> > > question. The comment above restore_pblist says
> > >
> > > /*
> > >  * List of PBEs needed for restoring the pages that were allocated before
> > >  * the suspend and included in the suspend image, but have also been
> > >  * allocated by the "resume" kernel, so their contents cannot be written
> > >  * directly to their "original" page frames.
> > >  */
> > >
> > > which implies the pages that end up on this list are "special". My
> > > question is whether or not we're guaranteed to have at least one
> > > of these special pages. If not, we shouldn't assume s4 is non-null.
> > > If so, then a comment stating why that's guaranteed would be nice.
> > The restore_pblist will not be null otherwise swsusp_arch_resume wouldn't get invoked. you can find how the link-list are link and
> how it checks against validity at https://elixir.bootlin.com/linux/v6.2-rc8/source/kernel/power/snapshot.c . " A comment stating why
> that's guaranteed would be nice" ? Hmm, perhaps this is out of my scope but I do believe in the page validity checking in the link I
> shared.
> 
> Sorry, but pointing to an entire source file (one that I've obviously
> already looked at, since I quoted a comment from it...) is not helpful.
> I don't see where restore_pblist is being checked before
> swsusp_arch_resume() is issued (from its callsite in hibernate.c).
Sure, the hibernation resume flow is shown below for your reference; a short C sketch of the restore copy loop follows the call flow. The linked-list creation and checking can be found at: https://elixir.bootlin.com/linux/v6.2/source/kernel/power/snapshot.c#L2576
software_resume()
	load_image_and_restore()
		swsusp_read()
			load_image()
 				snapshot_write_next()
					get_buffer() <-- This is the function that checks and links the pages to restore_pblist
		hibernation_restore()
			resume_target_kernel()
				swsusp_arch_resume()
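
For reference, the PBE copy loop that hibernate_core_restore_code() implements
in assembly is roughly equivalent to the C sketch below. This is only an
illustration of where restore_pblist is consumed, not code from the patch;
note that, written as a do/while like the assembly, a NULL list head would be
dereferenced before the loop condition is ever tested:

	/*
	 * Illustrative C equivalent of the assembly copy loop in
	 * hibernate_core_restore_code() (sketch only, not from the patch).
	 * struct pbe and restore_pblist are declared in <linux/suspend.h>.
	 */
	struct pbe *pbe = restore_pblist;

	do {
		/* orig_address is the destination, address holds the staged copy. */
		copy_page(pbe->orig_address, pbe->address);
		pbe = pbe->next;
	} while (pbe);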
> 
> Thanks,
> drew
Andrew Jones Feb. 24, 2023, 12:07 p.m. UTC | #8
On Fri, Feb 24, 2023 at 10:30:19AM +0000, JeeHeng Sia wrote:
> 
> 
> > -----Original Message-----
> > From: Andrew Jones <ajones@ventanamicro.com>
> > Sent: Friday, 24 February, 2023 5:55 PM
> > To: JeeHeng Sia <jeeheng.sia@starfivetech.com>
> > Cc: paul.walmsley@sifive.com; palmer@dabbelt.com; aou@eecs.berkeley.edu; linux-riscv@lists.infradead.org; linux-
> > kernel@vger.kernel.org; Leyfoon Tan <leyfoon.tan@starfivetech.com>; Mason Huo <mason.huo@starfivetech.com>
> > Subject: Re: [PATCH v4 4/4] RISC-V: Add arch functions to support hibernation/suspend-to-disk
> > 
> > On Fri, Feb 24, 2023 at 09:33:31AM +0000, JeeHeng Sia wrote:
> > >
> > >
> > > > -----Original Message-----
> > > > From: Andrew Jones <ajones@ventanamicro.com>
> > > > Sent: Friday, 24 February, 2023 5:00 PM
> > > > To: JeeHeng Sia <jeeheng.sia@starfivetech.com>
> > > > Cc: paul.walmsley@sifive.com; palmer@dabbelt.com; aou@eecs.berkeley.edu; linux-riscv@lists.infradead.org; linux-
> > > > kernel@vger.kernel.org; Leyfoon Tan <leyfoon.tan@starfivetech.com>; Mason Huo <mason.huo@starfivetech.com>
> > > > Subject: Re: [PATCH v4 4/4] RISC-V: Add arch functions to support hibernation/suspend-to-disk
> > > >
> > > > On Fri, Feb 24, 2023 at 02:05:43AM +0000, JeeHeng Sia wrote:
> > > > >
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: Andrew Jones <ajones@ventanamicro.com>
> > > > > > Sent: Friday, 24 February, 2023 2:07 AM
> > > > > > To: JeeHeng Sia <jeeheng.sia@starfivetech.com>
> > > > > > Cc: paul.walmsley@sifive.com; palmer@dabbelt.com; aou@eecs.berkeley.edu; linux-riscv@lists.infradead.org; linux-
> > > > > > kernel@vger.kernel.org; Leyfoon Tan <leyfoon.tan@starfivetech.com>; Mason Huo <mason.huo@starfivetech.com>
> > > > > > Subject: Re: [PATCH v4 4/4] RISC-V: Add arch functions to support hibernation/suspend-to-disk
> > > > > >
> > > > > > On Tue, Feb 21, 2023 at 10:35:23AM +0800, Sia Jee Heng wrote:
> > > > > > > Low level Arch functions were created to support hibernation.
> > > > > > > swsusp_arch_suspend() relies code from __cpu_suspend_enter() to write
> > > > > > > cpu state onto the stack, then calling swsusp_save() to save the memory
> > > > > > > image.
> > > > > > >
> > > > > > > Arch specific hibernation header is implemented and is utilized by the
> > > > > > > arch_hibernation_header_restore() and arch_hibernation_header_save()
> > > > > > > functions. The arch specific hibernation header consists of satp, hartid,
> > > > > > > and the cpu_resume address. The kernel built version is also need to be
> > > > > > > saved into the hibernation image header to making sure only the same
> > > > > > > kernel is restore when resume.
> > > > > > >
> > > > > > > swsusp_arch_resume() creates a temporary page table that covering only
> > > > > > > the linear map. It copies the restore code to a 'safe' page, then start
> > > > > > > to restore the memory image. Once completed, it restores the original
> > > > > > > kernel's page table. It then calls into __hibernate_cpu_resume()
> > > > > > > to restore the CPU context. Finally, it follows the normal hibernation
> > > > > > > path back to the hibernation core.
> > > > > > >
> > > > > > > To enable hibernation/suspend to disk into RISCV, the below config
> > > > > > > need to be enabled:
> > > > > > > - CONFIG_ARCH_HIBERNATION_HEADER
> > > > > > > - CONFIG_ARCH_HIBERNATION_POSSIBLE
> > > > > > >
> > > > > > > Signed-off-by: Sia Jee Heng <jeeheng.sia@starfivetech.com>
> > > > > > > Reviewed-by: Ley Foon Tan <leyfoon.tan@starfivetech.com>
> > > > > > > Reviewed-by: Mason Huo <mason.huo@starfivetech.com>
> > > > > > > ---
> > > > > > >  arch/riscv/Kconfig                 |   7 +
> > > > > > >  arch/riscv/include/asm/assembler.h |  20 ++
> > > > > > >  arch/riscv/include/asm/suspend.h   |  19 ++
> > > > > > >  arch/riscv/kernel/Makefile         |   1 +
> > > > > > >  arch/riscv/kernel/asm-offsets.c    |   5 +
> > > > > > >  arch/riscv/kernel/hibernate-asm.S  |  77 +++++
> > > > > > >  arch/riscv/kernel/hibernate.c      | 447 +++++++++++++++++++++++++++++
> > > > > > >  7 files changed, 576 insertions(+)
> > > > > > >  create mode 100644 arch/riscv/kernel/hibernate-asm.S
> > > > > > >  create mode 100644 arch/riscv/kernel/hibernate.c
> > > > > > >
> > > > > > > diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
> > > > > > > index e2b656043abf..4555848a817f 100644
> > > > > > > --- a/arch/riscv/Kconfig
> > > > > > > +++ b/arch/riscv/Kconfig
> > > > > > > @@ -690,6 +690,13 @@ menu "Power management options"
> > > > > > >
> > > > > > >  source "kernel/power/Kconfig"
> > > > > > >
> > > > > > > +config ARCH_HIBERNATION_POSSIBLE
> > > > > > > +	def_bool y
> > > > > > > +
> > > > > > > +config ARCH_HIBERNATION_HEADER
> > > > > > > +	def_bool y
> > > > > > > +	depends on HIBERNATION
> > > > > >
> > > > > > nit: I think this can be simplified as def_bool HIBERNATION
> > > > > good suggestion. will change it.
> > > > > >
> > > > > > > +
> > > > > > >  endmenu # "Power management options"
> > > > > > >
> > > > > > >  menu "CPU Power Management"
> > > > > > > diff --git a/arch/riscv/include/asm/assembler.h b/arch/riscv/include/asm/assembler.h
> > > > > > > index 727a97735493..68c46c0e0ea8 100644
> > > > > > > --- a/arch/riscv/include/asm/assembler.h
> > > > > > > +++ b/arch/riscv/include/asm/assembler.h
> > > > > > > @@ -59,4 +59,24 @@
> > > > > > >  		REG_L	s11, (SUSPEND_CONTEXT_REGS + PT_S11)(a0)
> > > > > > >  	.endm
> > > > > > >
> > > > > > > +/*
> > > > > > > + * copy_page - copy 1 page (4KB) of data from source to destination
> > > > > > > + * @a0 - destination
> > > > > > > + * @a1 - source
> > > > > > > + */
> > > > > > > +	.macro	copy_page a0, a1
> > > > > > > +		lui	a2, 0x1
> > > > > > > +		add	a2, a2, a0
> > > > > > > +1 :
> > > > > >     ^ please remove this space
> > > > > can't remove it otherwise checkpatch will throws ERROR: spaces required around that ':'
> > > >
> > > > Oh, right, labels in macros have this requirement.
> > > >
> > > > > >
> > > > > > > +		REG_L	t0, 0(a1)
> > > > > > > +		REG_L	t1, SZREG(a1)
> > > > > > > +
> > > > > > > +		REG_S	t0, 0(a0)
> > > > > > > +		REG_S	t1, SZREG(a0)
> > > > > > > +
> > > > > > > +		addi	a0, a0, 2 * SZREG
> > > > > > > +		addi	a1, a1, 2 * SZREG
> > > > > > > +		bne	a2, a0, 1b
> > > > > > > +	.endm
> > > > > > > +
> > > > > > >  #endif	/* __ASM_ASSEMBLER_H */
> > > > > > > diff --git a/arch/riscv/include/asm/suspend.h b/arch/riscv/include/asm/suspend.h
> > > > > > > index 75419c5ca272..3362da56a9d8 100644
> > > > > > > --- a/arch/riscv/include/asm/suspend.h
> > > > > > > +++ b/arch/riscv/include/asm/suspend.h
> > > > > > > @@ -21,6 +21,11 @@ struct suspend_context {
> > > > > > >  #endif
> > > > > > >  };
> > > > > > >
> > > > > > > +/*
> > > > > > > + * Used by hibernation core and cleared during resume sequence
> > > > > > > + */
> > > > > > > +extern int in_suspend;
> > > > > > > +
> > > > > > >  /* Low-level CPU suspend entry function */
> > > > > > >  int __cpu_suspend_enter(struct suspend_context *context);
> > > > > > >
> > > > > > > @@ -36,4 +41,18 @@ int __cpu_resume_enter(unsigned long hartid, unsigned long context);
> > > > > > >  /* Used to save and restore the csr */
> > > > > > >  void suspend_save_csrs(struct suspend_context *context);
> > > > > > >  void suspend_restore_csrs(struct suspend_context *context);
> > > > > > > +
> > > > > > > +/* Low-level API to support hibernation */
> > > > > > > +int swsusp_arch_suspend(void);
> > > > > > > +int swsusp_arch_resume(void);
> > > > > > > +int arch_hibernation_header_save(void *addr, unsigned int max_size);
> > > > > > > +int arch_hibernation_header_restore(void *addr);
> > > > > > > +int __hibernate_cpu_resume(void);
> > > > > > > +
> > > > > > > +/* Used to resume on the CPU we hibernated on */
> > > > > > > +int hibernate_resume_nonboot_cpu_disable(void);
> > > > > > > +
> > > > > > > +asmlinkage void hibernate_restore_image(unsigned long resume_satp, unsigned long satp_temp,
> > > > > > > +					unsigned long cpu_resume);
> > > > > > > +asmlinkage int hibernate_core_restore_code(void);
> > > > > > >  #endif
> > > > > > > diff --git a/arch/riscv/kernel/Makefile b/arch/riscv/kernel/Makefile
> > > > > > > index 4cf303a779ab..daab341d55e4 100644
> > > > > > > --- a/arch/riscv/kernel/Makefile
> > > > > > > +++ b/arch/riscv/kernel/Makefile
> > > > > > > @@ -64,6 +64,7 @@ obj-$(CONFIG_MODULES)		+= module.o
> > > > > > >  obj-$(CONFIG_MODULE_SECTIONS)	+= module-sections.o
> > > > > > >
> > > > > > >  obj-$(CONFIG_CPU_PM)		+= suspend_entry.o suspend.o
> > > > > > > +obj-$(CONFIG_HIBERNATION)	+= hibernate.o hibernate-asm.o
> > > > > > >
> > > > > > >  obj-$(CONFIG_FUNCTION_TRACER)	+= mcount.o ftrace.o
> > > > > > >  obj-$(CONFIG_DYNAMIC_FTRACE)	+= mcount-dyn.o
> > > > > > > diff --git a/arch/riscv/kernel/asm-offsets.c b/arch/riscv/kernel/asm-offsets.c
> > > > > > > index df9444397908..d6a75aac1d27 100644
> > > > > > > --- a/arch/riscv/kernel/asm-offsets.c
> > > > > > > +++ b/arch/riscv/kernel/asm-offsets.c
> > > > > > > @@ -9,6 +9,7 @@
> > > > > > >  #include <linux/kbuild.h>
> > > > > > >  #include <linux/mm.h>
> > > > > > >  #include <linux/sched.h>
> > > > > > > +#include <linux/suspend.h>
> > > > > > >  #include <asm/kvm_host.h>
> > > > > > >  #include <asm/thread_info.h>
> > > > > > >  #include <asm/ptrace.h>
> > > > > > > @@ -116,6 +117,10 @@ void asm_offsets(void)
> > > > > > >
> > > > > > >  	OFFSET(SUSPEND_CONTEXT_REGS, suspend_context, regs);
> > > > > > >
> > > > > > > +	OFFSET(HIBERN_PBE_ADDR, pbe, address);
> > > > > > > +	OFFSET(HIBERN_PBE_ORIG, pbe, orig_address);
> > > > > > > +	OFFSET(HIBERN_PBE_NEXT, pbe, next);
> > > > > > > +
> > > > > > >  	OFFSET(KVM_ARCH_GUEST_ZERO, kvm_vcpu_arch, guest_context.zero);
> > > > > > >  	OFFSET(KVM_ARCH_GUEST_RA, kvm_vcpu_arch, guest_context.ra);
> > > > > > >  	OFFSET(KVM_ARCH_GUEST_SP, kvm_vcpu_arch, guest_context.sp);
> > > > > > > diff --git a/arch/riscv/kernel/hibernate-asm.S b/arch/riscv/kernel/hibernate-asm.S
> > > > > > > new file mode 100644
> > > > > > > index 000000000000..846affe4dced
> > > > > > > --- /dev/null
> > > > > > > +++ b/arch/riscv/kernel/hibernate-asm.S
> > > > > > > @@ -0,0 +1,77 @@
> > > > > > > +/* SPDX-License-Identifier: GPL-2.0-only */
> > > > > > > +/*
> > > > > > > + * Hibernation low level support for RISCV.
> > > > > > > + *
> > > > > > > + * Copyright (C) 2023 StarFive Technology Co., Ltd.
> > > > > > > + *
> > > > > > > + * Author: Jee Heng Sia <jeeheng.sia@starfivetech.com>
> > > > > > > + */
> > > > > > > +
> > > > > > > +#include <asm/asm.h>
> > > > > > > +#include <asm/asm-offsets.h>
> > > > > > > +#include <asm/assembler.h>
> > > > > > > +#include <asm/csr.h>
> > > > > > > +
> > > > > > > +#include <linux/linkage.h>
> > > > > > > +
> > > > > > > +/*
> > > > > > > + * int __hibernate_cpu_resume(void)
> > > > > > > + * Switch back to the hibernated image's page table prior to restoring the CPU
> > > > > > > + * context.
> > > > > > > + *
> > > > > > > + * Always returns 0
> > > > > > > + */
> > > > > > > +ENTRY(__hibernate_cpu_resume)
> > > > > > > +	/* switch to hibernated image's page table. */
> > > > > > > +	csrw CSR_SATP, s0
> > > > > > > +	sfence.vma
> > > > > > > +
> > > > > > > +	REG_L	a0, hibernate_cpu_context
> > > > > > > +
> > > > > > > +	restore_csr
> > > > > > > +	restore_reg
> > > > > > > +
> > > > > > > +	/* Return zero value. */
> > > > > > > +	add	a0, zero, zero
> > > > > >
> > > > > > nit: mv a0, zero
> > > > > sure
> > > > > >
> > > > > > > +
> > > > > > > +	ret
> > > > > > > +END(__hibernate_cpu_resume)
> > > > > > > +
> > > > > > > +/*
> > > > > > > + * Prepare to restore the image.
> > > > > > > + * a0: satp of saved page tables.
> > > > > > > + * a1: satp of temporary page tables.
> > > > > > > + * a2: cpu_resume.
> > > > > > > + */
> > > > > > > +ENTRY(hibernate_restore_image)
> > > > > > > +	mv	s0, a0
> > > > > > > +	mv	s1, a1
> > > > > > > +	mv	s2, a2
> > > > > > > +	REG_L	s4, restore_pblist
> > > > > > > +	REG_L	a1, relocated_restore_code
> > > > > > > +
> > > > > > > +	jalr	a1
> > > > > > > +END(hibernate_restore_image)
> > > > > > > +
> > > > > > > +/*
> > > > > > > + * The below code will be executed from a 'safe' page.
> > > > > > > + * It first switches to the temporary page table, then starts to copy the pages
> > > > > > > + * back to the original memory location. Finally, it jumps to __hibernate_cpu_resume()
> > > > > > > + * to restore the CPU context.
> > > > > > > + */
> > > > > > > +ENTRY(hibernate_core_restore_code)
> > > > > > > +	/* switch to temp page table. */
> > > > > > > +	csrw satp, s1
> > > > > > > +	sfence.vma
> > > > > > > +.Lcopy:
> > > > > > > +	/* The below code will restore the hibernated image. */
> > > > > > > +	REG_L	a1, HIBERN_PBE_ADDR(s4)
> > > > > > > +	REG_L	a0, HIBERN_PBE_ORIG(s4)
> > > > > >
> > > > > > Are we sure restore_pblist will never be NULL?
> > > > > restore_pblist is a link-list, it will be null during initialization or during page clean up by hibernation core. During the initial
> > resume
> > > > process, the hibernation core will check the header and load the pages. If everything works correctly, the page will be linked to the
> > > > restore_pblist and then invoke swsusp_arch_resume() else hibernation core will throws error and failed to resume from the
> > > > hibernated image.
> > > >
> > > > I know restore_pblist is a linked-list and this doesn't answer the
> > > > question. The comment above restore_pblist says
> > > >
> > > > /*
> > > >  * List of PBEs needed for restoring the pages that were allocated before
> > > >  * the suspend and included in the suspend image, but have also been
> > > >  * allocated by the "resume" kernel, so their contents cannot be written
> > > >  * directly to their "original" page frames.
> > > >  */
> > > >
> > > > which implies the pages that end up on this list are "special". My
> > > > question is whether or not we're guaranteed to have at least one
> > > > of these special pages. If not, we shouldn't assume s4 is non-null.
> > > > If so, then a comment stating why that's guaranteed would be nice.
> > > The restore_pblist will not be null otherwise swsusp_arch_resume wouldn't get invoked. you can find how the link-list are link and
> > how it checks against validity at https://elixir.bootlin.com/linux/v6.2-rc8/source/kernel/power/snapshot.c . " A comment stating why
> > that's guaranteed would be nice" ? Hmm, perhaps this is out of my scope but I do believe in the page validity checking in the link I
> > shared.
> > 
> > Sorry, but pointing to an entire source file (one that I've obviously
> > already looked at, since I quoted a comment from it...) is not helpful.
> > I don't see where restore_pblist is being checked before
> > swsusp_arch_resume() is issued (from its callsite in hibernate.c).
> Sure, below shows the hibernation flow for your reference. The link-list creation and checking found at: https://elixir.bootlin.com/linux/v6.2/source/kernel/power/snapshot.c#L2576
> software_resume()
> 	load_image_and_restore()
> 		swsusp_read()
> 			load_image()
>  				snapshot_write_next()
> 					get_buffer() <-- This is the function checks and links the pages to the restore_pblist

Yup, I've read this path, including get_buffer(), where I saw that
get_buffer() can return an address without allocating a PBE. Where is the
check that restore_pblist isn't NULL, i.e. that at least one PBE has
been allocated by get_buffer(), before we call swsusp_arch_resume()?

Or, is it known that at least one page matches the criteria pointed
out in the comment below (copied from get_buffer())?

        /*
         * The "original" page frame has not been allocated and we have to
         * use a "safe" page frame to store the loaded page.
         */

If so, then which ones? And where does it state that?

Thanks,
drew
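
One way the concern above could be addressed, purely as an illustrative sketch
and not something taken from the posted patch, is an explicit guard in
swsusp_arch_resume() before the relocated restore code is entered:

	/*
	 * Hypothetical guard (sketch only): bail out early if the hibernation
	 * core staged no PBEs, so the assembly copy loop never dereferences a
	 * NULL restore_pblist.
	 */
	if (!restore_pblist) {
		pr_err("hibernate: restore_pblist is empty, aborting resume\n");
		return -EINVAL;
	}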


> 		hibernation_restore()
> 			resume_target_kernel()
> 				swsusp_arch_resume()
> > 
> > Thanks,
> > drew
Alexandre Ghiti Feb. 24, 2023, 12:29 p.m. UTC | #9
On 2/21/23 03:35, Sia Jee Heng wrote:
> Low level Arch functions were created to support hibernation.
> swsusp_arch_suspend() relies code from __cpu_suspend_enter() to write
> cpu state onto the stack, then calling swsusp_save() to save the memory
> image.
>
> Arch specific hibernation header is implemented and is utilized by the
> arch_hibernation_header_restore() and arch_hibernation_header_save()
> functions. The arch specific hibernation header consists of satp, hartid,
> and the cpu_resume address. The kernel built version is also need to be
> saved into the hibernation image header to making sure only the same
> kernel is restore when resume.
>
> swsusp_arch_resume() creates a temporary page table that covering only
> the linear map. It copies the restore code to a 'safe' page, then start
> to restore the memory image. Once completed, it restores the original
> kernel's page table. It then calls into __hibernate_cpu_resume()
> to restore the CPU context. Finally, it follows the normal hibernation
> path back to the hibernation core.
>
> To enable hibernation/suspend to disk into RISCV, the below config
> need to be enabled:
> - CONFIG_ARCH_HIBERNATION_HEADER
> - CONFIG_ARCH_HIBERNATION_POSSIBLE
>
> Signed-off-by: Sia Jee Heng <jeeheng.sia@starfivetech.com>
> Reviewed-by: Ley Foon Tan <leyfoon.tan@starfivetech.com>
> Reviewed-by: Mason Huo <mason.huo@starfivetech.com>
> ---
>   arch/riscv/Kconfig                 |   7 +
>   arch/riscv/include/asm/assembler.h |  20 ++
>   arch/riscv/include/asm/suspend.h   |  19 ++
>   arch/riscv/kernel/Makefile         |   1 +
>   arch/riscv/kernel/asm-offsets.c    |   5 +
>   arch/riscv/kernel/hibernate-asm.S  |  77 +++++
>   arch/riscv/kernel/hibernate.c      | 447 +++++++++++++++++++++++++++++
>   7 files changed, 576 insertions(+)
>   create mode 100644 arch/riscv/kernel/hibernate-asm.S
>   create mode 100644 arch/riscv/kernel/hibernate.c
>
> diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
> index e2b656043abf..4555848a817f 100644
> --- a/arch/riscv/Kconfig
> +++ b/arch/riscv/Kconfig
> @@ -690,6 +690,13 @@ menu "Power management options"
>   
>   source "kernel/power/Kconfig"
>   
> +config ARCH_HIBERNATION_POSSIBLE
> +	def_bool y
> +
> +config ARCH_HIBERNATION_HEADER
> +	def_bool y
> +	depends on HIBERNATION
> +
>   endmenu # "Power management options"
>   
>   menu "CPU Power Management"
> diff --git a/arch/riscv/include/asm/assembler.h b/arch/riscv/include/asm/assembler.h
> index 727a97735493..68c46c0e0ea8 100644
> --- a/arch/riscv/include/asm/assembler.h
> +++ b/arch/riscv/include/asm/assembler.h
> @@ -59,4 +59,24 @@
>   		REG_L	s11, (SUSPEND_CONTEXT_REGS + PT_S11)(a0)
>   	.endm
>   
> +/*
> + * copy_page - copy 1 page (4KB) of data from source to destination
> + * @a0 - destination
> + * @a1 - source
> + */
> +	.macro	copy_page a0, a1
> +		lui	a2, 0x1
> +		add	a2, a2, a0
> +1 :
> +		REG_L	t0, 0(a1)
> +		REG_L	t1, SZREG(a1)
> +
> +		REG_S	t0, 0(a0)
> +		REG_S	t1, SZREG(a0)
> +
> +		addi	a0, a0, 2 * SZREG
> +		addi	a1, a1, 2 * SZREG
> +		bne	a2, a0, 1b
> +	.endm
> +
>   #endif	/* __ASM_ASSEMBLER_H */
> diff --git a/arch/riscv/include/asm/suspend.h b/arch/riscv/include/asm/suspend.h
> index 75419c5ca272..3362da56a9d8 100644
> --- a/arch/riscv/include/asm/suspend.h
> +++ b/arch/riscv/include/asm/suspend.h
> @@ -21,6 +21,11 @@ struct suspend_context {
>   #endif
>   };
>   
> +/*
> + * Used by hibernation core and cleared during resume sequence
> + */
> +extern int in_suspend;
> +
>   /* Low-level CPU suspend entry function */
>   int __cpu_suspend_enter(struct suspend_context *context);
>   
> @@ -36,4 +41,18 @@ int __cpu_resume_enter(unsigned long hartid, unsigned long context);
>   /* Used to save and restore the csr */
>   void suspend_save_csrs(struct suspend_context *context);
>   void suspend_restore_csrs(struct suspend_context *context);
> +
> +/* Low-level API to support hibernation */
> +int swsusp_arch_suspend(void);
> +int swsusp_arch_resume(void);
> +int arch_hibernation_header_save(void *addr, unsigned int max_size);
> +int arch_hibernation_header_restore(void *addr);
> +int __hibernate_cpu_resume(void);
> +
> +/* Used to resume on the CPU we hibernated on */
> +int hibernate_resume_nonboot_cpu_disable(void);
> +
> +asmlinkage void hibernate_restore_image(unsigned long resume_satp, unsigned long satp_temp,
> +					unsigned long cpu_resume);
> +asmlinkage int hibernate_core_restore_code(void);
>   #endif
> diff --git a/arch/riscv/kernel/Makefile b/arch/riscv/kernel/Makefile
> index 4cf303a779ab..daab341d55e4 100644
> --- a/arch/riscv/kernel/Makefile
> +++ b/arch/riscv/kernel/Makefile
> @@ -64,6 +64,7 @@ obj-$(CONFIG_MODULES)		+= module.o
>   obj-$(CONFIG_MODULE_SECTIONS)	+= module-sections.o
>   
>   obj-$(CONFIG_CPU_PM)		+= suspend_entry.o suspend.o
> +obj-$(CONFIG_HIBERNATION)	+= hibernate.o hibernate-asm.o
>   
>   obj-$(CONFIG_FUNCTION_TRACER)	+= mcount.o ftrace.o
>   obj-$(CONFIG_DYNAMIC_FTRACE)	+= mcount-dyn.o
> diff --git a/arch/riscv/kernel/asm-offsets.c b/arch/riscv/kernel/asm-offsets.c
> index df9444397908..d6a75aac1d27 100644
> --- a/arch/riscv/kernel/asm-offsets.c
> +++ b/arch/riscv/kernel/asm-offsets.c
> @@ -9,6 +9,7 @@
>   #include <linux/kbuild.h>
>   #include <linux/mm.h>
>   #include <linux/sched.h>
> +#include <linux/suspend.h>
>   #include <asm/kvm_host.h>
>   #include <asm/thread_info.h>
>   #include <asm/ptrace.h>
> @@ -116,6 +117,10 @@ void asm_offsets(void)
>   
>   	OFFSET(SUSPEND_CONTEXT_REGS, suspend_context, regs);
>   
> +	OFFSET(HIBERN_PBE_ADDR, pbe, address);
> +	OFFSET(HIBERN_PBE_ORIG, pbe, orig_address);
> +	OFFSET(HIBERN_PBE_NEXT, pbe, next);
> +
>   	OFFSET(KVM_ARCH_GUEST_ZERO, kvm_vcpu_arch, guest_context.zero);
>   	OFFSET(KVM_ARCH_GUEST_RA, kvm_vcpu_arch, guest_context.ra);
>   	OFFSET(KVM_ARCH_GUEST_SP, kvm_vcpu_arch, guest_context.sp);
> diff --git a/arch/riscv/kernel/hibernate-asm.S b/arch/riscv/kernel/hibernate-asm.S
> new file mode 100644
> index 000000000000..846affe4dced
> --- /dev/null
> +++ b/arch/riscv/kernel/hibernate-asm.S
> @@ -0,0 +1,77 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/*
> + * Hibernation low level support for RISCV.
> + *
> + * Copyright (C) 2023 StarFive Technology Co., Ltd.
> + *
> + * Author: Jee Heng Sia <jeeheng.sia@starfivetech.com>
> + */
> +
> +#include <asm/asm.h>
> +#include <asm/asm-offsets.h>
> +#include <asm/assembler.h>
> +#include <asm/csr.h>
> +
> +#include <linux/linkage.h>
> +
> +/*
> + * int __hibernate_cpu_resume(void)
> + * Switch back to the hibernated image's page table prior to restoring the CPU
> + * context.
> + *
> + * Always returns 0
> + */
> +ENTRY(__hibernate_cpu_resume)
> +	/* switch to hibernated image's page table. */
> +	csrw CSR_SATP, s0
> +	sfence.vma
> +
> +	REG_L	a0, hibernate_cpu_context
> +
> +	restore_csr
> +	restore_reg
> +
> +	/* Return zero value. */
> +	add	a0, zero, zero
> +
> +	ret
> +END(__hibernate_cpu_resume)
> +
> +/*
> + * Prepare to restore the image.
> + * a0: satp of saved page tables.
> + * a1: satp of temporary page tables.
> + * a2: cpu_resume.
> + */
> +ENTRY(hibernate_restore_image)
> +	mv	s0, a0
> +	mv	s1, a1
> +	mv	s2, a2
> +	REG_L	s4, restore_pblist
> +	REG_L	a1, relocated_restore_code
> +
> +	jalr	a1
> +END(hibernate_restore_image)
> +
> +/*
> + * The below code will be executed from a 'safe' page.
> + * It first switches to the temporary page table, then starts to copy the pages
> + * back to the original memory location. Finally, it jumps to __hibernate_cpu_resume()
> + * to restore the CPU context.
> + */
> +ENTRY(hibernate_core_restore_code)
> +	/* switch to temp page table. */
> +	csrw satp, s1
> +	sfence.vma
> +.Lcopy:
> +	/* The below code will restore the hibernated image. */
> +	REG_L	a1, HIBERN_PBE_ADDR(s4)
> +	REG_L	a0, HIBERN_PBE_ORIG(s4)
> +
> +	copy_page a0, a1
> +
> +	REG_L	s4, HIBERN_PBE_NEXT(s4)
> +	bnez	s4, .Lcopy
> +
> +	jalr	s2
> +END(hibernate_core_restore_code)
> diff --git a/arch/riscv/kernel/hibernate.c b/arch/riscv/kernel/hibernate.c
> new file mode 100644
> index 000000000000..46a2f470db6e
> --- /dev/null
> +++ b/arch/riscv/kernel/hibernate.c
> @@ -0,0 +1,447 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Hibernation support for RISCV
> + *
> + * Copyright (C) 2023 StarFive Technology Co., Ltd.
> + *
> + * Author: Jee Heng Sia <jeeheng.sia@starfivetech.com>
> + */
> +
> +#include <asm/barrier.h>
> +#include <asm/cacheflush.h>
> +#include <asm/mmu_context.h>
> +#include <asm/page.h>
> +#include <asm/pgalloc.h>
> +#include <asm/pgtable.h>
> +#include <asm/sections.h>
> +#include <asm/set_memory.h>
> +#include <asm/smp.h>
> +#include <asm/suspend.h>
> +
> +#include <linux/cpu.h>
> +#include <linux/memblock.h>
> +#include <linux/pm.h>
> +#include <linux/sched.h>
> +#include <linux/suspend.h>
> +#include <linux/utsname.h>
> +
> +/* The logical cpu number we should resume on, initialised to a non-cpu number. */
> +static int sleep_cpu = -EINVAL;
> +
> +/* Pointer to the temporary resume page table. */
> +static pgd_t *resume_pg_dir;
> +
> +/* CPU context to be saved. */
> +struct suspend_context *hibernate_cpu_context;
> +EXPORT_SYMBOL_GPL(hibernate_cpu_context);
> +
> +unsigned long relocated_restore_code;
> +EXPORT_SYMBOL_GPL(relocated_restore_code);
> +
> +/**
> + * struct arch_hibernate_hdr_invariants - container to store kernel build version.
> + * @uts_version: to save the build number and date so that the we do not resume with
> + *		a different kernel.
> + */
> +struct arch_hibernate_hdr_invariants {
> +	char		uts_version[__NEW_UTS_LEN + 1];
> +};
> +
> +/**
> + * struct arch_hibernate_hdr - helper parameters that help us to restore the image.
> + * @invariants: container to store kernel build version.
> + * @hartid: to make sure same boot_cpu executes the hibernate/restore code.
> + * @saved_satp: original page table used by the hibernated image.
> + * @restore_cpu_addr: the kernel's image address to restore the CPU context.
> + */
> +static struct arch_hibernate_hdr {
> +	struct arch_hibernate_hdr_invariants invariants;
> +	unsigned long	hartid;
> +	unsigned long	saved_satp;
> +	unsigned long	restore_cpu_addr;
> +} resume_hdr;
> +
> +static inline void arch_hdr_invariants(struct arch_hibernate_hdr_invariants *i)
> +{
> +	memset(i, 0, sizeof(*i));
> +	memcpy(i->uts_version, init_utsname()->version, sizeof(i->uts_version));
> +}
> +
> +/*
> + * Check if the given pfn is in the 'nosave' section.
> + */
> +int pfn_is_nosave(unsigned long pfn)
> +{
> +	unsigned long nosave_begin_pfn = sym_to_pfn(&__nosave_begin);
> +	unsigned long nosave_end_pfn = sym_to_pfn(&__nosave_end - 1);
> +
> +	return ((pfn >= nosave_begin_pfn) && (pfn <= nosave_end_pfn));
> +}
> +
> +void notrace save_processor_state(void)
> +{
> +	WARN_ON(num_online_cpus() != 1);
> +}
> +
> +void notrace restore_processor_state(void)
> +{
> +}
> +
> +/*
> + * Helper parameters need to be saved to the hibernation image header.
> + */
> +int arch_hibernation_header_save(void *addr, unsigned int max_size)
> +{
> +	struct arch_hibernate_hdr *hdr = addr;
> +
> +	if (max_size < sizeof(*hdr))
> +		return -EOVERFLOW;
> +
> +	arch_hdr_invariants(&hdr->invariants);
> +
> +	hdr->hartid = cpuid_to_hartid_map(sleep_cpu);
> +	hdr->saved_satp = csr_read(CSR_SATP);
> +	hdr->restore_cpu_addr = (unsigned long)__hibernate_cpu_resume;
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(arch_hibernation_header_save);
> +
> +/*
> + * Retrieve the helper parameters from the hibernation image header.
> + */
> +int arch_hibernation_header_restore(void *addr)
> +{
> +	struct arch_hibernate_hdr_invariants invariants;
> +	struct arch_hibernate_hdr *hdr = addr;
> +	int ret = 0;
> +
> +	arch_hdr_invariants(&invariants);
> +
> +	if (memcmp(&hdr->invariants, &invariants, sizeof(invariants))) {
> +		pr_crit("Hibernate image not generated by this kernel!\n");
> +		return -EINVAL;
> +	}
> +
> +	sleep_cpu = riscv_hartid_to_cpuid(hdr->hartid);
> +	if (sleep_cpu < 0) {
> +		pr_crit("Hibernated on a CPU not known to this kernel!\n");
> +		sleep_cpu = -EINVAL;
> +		return -EINVAL;
> +	}
> +
> +#ifdef CONFIG_SMP
> +	ret = bringup_hibernate_cpu(sleep_cpu);
> +	if (ret) {
> +		sleep_cpu = -EINVAL;
> +		return ret;
> +	}
> +#endif
> +	resume_hdr = *hdr;
> +
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(arch_hibernation_header_restore);
> +
> +int swsusp_arch_suspend(void)
> +{
> +	int ret = 0;
> +
> +	if (__cpu_suspend_enter(hibernate_cpu_context)) {
> +		sleep_cpu = smp_processor_id();
> +		suspend_save_csrs(hibernate_cpu_context);
> +		ret = swsusp_save();
> +	} else {
> +		suspend_restore_csrs(hibernate_cpu_context);
> +		flush_tlb_all();
> +		flush_icache_all();
> +
> +		/*
> +		 * Tell the hibernation core that we've just restored the memory.
> +		 */
> +		in_suspend = 0;
> +		sleep_cpu = -EINVAL;
> +	}
> +
> +	return ret;
> +}
> +
> +static unsigned long _temp_pgtable_map_pte(pte_t *dst_ptep, pte_t *src_ptep,
> +					   unsigned long addr, pgprot_t prot)
> +{
> +	pte_t pte = READ_ONCE(*src_ptep);
> +
> +	if (pte_present(pte))
> +		set_pte(dst_ptep, __pte(pte_val(pte) | pgprot_val(prot)));
> +
> +	return 0;
> +}


I don't see the need for this function
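
A possible simplification, sketched directly from the quoted code: fold the
single-pte copy into the walk loop of temp_pgtable_map_pte() and drop the
helper, e.g.

	do {
		pte_t pte = READ_ONCE(*src_ptep);

		/* Copy only present entries, applying the extra protection bits. */
		if (pte_present(pte))
			set_pte(dst_ptep, __pte(pte_val(pte) | pgprot_val(prot)));
	} while (dst_ptep++, src_ptep++, addr += PAGE_SIZE, addr < end);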


> +
> +static unsigned long temp_pgtable_map_pte(pmd_t *dst_pmdp, pmd_t *src_pmdp,
> +					  unsigned long start, unsigned long end,
> +					  pgprot_t prot)
> +{
> +	unsigned long addr = start;
> +	pte_t *src_ptep;
> +	pte_t *dst_ptep;
> +
> +	if (pmd_none(READ_ONCE(*dst_pmdp))) {
> +		dst_ptep = (pte_t *)get_safe_page(GFP_ATOMIC);
> +		if (!dst_ptep)
> +			return -ENOMEM;
> +
> +		pmd_populate_kernel(NULL, dst_pmdp, dst_ptep);
> +	}
> +
> +	dst_ptep = pte_offset_kernel(dst_pmdp, start);
> +	src_ptep = pte_offset_kernel(src_pmdp, start);
> +
> +	do {
> +		_temp_pgtable_map_pte(dst_ptep, src_ptep, addr, prot);
> +	} while (dst_ptep++, src_ptep++, addr += PAGE_SIZE, addr < end);
> +
> +	return 0;
> +}
> +
> +static unsigned long temp_pgtable_map_pmd(pud_t *dst_pudp, pud_t *src_pudp,
> +					  unsigned long start, unsigned long end,
> +					  pgprot_t prot)
> +{
> +	unsigned long addr = start;
> +	unsigned long next;
> +	unsigned long ret;
> +	pmd_t *src_pmdp;
> +	pmd_t *dst_pmdp;
> +
> +	if (pud_none(READ_ONCE(*dst_pudp))) {
> +		dst_pmdp = (pmd_t *)get_safe_page(GFP_ATOMIC);
> +		if (!dst_pmdp)
> +			return -ENOMEM;
> +
> +		pud_populate(NULL, dst_pudp, dst_pmdp);
> +	}
> +
> +	dst_pmdp = pmd_offset(dst_pudp, start);
> +	src_pmdp = pmd_offset(src_pudp, start);
> +
> +	do {
> +		pmd_t pmd = READ_ONCE(*src_pmdp);
> +
> +		next = pmd_addr_end(addr, end);
> +
> +		if (pmd_none(pmd))
> +			continue;
> +
> +		if (pmd_leaf(pmd)) {
> +			set_pmd(dst_pmdp, __pmd(pmd_val(pmd) | pgprot_val(prot)));
> +		} else {
> +			ret = temp_pgtable_map_pte(dst_pmdp, src_pmdp, addr, next, prot);
> +			if (ret)
> +				return -ENOMEM;
> +		}
> +	} while (dst_pmdp++, src_pmdp++, addr = next, addr != end);
> +
> +	return 0;
> +}
> +
> +static unsigned long temp_pgtable_map_pud(p4d_t *dst_p4dp, p4d_t *src_p4dp,
> +					  unsigned long start,
> +					  unsigned long end, pgprot_t prot)
> +{
> +	unsigned long addr = start;
> +	unsigned long next;
> +	unsigned long ret;
> +	pud_t *dst_pudp;
> +	pud_t *src_pudp;
> +
> +	if (p4d_none(READ_ONCE(*dst_p4dp))) {
> +		dst_pudp = (pud_t *)get_safe_page(GFP_ATOMIC);
> +		if (!dst_pudp)
> +			return -ENOMEM;
> +
> +		p4d_populate(NULL, dst_p4dp, dst_pudp);
> +	}
> +
> +	dst_pudp = pud_offset(dst_p4dp, start);
> +	src_pudp = pud_offset(src_p4dp, start);
> +
> +	do {
> +		pud_t pud = READ_ONCE(*src_pudp);
> +
> +		next = pud_addr_end(addr, end);
> +
> +		if (pud_none(pud))
> +			continue;
> +
> +		if (pud_leaf(pud)) {
> +			set_pud(dst_pudp, __pud(pud_val(pud) | pgprot_val(prot)));
> +		} else {
> +			ret = temp_pgtable_map_pmd(dst_pudp, src_pudp, addr, next, prot);
> +			if (ret)
> +				return -ENOMEM;
> +		}
> +	} while (dst_pudp++, src_pudp++, addr = next, addr != end);
> +
> +	return 0;
> +}
> +
> +static unsigned long temp_pgtable_map_p4d(pgd_t *dst_pgdp, pgd_t *src_pgdp,
> +					  unsigned long start, unsigned long end,
> +					  pgprot_t prot)
> +{
> +	unsigned long addr = start;


Nit: you don't need the addr variable; you can rename start to addr
and work with it directly.


> +	unsigned long next;
> +	unsigned long ret;
> +	p4d_t *dst_p4dp;
> +	p4d_t *src_p4dp;
> +
> +	if (pgd_none(READ_ONCE(*dst_pgdp))) {
> +		dst_p4dp = (p4d_t *)get_safe_page(GFP_ATOMIC);
> +		if (!dst_p4dp)
> +			return -ENOMEM;
> +
> +		pgd_populate(NULL, dst_pgdp, dst_p4dp);
> +	}
> +
> +	dst_p4dp = p4d_offset(dst_pgdp, start);
> +	src_p4dp = p4d_offset(src_pgdp, start);
> +
> +	do {
> +		p4d_t p4d = READ_ONCE(*src_p4dp);
> +
> +		next = p4d_addr_end(addr, end);
> +
> +		if (p4d_none(READ_ONCE(*src_p4dp)))


You should use p4d here: p4d_none(p4d)
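
i.e. the loop could reuse the value it has already read; sketch below, which
is the quoted code apart from the p4d_none() argument:

	do {
		p4d_t p4d = READ_ONCE(*src_p4dp);

		next = p4d_addr_end(addr, end);

		/* Test the value read above instead of re-reading *src_p4dp. */
		if (p4d_none(p4d))
			continue;

		if (p4d_leaf(p4d)) {
			set_p4d(dst_p4dp, __p4d(p4d_val(p4d) | pgprot_val(prot)));
		} else {
			ret = temp_pgtable_map_pud(dst_p4dp, src_p4dp, addr, next, prot);
			if (ret)
				return -ENOMEM;
		}
	} while (dst_p4dp++, src_p4dp++, addr = next, addr != end);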


> +			continue;
> +
> +		if (p4d_leaf(p4d)) {
> +			set_p4d(dst_p4dp, __p4d(p4d_val(p4d) | pgprot_val(prot)));


The "| pgprot_val(prot)" happens to work because PAGE_KERNEL will add 
the PAGE_WRITE bit: I'd rather make it more clear by explicitly add 
PAGE_WRITE.
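
Something along these lines is what I have in mind (sketch only; this assumes
the riscv _PAGE_WRITE bit from asm/pgtable-bits.h, and the same change would
apply to the pud/pmd leaf cases):

	/* Make the write intent explicit rather than relying on PAGE_KERNEL. */
	set_p4d(dst_p4dp, __p4d(p4d_val(p4d) | pgprot_val(prot) | _PAGE_WRITE));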


> +		} else {
> +			ret = temp_pgtable_map_pud(dst_p4dp, src_p4dp, addr, next, prot);
> +			if (ret)
> +				return -ENOMEM;
> +		}
> +	} while (dst_p4dp++, src_p4dp++, addr = next, addr != end);
> +
> +	return 0;
> +}
> +
> +static unsigned long temp_pgtable_mapping(pgd_t *pgdp)
> +{
> +	unsigned long end = (unsigned long)pfn_to_virt(max_low_pfn);
> +	unsigned long addr = PAGE_OFFSET;
> +	pgd_t *dst_pgdp = pgd_offset_pgd(pgdp, addr);
> +	pgd_t *src_pgdp = pgd_offset_k(addr);
> +	unsigned long next;
> +
> +	do {
> +		next = pgd_addr_end(addr, end);
> +		if (pgd_none(READ_ONCE(*src_pgdp)))
> +			continue;
> +


We added the pgd_leaf test in kernel_page_present; let's add it here too.
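
For illustration, the pgd-level loop could then mirror the leaf handling done
at the lower levels, along these lines (sketch only; WRITE_ONCE() is used here
just to keep the fragment self-contained, the usual pgd setter would do):

	do {
		pgd_t pgd = READ_ONCE(*src_pgdp);

		next = pgd_addr_end(addr, end);
		if (pgd_none(pgd))
			continue;

		/* Copy a leaf pgd with the extra permission bits, as for pud/pmd leaves. */
		if (pgd_leaf(pgd)) {
			WRITE_ONCE(*dst_pgdp, __pgd(pgd_val(pgd) | pgprot_val(PAGE_KERNEL)));
			continue;
		}

		if (temp_pgtable_map_p4d(dst_pgdp, src_pgdp, addr, next, PAGE_KERNEL))
			return -ENOMEM;
	} while (dst_pgdp++, src_pgdp++, addr = next, addr != end);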


> +		if (temp_pgtable_map_p4d(dst_pgdp, src_pgdp, addr, next, PAGE_KERNEL))
> +			return -ENOMEM;
> +	} while (dst_pgdp++, src_pgdp++, addr = next, addr != end);
> +
> +	return 0;
> +}
> +
> +static unsigned long temp_pgtable_text_mapping(pgd_t *pgdp, unsigned long addr)
> +{
> +	pgd_t *dst_pgdp = pgd_offset_pgd(pgdp, addr);
> +	pgd_t *src_pgdp = pgd_offset_k(addr);
> +
> +	if (pgd_none(READ_ONCE(*src_pgdp)))
> +		return -EFAULT;
> +
> +	if (temp_pgtable_map_p4d(dst_pgdp, src_pgdp, addr, addr, PAGE_KERNEL_EXEC))
> +		return -ENOMEM;
> +
> +	return 0;
> +}


OK, so if we fall into a huge mapping, you add the exec permission to the
whole range, which could easily be 1GB. I think we can either avoid this step
by mapping the whole linear mapping as executable, or actually use another pgd
entry for this page, one that is not in the linear mapping. The latter seems
cleaner, what do you think?


> +
> +static unsigned long relocate_restore_code(void)
> +{
> +	unsigned long ret;
> +	void *page = (void *)get_safe_page(GFP_ATOMIC);
> +
> +	if (!page)
> +		return -ENOMEM;
> +
> +	copy_page(page, hibernate_core_restore_code);
> +
> +	/* Make the page containing the relocated code executable. */
> +	set_memory_x((unsigned long)page, 1);
> +
> +	ret = temp_pgtable_text_mapping(resume_pg_dir, (unsigned long)page);
> +	if (ret)
> +		return ret;
> +
> +	return (unsigned long)page;
> +}
> +
> +int swsusp_arch_resume(void)
> +{
> +	unsigned long ret;
> +
> +	/*
> +	 * Memory allocated by get_safe_page() will be dealt with by the hibernation core,
> +	 * we don't need to free it here.
> +	 */
> +	resume_pg_dir = (pgd_t *)get_safe_page(GFP_ATOMIC);
> +	if (!resume_pg_dir)
> +		return -ENOMEM;
> +
> +	/*
> +	 * The pages need to be writable when restoring the image.
> +	 * Create a second copy of page table just for the linear map.
> +	 * Use this temporary page table to restore the image.
> +	 */
> +	ret = temp_pgtable_mapping(resume_pg_dir);
> +	if (ret)
> +		return (int)ret;


The temp_pgtable* functions should return an int to avoid this cast.
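
i.e. turn the walkers into (sketch):

	static int temp_pgtable_mapping(pgd_t *pgdp);
	static int temp_pgtable_map_p4d(pgd_t *dst_pgdp, pgd_t *src_pgdp,
					unsigned long start, unsigned long end,
					pgprot_t prot);

and the caller becomes:

	int ret;

	ret = temp_pgtable_mapping(resume_pg_dir);
	if (ret)
		return ret;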


> +
> +	/* Move the restore code to a new page so that it doesn't get overwritten by itself. */
> +	relocated_restore_code = relocate_restore_code();
> +	if (relocated_restore_code == -ENOMEM)
> +		return -ENOMEM;
> +
> +	/*
> +	 * Map the __hibernate_cpu_resume() address to the temporary page table so that the
> +	 * restore code can jumps to it after finished restore the image. The next execution
> +	 * code doesn't find itself in a different address space after switching over to the
> +	 * original page table used by the hibernated image.
> +	 */
> +	ret = temp_pgtable_text_mapping(resume_pg_dir, (unsigned long)resume_hdr.restore_cpu_addr);
> +	if (ret)
> +		return ret;
> +
> +	hibernate_restore_image(resume_hdr.saved_satp, (PFN_DOWN(__pa(resume_pg_dir)) | satp_mode),
> +				resume_hdr.restore_cpu_addr);
> +
> +	return 0;
> +}
> +
> +#ifdef CONFIG_PM_SLEEP_SMP
> +int hibernate_resume_nonboot_cpu_disable(void)
> +{
> +	if (sleep_cpu < 0) {
> +		pr_err("Failing to resume from hibernate on an unknown CPU\n");
> +		return -ENODEV;
> +	}
> +
> +	return freeze_secondary_cpus(sleep_cpu);
> +}
> +#endif
> +
> +static int __init riscv_hibernate_init(void)
> +{
> +	hibernate_cpu_context = kzalloc(sizeof(*hibernate_cpu_context), GFP_KERNEL);
> +
> +	if (WARN_ON(!hibernate_cpu_context))
> +		return -ENOMEM;
> +
> +	return 0;
> +}
> +
> +early_initcall(riscv_hibernate_init);


Overall, it is now nicer with the proper page table walk: but we can
now see that the code is exactly the same as arm64's, so what prevents
us from merging both somewhere in mm/?
Sia Jee Heng Feb. 27, 2023, 2:14 a.m. UTC | #10
> -----Original Message-----
> From: Andrew Jones <ajones@ventanamicro.com>
> Sent: Friday, 24 February, 2023 8:07 PM
> To: JeeHeng Sia <jeeheng.sia@starfivetech.com>
> Cc: paul.walmsley@sifive.com; palmer@dabbelt.com; aou@eecs.berkeley.edu; linux-riscv@lists.infradead.org; linux-
> kernel@vger.kernel.org; Leyfoon Tan <leyfoon.tan@starfivetech.com>; Mason Huo <mason.huo@starfivetech.com>
> Subject: Re: [PATCH v4 4/4] RISC-V: Add arch functions to support hibernation/suspend-to-disk
> 
> On Fri, Feb 24, 2023 at 10:30:19AM +0000, JeeHeng Sia wrote:
> >
> >
> > > -----Original Message-----
> > > From: Andrew Jones <ajones@ventanamicro.com>
> > > Sent: Friday, 24 February, 2023 5:55 PM
> > > To: JeeHeng Sia <jeeheng.sia@starfivetech.com>
> > > Cc: paul.walmsley@sifive.com; palmer@dabbelt.com; aou@eecs.berkeley.edu; linux-riscv@lists.infradead.org; linux-
> > > kernel@vger.kernel.org; Leyfoon Tan <leyfoon.tan@starfivetech.com>; Mason Huo <mason.huo@starfivetech.com>
> > > Subject: Re: [PATCH v4 4/4] RISC-V: Add arch functions to support hibernation/suspend-to-disk
> > >
> > > On Fri, Feb 24, 2023 at 09:33:31AM +0000, JeeHeng Sia wrote:
> > > >
> > > >
> > > > > -----Original Message-----
> > > > > From: Andrew Jones <ajones@ventanamicro.com>
> > > > > Sent: Friday, 24 February, 2023 5:00 PM
> > > > > To: JeeHeng Sia <jeeheng.sia@starfivetech.com>
> > > > > Cc: paul.walmsley@sifive.com; palmer@dabbelt.com; aou@eecs.berkeley.edu; linux-riscv@lists.infradead.org; linux-
> > > > > kernel@vger.kernel.org; Leyfoon Tan <leyfoon.tan@starfivetech.com>; Mason Huo <mason.huo@starfivetech.com>
> > > > > Subject: Re: [PATCH v4 4/4] RISC-V: Add arch functions to support hibernation/suspend-to-disk
> > > > >
> > > > > On Fri, Feb 24, 2023 at 02:05:43AM +0000, JeeHeng Sia wrote:
> > > > > >
> > > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: Andrew Jones <ajones@ventanamicro.com>
> > > > > > > Sent: Friday, 24 February, 2023 2:07 AM
> > > > > > > To: JeeHeng Sia <jeeheng.sia@starfivetech.com>
> > > > > > > Cc: paul.walmsley@sifive.com; palmer@dabbelt.com; aou@eecs.berkeley.edu; linux-riscv@lists.infradead.org; linux-
> > > > > > > kernel@vger.kernel.org; Leyfoon Tan <leyfoon.tan@starfivetech.com>; Mason Huo <mason.huo@starfivetech.com>
> > > > > > > Subject: Re: [PATCH v4 4/4] RISC-V: Add arch functions to support hibernation/suspend-to-disk
> > > > > > >
> > > > > > > On Tue, Feb 21, 2023 at 10:35:23AM +0800, Sia Jee Heng wrote:
> > > > > > > > Low level Arch functions were created to support hibernation.
> > > > > > > > swsusp_arch_suspend() relies code from __cpu_suspend_enter() to write
> > > > > > > > cpu state onto the stack, then calling swsusp_save() to save the memory
> > > > > > > > image.
> > > > > > > >
> > > > > > > > Arch specific hibernation header is implemented and is utilized by the
> > > > > > > > arch_hibernation_header_restore() and arch_hibernation_header_save()
> > > > > > > > functions. The arch specific hibernation header consists of satp, hartid,
> > > > > > > > and the cpu_resume address. The kernel built version is also need to be
> > > > > > > > saved into the hibernation image header to making sure only the same
> > > > > > > > kernel is restore when resume.
> > > > > > > >
> > > > > > > > swsusp_arch_resume() creates a temporary page table that covering only
> > > > > > > > the linear map. It copies the restore code to a 'safe' page, then start
> > > > > > > > to restore the memory image. Once completed, it restores the original
> > > > > > > > kernel's page table. It then calls into __hibernate_cpu_resume()
> > > > > > > > to restore the CPU context. Finally, it follows the normal hibernation
> > > > > > > > path back to the hibernation core.
> > > > > > > >
> > > > > > > > To enable hibernation/suspend to disk into RISCV, the below config
> > > > > > > > need to be enabled:
> > > > > > > > - CONFIG_ARCH_HIBERNATION_HEADER
> > > > > > > > - CONFIG_ARCH_HIBERNATION_POSSIBLE
> > > > > > > >
> > > > > > > > Signed-off-by: Sia Jee Heng <jeeheng.sia@starfivetech.com>
> > > > > > > > Reviewed-by: Ley Foon Tan <leyfoon.tan@starfivetech.com>
> > > > > > > > Reviewed-by: Mason Huo <mason.huo@starfivetech.com>
> > > > > > > > ---
> > > > > > > >  arch/riscv/Kconfig                 |   7 +
> > > > > > > >  arch/riscv/include/asm/assembler.h |  20 ++
> > > > > > > >  arch/riscv/include/asm/suspend.h   |  19 ++
> > > > > > > >  arch/riscv/kernel/Makefile         |   1 +
> > > > > > > >  arch/riscv/kernel/asm-offsets.c    |   5 +
> > > > > > > >  arch/riscv/kernel/hibernate-asm.S  |  77 +++++
> > > > > > > >  arch/riscv/kernel/hibernate.c      | 447 +++++++++++++++++++++++++++++
> > > > > > > >  7 files changed, 576 insertions(+)
> > > > > > > >  create mode 100644 arch/riscv/kernel/hibernate-asm.S
> > > > > > > >  create mode 100644 arch/riscv/kernel/hibernate.c
> > > > > > > >
> > > > > > > > diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
> > > > > > > > index e2b656043abf..4555848a817f 100644
> > > > > > > > --- a/arch/riscv/Kconfig
> > > > > > > > +++ b/arch/riscv/Kconfig
> > > > > > > > @@ -690,6 +690,13 @@ menu "Power management options"
> > > > > > > >
> > > > > > > >  source "kernel/power/Kconfig"
> > > > > > > >
> > > > > > > > +config ARCH_HIBERNATION_POSSIBLE
> > > > > > > > +	def_bool y
> > > > > > > > +
> > > > > > > > +config ARCH_HIBERNATION_HEADER
> > > > > > > > +	def_bool y
> > > > > > > > +	depends on HIBERNATION
> > > > > > >
> > > > > > > nit: I think this can be simplified as def_bool HIBERNATION
> > > > > > good suggestion. will change it.
> > > > > > >
> > > > > > > > +
> > > > > > > >  endmenu # "Power management options"
> > > > > > > >
> > > > > > > >  menu "CPU Power Management"
> > > > > > > > diff --git a/arch/riscv/include/asm/assembler.h b/arch/riscv/include/asm/assembler.h
> > > > > > > > index 727a97735493..68c46c0e0ea8 100644
> > > > > > > > --- a/arch/riscv/include/asm/assembler.h
> > > > > > > > +++ b/arch/riscv/include/asm/assembler.h
> > > > > > > > @@ -59,4 +59,24 @@
> > > > > > > >  		REG_L	s11, (SUSPEND_CONTEXT_REGS + PT_S11)(a0)
> > > > > > > >  	.endm
> > > > > > > >
> > > > > > > > +/*
> > > > > > > > + * copy_page - copy 1 page (4KB) of data from source to destination
> > > > > > > > + * @a0 - destination
> > > > > > > > + * @a1 - source
> > > > > > > > + */
> > > > > > > > +	.macro	copy_page a0, a1
> > > > > > > > +		lui	a2, 0x1
> > > > > > > > +		add	a2, a2, a0
> > > > > > > > +1 :
> > > > > > >     ^ please remove this space
> > > > > > can't remove it otherwise checkpatch will throws ERROR: spaces required around that ':'
> > > > >
> > > > > Oh, right, labels in macros have this requirement.
> > > > >
> > > > > > >
> > > > > > > > +		REG_L	t0, 0(a1)
> > > > > > > > +		REG_L	t1, SZREG(a1)
> > > > > > > > +
> > > > > > > > +		REG_S	t0, 0(a0)
> > > > > > > > +		REG_S	t1, SZREG(a0)
> > > > > > > > +
> > > > > > > > +		addi	a0, a0, 2 * SZREG
> > > > > > > > +		addi	a1, a1, 2 * SZREG
> > > > > > > > +		bne	a2, a0, 1b
> > > > > > > > +	.endm
> > > > > > > > +
> > > > > > > >  #endif	/* __ASM_ASSEMBLER_H */
> > > > > > > > diff --git a/arch/riscv/include/asm/suspend.h b/arch/riscv/include/asm/suspend.h
> > > > > > > > index 75419c5ca272..3362da56a9d8 100644
> > > > > > > > --- a/arch/riscv/include/asm/suspend.h
> > > > > > > > +++ b/arch/riscv/include/asm/suspend.h
> > > > > > > > @@ -21,6 +21,11 @@ struct suspend_context {
> > > > > > > >  #endif
> > > > > > > >  };
> > > > > > > >
> > > > > > > > +/*
> > > > > > > > + * Used by hibernation core and cleared during resume sequence
> > > > > > > > + */
> > > > > > > > +extern int in_suspend;
> > > > > > > > +
> > > > > > > >  /* Low-level CPU suspend entry function */
> > > > > > > >  int __cpu_suspend_enter(struct suspend_context *context);
> > > > > > > >
> > > > > > > > @@ -36,4 +41,18 @@ int __cpu_resume_enter(unsigned long hartid, unsigned long context);
> > > > > > > >  /* Used to save and restore the csr */
> > > > > > > >  void suspend_save_csrs(struct suspend_context *context);
> > > > > > > >  void suspend_restore_csrs(struct suspend_context *context);
> > > > > > > > +
> > > > > > > > +/* Low-level API to support hibernation */
> > > > > > > > +int swsusp_arch_suspend(void);
> > > > > > > > +int swsusp_arch_resume(void);
> > > > > > > > +int arch_hibernation_header_save(void *addr, unsigned int max_size);
> > > > > > > > +int arch_hibernation_header_restore(void *addr);
> > > > > > > > +int __hibernate_cpu_resume(void);
> > > > > > > > +
> > > > > > > > +/* Used to resume on the CPU we hibernated on */
> > > > > > > > +int hibernate_resume_nonboot_cpu_disable(void);
> > > > > > > > +
> > > > > > > > +asmlinkage void hibernate_restore_image(unsigned long resume_satp, unsigned long satp_temp,
> > > > > > > > +					unsigned long cpu_resume);
> > > > > > > > +asmlinkage int hibernate_core_restore_code(void);
> > > > > > > >  #endif
> > > > > > > > diff --git a/arch/riscv/kernel/Makefile b/arch/riscv/kernel/Makefile
> > > > > > > > index 4cf303a779ab..daab341d55e4 100644
> > > > > > > > --- a/arch/riscv/kernel/Makefile
> > > > > > > > +++ b/arch/riscv/kernel/Makefile
> > > > > > > > @@ -64,6 +64,7 @@ obj-$(CONFIG_MODULES)		+= module.o
> > > > > > > >  obj-$(CONFIG_MODULE_SECTIONS)	+= module-sections.o
> > > > > > > >
> > > > > > > >  obj-$(CONFIG_CPU_PM)		+= suspend_entry.o suspend.o
> > > > > > > > +obj-$(CONFIG_HIBERNATION)	+= hibernate.o hibernate-asm.o
> > > > > > > >
> > > > > > > >  obj-$(CONFIG_FUNCTION_TRACER)	+= mcount.o ftrace.o
> > > > > > > >  obj-$(CONFIG_DYNAMIC_FTRACE)	+= mcount-dyn.o
> > > > > > > > diff --git a/arch/riscv/kernel/asm-offsets.c b/arch/riscv/kernel/asm-offsets.c
> > > > > > > > index df9444397908..d6a75aac1d27 100644
> > > > > > > > --- a/arch/riscv/kernel/asm-offsets.c
> > > > > > > > +++ b/arch/riscv/kernel/asm-offsets.c
> > > > > > > > @@ -9,6 +9,7 @@
> > > > > > > >  #include <linux/kbuild.h>
> > > > > > > >  #include <linux/mm.h>
> > > > > > > >  #include <linux/sched.h>
> > > > > > > > +#include <linux/suspend.h>
> > > > > > > >  #include <asm/kvm_host.h>
> > > > > > > >  #include <asm/thread_info.h>
> > > > > > > >  #include <asm/ptrace.h>
> > > > > > > > @@ -116,6 +117,10 @@ void asm_offsets(void)
> > > > > > > >
> > > > > > > >  	OFFSET(SUSPEND_CONTEXT_REGS, suspend_context, regs);
> > > > > > > >
> > > > > > > > +	OFFSET(HIBERN_PBE_ADDR, pbe, address);
> > > > > > > > +	OFFSET(HIBERN_PBE_ORIG, pbe, orig_address);
> > > > > > > > +	OFFSET(HIBERN_PBE_NEXT, pbe, next);
> > > > > > > > +
> > > > > > > >  	OFFSET(KVM_ARCH_GUEST_ZERO, kvm_vcpu_arch, guest_context.zero);
> > > > > > > >  	OFFSET(KVM_ARCH_GUEST_RA, kvm_vcpu_arch, guest_context.ra);
> > > > > > > >  	OFFSET(KVM_ARCH_GUEST_SP, kvm_vcpu_arch, guest_context.sp);
> > > > > > > > diff --git a/arch/riscv/kernel/hibernate-asm.S b/arch/riscv/kernel/hibernate-asm.S
> > > > > > > > new file mode 100644
> > > > > > > > index 000000000000..846affe4dced
> > > > > > > > --- /dev/null
> > > > > > > > +++ b/arch/riscv/kernel/hibernate-asm.S
> > > > > > > > @@ -0,0 +1,77 @@
> > > > > > > > +/* SPDX-License-Identifier: GPL-2.0-only */
> > > > > > > > +/*
> > > > > > > > + * Hibernation low level support for RISCV.
> > > > > > > > + *
> > > > > > > > + * Copyright (C) 2023 StarFive Technology Co., Ltd.
> > > > > > > > + *
> > > > > > > > + * Author: Jee Heng Sia <jeeheng.sia@starfivetech.com>
> > > > > > > > + */
> > > > > > > > +
> > > > > > > > +#include <asm/asm.h>
> > > > > > > > +#include <asm/asm-offsets.h>
> > > > > > > > +#include <asm/assembler.h>
> > > > > > > > +#include <asm/csr.h>
> > > > > > > > +
> > > > > > > > +#include <linux/linkage.h>
> > > > > > > > +
> > > > > > > > +/*
> > > > > > > > + * int __hibernate_cpu_resume(void)
> > > > > > > > + * Switch back to the hibernated image's page table prior to restoring the CPU
> > > > > > > > + * context.
> > > > > > > > + *
> > > > > > > > + * Always returns 0
> > > > > > > > + */
> > > > > > > > +ENTRY(__hibernate_cpu_resume)
> > > > > > > > +	/* switch to hibernated image's page table. */
> > > > > > > > +	csrw CSR_SATP, s0
> > > > > > > > +	sfence.vma
> > > > > > > > +
> > > > > > > > +	REG_L	a0, hibernate_cpu_context
> > > > > > > > +
> > > > > > > > +	restore_csr
> > > > > > > > +	restore_reg
> > > > > > > > +
> > > > > > > > +	/* Return zero value. */
> > > > > > > > +	add	a0, zero, zero
> > > > > > >
> > > > > > > nit: mv a0, zero
> > > > > > sure
> > > > > > >
> > > > > > > > +
> > > > > > > > +	ret
> > > > > > > > +END(__hibernate_cpu_resume)
> > > > > > > > +
> > > > > > > > +/*
> > > > > > > > + * Prepare to restore the image.
> > > > > > > > + * a0: satp of saved page tables.
> > > > > > > > + * a1: satp of temporary page tables.
> > > > > > > > + * a2: cpu_resume.
> > > > > > > > + */
> > > > > > > > +ENTRY(hibernate_restore_image)
> > > > > > > > +	mv	s0, a0
> > > > > > > > +	mv	s1, a1
> > > > > > > > +	mv	s2, a2
> > > > > > > > +	REG_L	s4, restore_pblist
> > > > > > > > +	REG_L	a1, relocated_restore_code
> > > > > > > > +
> > > > > > > > +	jalr	a1
> > > > > > > > +END(hibernate_restore_image)
> > > > > > > > +
> > > > > > > > +/*
> > > > > > > > + * The below code will be executed from a 'safe' page.
> > > > > > > > + * It first switches to the temporary page table, then starts to copy the pages
> > > > > > > > + * back to the original memory location. Finally, it jumps to __hibernate_cpu_resume()
> > > > > > > > + * to restore the CPU context.
> > > > > > > > + */
> > > > > > > > +ENTRY(hibernate_core_restore_code)
> > > > > > > > +	/* switch to temp page table. */
> > > > > > > > +	csrw satp, s1
> > > > > > > > +	sfence.vma
> > > > > > > > +.Lcopy:
> > > > > > > > +	/* The below code will restore the hibernated image. */
> > > > > > > > +	REG_L	a1, HIBERN_PBE_ADDR(s4)
> > > > > > > > +	REG_L	a0, HIBERN_PBE_ORIG(s4)
> > > > > > >
> > > > > > > Are we sure restore_pblist will never be NULL?
> > > > > > restore_pblist is a link-list, it will be null during initialization or during page clean up by hibernation core. During the initial
> > > resume
> > > > > process, the hibernation core will check the header and load the pages. If everything works correctly, the page will be linked to
> the
> > > > > restore_pblist and then invoke swsusp_arch_resume() else hibernation core will throws error and failed to resume from the
> > > > > hibernated image.
> > > > >
> > > > > I know restore_pblist is a linked-list and this doesn't answer the
> > > > > question. The comment above restore_pblist says
> > > > >
> > > > > /*
> > > > >  * List of PBEs needed for restoring the pages that were allocated before
> > > > >  * the suspend and included in the suspend image, but have also been
> > > > >  * allocated by the "resume" kernel, so their contents cannot be written
> > > > >  * directly to their "original" page frames.
> > > > >  */
> > > > >
> > > > > which implies the pages that end up on this list are "special". My
> > > > > question is whether or not we're guaranteed to have at least one
> > > > > of these special pages. If not, we shouldn't assume s4 is non-null.
> > > > > If so, then a comment stating why that's guaranteed would be nice.
> > > > The restore_pblist will not be null otherwise swsusp_arch_resume wouldn't get invoked. you can find how the link-list are link
> and
> > > how it checks against validity at https://elixir.bootlin.com/linux/v6.2-rc8/source/kernel/power/snapshot.c . " A comment stating
> why
> > > that's guaranteed would be nice" ? Hmm, perhaps this is out of my scope but I do believe in the page validity checking in the link I
> > > shared.
> > >
> > > Sorry, but pointing to an entire source file (one that I've obviously
> > > already looked at, since I quoted a comment from it...) is not helpful.
> > > I don't see where restore_pblist is being checked before
> > > swsusp_arch_resume() is issued (from its callsite in hibernate.c).
> > Sure, below shows the hibernation flow for your reference. The link-list creation and checking found at:
> https://elixir.bootlin.com/linux/v6.2/source/kernel/power/snapshot.c#L2576
> > software_resume()
> > 	load_image_and_restore()
> > 		swsusp_read()
> > 			load_image()
> >  				snapshot_write_next()
> > 					get_buffer() <-- This is the function checks and links the pages to the restore_pblist
> 
> Yup, I've read this path, including get_buffer(), where I saw that
> get_buffer() can return an address without allocating a PBE. Where is the
> check that restore_pblist isn't NULL, i.e. we see that at least one PBE
> has been allocated by get_buffer(), before we call swsusp_arch_resume()?
> 
> Or, is known that at least one or more pages match the criteria pointed
> out in the comment below (copied from get_buffer())?
> 
>         /*
>          * The "original" page frame has not been allocated and we have to
>          * use a "safe" page frame to store the loaded page.
>          */
> 
> If so, then which ones? And where does it state that?
Let's look at the pseudocode below; hopefully it clears your doubt. restore_pblist depends on safe_pages_list and the pbe, and both pointers are checked. I couldn't find a path where restore_pblist would be NULL.
	//Pseudocode to illustrate the image loading
	initialize restore_pblist to null;
	initialize safe_pages_list to null;
	Allocate safe page list, return error if failed;
	load image;
loop:	Create pbe chain, return error if failed;
	assign orig_addr and safe_page to pbe;
	link pbe to restore_pblist;
	return pbe to handle->buffer;
	check handle->buffer;
	goto loop if no error else return with error;
> 
> Thanks,
> drew
> 
> 
> > 		hibernation_restore()
> > 			resume_target_kernel()
> > 				swsusp_arch_resume()
> > >
> > > Thanks,
> > > drew
Sia Jee Heng Feb. 27, 2023, 3:11 a.m. UTC | #11
> -----Original Message-----
> From: Alexandre Ghiti <alex@ghiti.fr>
> Sent: Friday, 24 February, 2023 8:29 PM
> To: JeeHeng Sia <jeeheng.sia@starfivetech.com>; paul.walmsley@sifive.com; palmer@dabbelt.com; aou@eecs.berkeley.edu
> Cc: linux-riscv@lists.infradead.org; linux-kernel@vger.kernel.org; Leyfoon Tan <leyfoon.tan@starfivetech.com>; Mason Huo
> <mason.huo@starfivetech.com>
> Subject: Re: [PATCH v4 4/4] RISC-V: Add arch functions to support hibernation/suspend-to-disk
> 
> On 2/21/23 03:35, Sia Jee Heng wrote:
> > Low level Arch functions were created to support hibernation.
> > swsusp_arch_suspend() relies code from __cpu_suspend_enter() to write
> > cpu state onto the stack, then calling swsusp_save() to save the memory
> > image.
> >
> > Arch specific hibernation header is implemented and is utilized by the
> > arch_hibernation_header_restore() and arch_hibernation_header_save()
> > functions. The arch specific hibernation header consists of satp, hartid,
> > and the cpu_resume address. The kernel built version is also need to be
> > saved into the hibernation image header to making sure only the same
> > kernel is restore when resume.
> >
> > swsusp_arch_resume() creates a temporary page table that covering only
> > the linear map. It copies the restore code to a 'safe' page, then start
> > to restore the memory image. Once completed, it restores the original
> > kernel's page table. It then calls into __hibernate_cpu_resume()
> > to restore the CPU context. Finally, it follows the normal hibernation
> > path back to the hibernation core.
> >
> > To enable hibernation/suspend to disk into RISCV, the below config
> > need to be enabled:
> > - CONFIG_ARCH_HIBERNATION_HEADER
> > - CONFIG_ARCH_HIBERNATION_POSSIBLE
> >
> > Signed-off-by: Sia Jee Heng <jeeheng.sia@starfivetech.com>
> > Reviewed-by: Ley Foon Tan <leyfoon.tan@starfivetech.com>
> > Reviewed-by: Mason Huo <mason.huo@starfivetech.com>
> > ---
> >   arch/riscv/Kconfig                 |   7 +
> >   arch/riscv/include/asm/assembler.h |  20 ++
> >   arch/riscv/include/asm/suspend.h   |  19 ++
> >   arch/riscv/kernel/Makefile         |   1 +
> >   arch/riscv/kernel/asm-offsets.c    |   5 +
> >   arch/riscv/kernel/hibernate-asm.S  |  77 +++++
> >   arch/riscv/kernel/hibernate.c      | 447 +++++++++++++++++++++++++++++
> >   7 files changed, 576 insertions(+)
> >   create mode 100644 arch/riscv/kernel/hibernate-asm.S
> >   create mode 100644 arch/riscv/kernel/hibernate.c
> >
> > diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
> > index e2b656043abf..4555848a817f 100644
> > --- a/arch/riscv/Kconfig
> > +++ b/arch/riscv/Kconfig
> > @@ -690,6 +690,13 @@ menu "Power management options"
> >
> >   source "kernel/power/Kconfig"
> >
> > +config ARCH_HIBERNATION_POSSIBLE
> > +	def_bool y
> > +
> > +config ARCH_HIBERNATION_HEADER
> > +	def_bool y
> > +	depends on HIBERNATION
> > +
> >   endmenu # "Power management options"
> >
> >   menu "CPU Power Management"
> > diff --git a/arch/riscv/include/asm/assembler.h b/arch/riscv/include/asm/assembler.h
> > index 727a97735493..68c46c0e0ea8 100644
> > --- a/arch/riscv/include/asm/assembler.h
> > +++ b/arch/riscv/include/asm/assembler.h
> > @@ -59,4 +59,24 @@
> >   		REG_L	s11, (SUSPEND_CONTEXT_REGS + PT_S11)(a0)
> >   	.endm
> >
> > +/*
> > + * copy_page - copy 1 page (4KB) of data from source to destination
> > + * @a0 - destination
> > + * @a1 - source
> > + */
> > +	.macro	copy_page a0, a1
> > +		lui	a2, 0x1
> > +		add	a2, a2, a0
> > +1 :
> > +		REG_L	t0, 0(a1)
> > +		REG_L	t1, SZREG(a1)
> > +
> > +		REG_S	t0, 0(a0)
> > +		REG_S	t1, SZREG(a0)
> > +
> > +		addi	a0, a0, 2 * SZREG
> > +		addi	a1, a1, 2 * SZREG
> > +		bne	a2, a0, 1b
> > +	.endm
> > +
> >   #endif	/* __ASM_ASSEMBLER_H */
> > diff --git a/arch/riscv/include/asm/suspend.h b/arch/riscv/include/asm/suspend.h
> > index 75419c5ca272..3362da56a9d8 100644
> > --- a/arch/riscv/include/asm/suspend.h
> > +++ b/arch/riscv/include/asm/suspend.h
> > @@ -21,6 +21,11 @@ struct suspend_context {
> >   #endif
> >   };
> >
> > +/*
> > + * Used by hibernation core and cleared during resume sequence
> > + */
> > +extern int in_suspend;
> > +
> >   /* Low-level CPU suspend entry function */
> >   int __cpu_suspend_enter(struct suspend_context *context);
> >
> > @@ -36,4 +41,18 @@ int __cpu_resume_enter(unsigned long hartid, unsigned long context);
> >   /* Used to save and restore the csr */
> >   void suspend_save_csrs(struct suspend_context *context);
> >   void suspend_restore_csrs(struct suspend_context *context);
> > +
> > +/* Low-level API to support hibernation */
> > +int swsusp_arch_suspend(void);
> > +int swsusp_arch_resume(void);
> > +int arch_hibernation_header_save(void *addr, unsigned int max_size);
> > +int arch_hibernation_header_restore(void *addr);
> > +int __hibernate_cpu_resume(void);
> > +
> > +/* Used to resume on the CPU we hibernated on */
> > +int hibernate_resume_nonboot_cpu_disable(void);
> > +
> > +asmlinkage void hibernate_restore_image(unsigned long resume_satp, unsigned long satp_temp,
> > +					unsigned long cpu_resume);
> > +asmlinkage int hibernate_core_restore_code(void);
> >   #endif
> > diff --git a/arch/riscv/kernel/Makefile b/arch/riscv/kernel/Makefile
> > index 4cf303a779ab..daab341d55e4 100644
> > --- a/arch/riscv/kernel/Makefile
> > +++ b/arch/riscv/kernel/Makefile
> > @@ -64,6 +64,7 @@ obj-$(CONFIG_MODULES)		+= module.o
> >   obj-$(CONFIG_MODULE_SECTIONS)	+= module-sections.o
> >
> >   obj-$(CONFIG_CPU_PM)		+= suspend_entry.o suspend.o
> > +obj-$(CONFIG_HIBERNATION)	+= hibernate.o hibernate-asm.o
> >
> >   obj-$(CONFIG_FUNCTION_TRACER)	+= mcount.o ftrace.o
> >   obj-$(CONFIG_DYNAMIC_FTRACE)	+= mcount-dyn.o
> > diff --git a/arch/riscv/kernel/asm-offsets.c b/arch/riscv/kernel/asm-offsets.c
> > index df9444397908..d6a75aac1d27 100644
> > --- a/arch/riscv/kernel/asm-offsets.c
> > +++ b/arch/riscv/kernel/asm-offsets.c
> > @@ -9,6 +9,7 @@
> >   #include <linux/kbuild.h>
> >   #include <linux/mm.h>
> >   #include <linux/sched.h>
> > +#include <linux/suspend.h>
> >   #include <asm/kvm_host.h>
> >   #include <asm/thread_info.h>
> >   #include <asm/ptrace.h>
> > @@ -116,6 +117,10 @@ void asm_offsets(void)
> >
> >   	OFFSET(SUSPEND_CONTEXT_REGS, suspend_context, regs);
> >
> > +	OFFSET(HIBERN_PBE_ADDR, pbe, address);
> > +	OFFSET(HIBERN_PBE_ORIG, pbe, orig_address);
> > +	OFFSET(HIBERN_PBE_NEXT, pbe, next);
> > +
> >   	OFFSET(KVM_ARCH_GUEST_ZERO, kvm_vcpu_arch, guest_context.zero);
> >   	OFFSET(KVM_ARCH_GUEST_RA, kvm_vcpu_arch, guest_context.ra);
> >   	OFFSET(KVM_ARCH_GUEST_SP, kvm_vcpu_arch, guest_context.sp);
> > diff --git a/arch/riscv/kernel/hibernate-asm.S b/arch/riscv/kernel/hibernate-asm.S
> > new file mode 100644
> > index 000000000000..846affe4dced
> > --- /dev/null
> > +++ b/arch/riscv/kernel/hibernate-asm.S
> > @@ -0,0 +1,77 @@
> > +/* SPDX-License-Identifier: GPL-2.0-only */
> > +/*
> > + * Hibernation low level support for RISCV.
> > + *
> > + * Copyright (C) 2023 StarFive Technology Co., Ltd.
> > + *
> > + * Author: Jee Heng Sia <jeeheng.sia@starfivetech.com>
> > + */
> > +
> > +#include <asm/asm.h>
> > +#include <asm/asm-offsets.h>
> > +#include <asm/assembler.h>
> > +#include <asm/csr.h>
> > +
> > +#include <linux/linkage.h>
> > +
> > +/*
> > + * int __hibernate_cpu_resume(void)
> > + * Switch back to the hibernated image's page table prior to restoring the CPU
> > + * context.
> > + *
> > + * Always returns 0
> > + */
> > +ENTRY(__hibernate_cpu_resume)
> > +	/* switch to hibernated image's page table. */
> > +	csrw CSR_SATP, s0
> > +	sfence.vma
> > +
> > +	REG_L	a0, hibernate_cpu_context
> > +
> > +	restore_csr
> > +	restore_reg
> > +
> > +	/* Return zero value. */
> > +	add	a0, zero, zero
> > +
> > +	ret
> > +END(__hibernate_cpu_resume)
> > +
> > +/*
> > + * Prepare to restore the image.
> > + * a0: satp of saved page tables.
> > + * a1: satp of temporary page tables.
> > + * a2: cpu_resume.
> > + */
> > +ENTRY(hibernate_restore_image)
> > +	mv	s0, a0
> > +	mv	s1, a1
> > +	mv	s2, a2
> > +	REG_L	s4, restore_pblist
> > +	REG_L	a1, relocated_restore_code
> > +
> > +	jalr	a1
> > +END(hibernate_restore_image)
> > +
> > +/*
> > + * The below code will be executed from a 'safe' page.
> > + * It first switches to the temporary page table, then starts to copy the pages
> > + * back to the original memory location. Finally, it jumps to __hibernate_cpu_resume()
> > + * to restore the CPU context.
> > + */
> > +ENTRY(hibernate_core_restore_code)
> > +	/* switch to temp page table. */
> > +	csrw satp, s1
> > +	sfence.vma
> > +.Lcopy:
> > +	/* The below code will restore the hibernated image. */
> > +	REG_L	a1, HIBERN_PBE_ADDR(s4)
> > +	REG_L	a0, HIBERN_PBE_ORIG(s4)
> > +
> > +	copy_page a0, a1
> > +
> > +	REG_L	s4, HIBERN_PBE_NEXT(s4)
> > +	bnez	s4, .Lcopy
> > +
> > +	jalr	s2
> > +END(hibernate_core_restore_code)
> > diff --git a/arch/riscv/kernel/hibernate.c b/arch/riscv/kernel/hibernate.c
> > new file mode 100644
> > index 000000000000..46a2f470db6e
> > --- /dev/null
> > +++ b/arch/riscv/kernel/hibernate.c
> > @@ -0,0 +1,447 @@
> > +// SPDX-License-Identifier: GPL-2.0-only
> > +/*
> > + * Hibernation support for RISCV
> > + *
> > + * Copyright (C) 2023 StarFive Technology Co., Ltd.
> > + *
> > + * Author: Jee Heng Sia <jeeheng.sia@starfivetech.com>
> > + */
> > +
> > +#include <asm/barrier.h>
> > +#include <asm/cacheflush.h>
> > +#include <asm/mmu_context.h>
> > +#include <asm/page.h>
> > +#include <asm/pgalloc.h>
> > +#include <asm/pgtable.h>
> > +#include <asm/sections.h>
> > +#include <asm/set_memory.h>
> > +#include <asm/smp.h>
> > +#include <asm/suspend.h>
> > +
> > +#include <linux/cpu.h>
> > +#include <linux/memblock.h>
> > +#include <linux/pm.h>
> > +#include <linux/sched.h>
> > +#include <linux/suspend.h>
> > +#include <linux/utsname.h>
> > +
> > +/* The logical cpu number we should resume on, initialised to a non-cpu number. */
> > +static int sleep_cpu = -EINVAL;
> > +
> > +/* Pointer to the temporary resume page table. */
> > +static pgd_t *resume_pg_dir;
> > +
> > +/* CPU context to be saved. */
> > +struct suspend_context *hibernate_cpu_context;
> > +EXPORT_SYMBOL_GPL(hibernate_cpu_context);
> > +
> > +unsigned long relocated_restore_code;
> > +EXPORT_SYMBOL_GPL(relocated_restore_code);
> > +
> > +/**
> > + * struct arch_hibernate_hdr_invariants - container to store kernel build version.
> > + * @uts_version: to save the build number and date so that the we do not resume with
> > + *		a different kernel.
> > + */
> > +struct arch_hibernate_hdr_invariants {
> > +	char		uts_version[__NEW_UTS_LEN + 1];
> > +};
> > +
> > +/**
> > + * struct arch_hibernate_hdr - helper parameters that help us to restore the image.
> > + * @invariants: container to store kernel build version.
> > + * @hartid: to make sure same boot_cpu executes the hibernate/restore code.
> > + * @saved_satp: original page table used by the hibernated image.
> > + * @restore_cpu_addr: the kernel's image address to restore the CPU context.
> > + */
> > +static struct arch_hibernate_hdr {
> > +	struct arch_hibernate_hdr_invariants invariants;
> > +	unsigned long	hartid;
> > +	unsigned long	saved_satp;
> > +	unsigned long	restore_cpu_addr;
> > +} resume_hdr;
> > +
> > +static inline void arch_hdr_invariants(struct arch_hibernate_hdr_invariants *i)
> > +{
> > +	memset(i, 0, sizeof(*i));
> > +	memcpy(i->uts_version, init_utsname()->version, sizeof(i->uts_version));
> > +}
> > +
> > +/*
> > + * Check if the given pfn is in the 'nosave' section.
> > + */
> > +int pfn_is_nosave(unsigned long pfn)
> > +{
> > +	unsigned long nosave_begin_pfn = sym_to_pfn(&__nosave_begin);
> > +	unsigned long nosave_end_pfn = sym_to_pfn(&__nosave_end - 1);
> > +
> > +	return ((pfn >= nosave_begin_pfn) && (pfn <= nosave_end_pfn));
> > +}
> > +
> > +void notrace save_processor_state(void)
> > +{
> > +	WARN_ON(num_online_cpus() != 1);
> > +}
> > +
> > +void notrace restore_processor_state(void)
> > +{
> > +}
> > +
> > +/*
> > + * Helper parameters need to be saved to the hibernation image header.
> > + */
> > +int arch_hibernation_header_save(void *addr, unsigned int max_size)
> > +{
> > +	struct arch_hibernate_hdr *hdr = addr;
> > +
> > +	if (max_size < sizeof(*hdr))
> > +		return -EOVERFLOW;
> > +
> > +	arch_hdr_invariants(&hdr->invariants);
> > +
> > +	hdr->hartid = cpuid_to_hartid_map(sleep_cpu);
> > +	hdr->saved_satp = csr_read(CSR_SATP);
> > +	hdr->restore_cpu_addr = (unsigned long)__hibernate_cpu_resume;
> > +
> > +	return 0;
> > +}
> > +EXPORT_SYMBOL_GPL(arch_hibernation_header_save);
> > +
> > +/*
> > + * Retrieve the helper parameters from the hibernation image header.
> > + */
> > +int arch_hibernation_header_restore(void *addr)
> > +{
> > +	struct arch_hibernate_hdr_invariants invariants;
> > +	struct arch_hibernate_hdr *hdr = addr;
> > +	int ret = 0;
> > +
> > +	arch_hdr_invariants(&invariants);
> > +
> > +	if (memcmp(&hdr->invariants, &invariants, sizeof(invariants))) {
> > +		pr_crit("Hibernate image not generated by this kernel!\n");
> > +		return -EINVAL;
> > +	}
> > +
> > +	sleep_cpu = riscv_hartid_to_cpuid(hdr->hartid);
> > +	if (sleep_cpu < 0) {
> > +		pr_crit("Hibernated on a CPU not known to this kernel!\n");
> > +		sleep_cpu = -EINVAL;
> > +		return -EINVAL;
> > +	}
> > +
> > +#ifdef CONFIG_SMP
> > +	ret = bringup_hibernate_cpu(sleep_cpu);
> > +	if (ret) {
> > +		sleep_cpu = -EINVAL;
> > +		return ret;
> > +	}
> > +#endif
> > +	resume_hdr = *hdr;
> > +
> > +	return ret;
> > +}
> > +EXPORT_SYMBOL_GPL(arch_hibernation_header_restore);
> > +
> > +int swsusp_arch_suspend(void)
> > +{
> > +	int ret = 0;
> > +
> > +	if (__cpu_suspend_enter(hibernate_cpu_context)) {
> > +		sleep_cpu = smp_processor_id();
> > +		suspend_save_csrs(hibernate_cpu_context);
> > +		ret = swsusp_save();
> > +	} else {
> > +		suspend_restore_csrs(hibernate_cpu_context);
> > +		flush_tlb_all();
> > +		flush_icache_all();
> > +
> > +		/*
> > +		 * Tell the hibernation core that we've just restored the memory.
> > +		 */
> > +		in_suspend = 0;
> > +		sleep_cpu = -EINVAL;
> > +	}
> > +
> > +	return ret;
> > +}
> > +
> > +static unsigned long _temp_pgtable_map_pte(pte_t *dst_ptep, pte_t *src_ptep,
> > +					   unsigned long addr, pgprot_t prot)
> > +{
> > +	pte_t pte = READ_ONCE(*src_ptep);
> > +
> > +	if (pte_present(pte))
> > +		set_pte(dst_ptep, __pte(pte_val(pte) | pgprot_val(prot)));
> > +
> > +	return 0;
> > +}
> 
> 
> I don't see the need for this function
Sure, can remove it.
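
E.g. with the helper above folded into the caller, the loop in
temp_pgtable_map_pte() would just read (sketch):

	do {
		pte_t pte = READ_ONCE(*src_ptep);

		if (pte_present(pte))
			set_pte(dst_ptep, __pte(pte_val(pte) | pgprot_val(prot)));
	} while (dst_ptep++, src_ptep++, addr += PAGE_SIZE, addr < end);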
> 
> 
> > +
> > +static unsigned long temp_pgtable_map_pte(pmd_t *dst_pmdp, pmd_t *src_pmdp,
> > +					  unsigned long start, unsigned long end,
> > +					  pgprot_t prot)
> > +{
> > +	unsigned long addr = start;
> > +	pte_t *src_ptep;
> > +	pte_t *dst_ptep;
> > +
> > +	if (pmd_none(READ_ONCE(*dst_pmdp))) {
> > +		dst_ptep = (pte_t *)get_safe_page(GFP_ATOMIC);
> > +		if (!dst_ptep)
> > +			return -ENOMEM;
> > +
> > +		pmd_populate_kernel(NULL, dst_pmdp, dst_ptep);
> > +	}
> > +
> > +	dst_ptep = pte_offset_kernel(dst_pmdp, start);
> > +	src_ptep = pte_offset_kernel(src_pmdp, start);
> > +
> > +	do {
> > +		_temp_pgtable_map_pte(dst_ptep, src_ptep, addr, prot);
> > +	} while (dst_ptep++, src_ptep++, addr += PAGE_SIZE, addr < end);
> > +
> > +	return 0;
> > +}
> > +
> > +static unsigned long temp_pgtable_map_pmd(pud_t *dst_pudp, pud_t *src_pudp,
> > +					  unsigned long start, unsigned long end,
> > +					  pgprot_t prot)
> > +{
> > +	unsigned long addr = start;
> > +	unsigned long next;
> > +	unsigned long ret;
> > +	pmd_t *src_pmdp;
> > +	pmd_t *dst_pmdp;
> > +
> > +	if (pud_none(READ_ONCE(*dst_pudp))) {
> > +		dst_pmdp = (pmd_t *)get_safe_page(GFP_ATOMIC);
> > +		if (!dst_pmdp)
> > +			return -ENOMEM;
> > +
> > +		pud_populate(NULL, dst_pudp, dst_pmdp);
> > +	}
> > +
> > +	dst_pmdp = pmd_offset(dst_pudp, start);
> > +	src_pmdp = pmd_offset(src_pudp, start);
> > +
> > +	do {
> > +		pmd_t pmd = READ_ONCE(*src_pmdp);
> > +
> > +		next = pmd_addr_end(addr, end);
> > +
> > +		if (pmd_none(pmd))
> > +			continue;
> > +
> > +		if (pmd_leaf(pmd)) {
> > +			set_pmd(dst_pmdp, __pmd(pmd_val(pmd) | pgprot_val(prot)));
> > +		} else {
> > +			ret = temp_pgtable_map_pte(dst_pmdp, src_pmdp, addr, next, prot);
> > +			if (ret)
> > +				return -ENOMEM;
> > +		}
> > +	} while (dst_pmdp++, src_pmdp++, addr = next, addr != end);
> > +
> > +	return 0;
> > +}
> > +
> > +static unsigned long temp_pgtable_map_pud(p4d_t *dst_p4dp, p4d_t *src_p4dp,
> > +					  unsigned long start,
> > +					  unsigned long end, pgprot_t prot)
> > +{
> > +	unsigned long addr = start;
> > +	unsigned long next;
> > +	unsigned long ret;
> > +	pud_t *dst_pudp;
> > +	pud_t *src_pudp;
> > +
> > +	if (p4d_none(READ_ONCE(*dst_p4dp))) {
> > +		dst_pudp = (pud_t *)get_safe_page(GFP_ATOMIC);
> > +		if (!dst_pudp)
> > +			return -ENOMEM;
> > +
> > +		p4d_populate(NULL, dst_p4dp, dst_pudp);
> > +	}
> > +
> > +	dst_pudp = pud_offset(dst_p4dp, start);
> > +	src_pudp = pud_offset(src_p4dp, start);
> > +
> > +	do {
> > +		pud_t pud = READ_ONCE(*src_pudp);
> > +
> > +		next = pud_addr_end(addr, end);
> > +
> > +		if (pud_none(pud))
> > +			continue;
> > +
> > +		if (pud_leaf(pud)) {
> > +			set_pud(dst_pudp, __pud(pud_val(pud) | pgprot_val(prot)));
> > +		} else {
> > +			ret = temp_pgtable_map_pmd(dst_pudp, src_pudp, addr, next, prot);
> > +			if (ret)
> > +				return -ENOMEM;
> > +		}
> > +	} while (dst_pudp++, src_pudp++, addr = next, addr != end);
> > +
> > +	return 0;
> > +}
> > +
> > +static unsigned long temp_pgtable_map_p4d(pgd_t *dst_pgdp, pgd_t *src_pgdp,
> > +					  unsigned long start, unsigned long end,
> > +					  pgprot_t prot)
> > +{
> > +	unsigned long addr = start;
> 
> 
> Nit: you don't need the addr variable; you can rename start to addr
> and work with it directly.
sure.
> 
> 
> > +	unsigned long next;
> > +	unsigned long ret;
> > +	p4d_t *dst_p4dp;
> > +	p4d_t *src_p4dp;
> > +
> > +	if (pgd_none(READ_ONCE(*dst_pgdp))) {
> > +		dst_p4dp = (p4d_t *)get_safe_page(GFP_ATOMIC);
> > +		if (!dst_p4dp)
> > +			return -ENOMEM;
> > +
> > +		pgd_populate(NULL, dst_pgdp, dst_p4dp);
> > +	}
> > +
> > +	dst_p4dp = p4d_offset(dst_pgdp, start);
> > +	src_p4dp = p4d_offset(src_pgdp, start);
> > +
> > +	do {
> > +		p4d_t p4d = READ_ONCE(*src_p4dp);
> > +
> > +		next = p4d_addr_end(addr, end);
> > +
> > +		if (p4d_none(READ_ONCE(*src_p4dp)))
> 
> 
> You should use p4d here: p4d_none(p4d)
sure
> 
> 
> > +			continue;
> > +
> > +		if (p4d_leaf(p4d)) {
> > +			set_p4d(dst_p4dp, __p4d(p4d_val(p4d) | pgprot_val(prot)));
> 
> 
> The "| pgprot_val(prot)" happens to work because PAGE_KERNEL will add
> the PAGE_WRITE bit: I'd rather make it more clear by explicitly add
> PAGE_WRITE.
sure, this can be done.
> 
> 
> > +		} else {
> > +			ret = temp_pgtable_map_pud(dst_p4dp, src_p4dp, addr, next, prot);
> > +			if (ret)
> > +				return -ENOMEM;
> > +		}
> > +	} while (dst_p4dp++, src_p4dp++, addr = next, addr != end);
> > +
> > +	return 0;
> > +}
> > +
> > +static unsigned long temp_pgtable_mapping(pgd_t *pgdp)
> > +{
> > +	unsigned long end = (unsigned long)pfn_to_virt(max_low_pfn);
> > +	unsigned long addr = PAGE_OFFSET;
> > +	pgd_t *dst_pgdp = pgd_offset_pgd(pgdp, addr);
> > +	pgd_t *src_pgdp = pgd_offset_k(addr);
> > +	unsigned long next;
> > +
> > +	do {
> > +		next = pgd_addr_end(addr, end);
> > +		if (pgd_none(READ_ONCE(*src_pgdp)))
> > +			continue;
> > +
> 
> 
> We added the pgd_leaf test in kernel_page_present, let's add it here too.
sure.
> 
> 
> > +		if (temp_pgtable_map_p4d(dst_pgdp, src_pgdp, addr, next, PAGE_KERNEL))
> > +			return -ENOMEM;
> > +	} while (dst_pgdp++, src_pgdp++, addr = next, addr != end);
> > +
> > +	return 0;
> > +}
> > +
> > +static unsigned long temp_pgtable_text_mapping(pgd_t *pgdp, unsigned long addr)
> > +{
> > +	pgd_t *dst_pgdp = pgd_offset_pgd(pgdp, addr);
> > +	pgd_t *src_pgdp = pgd_offset_k(addr);
> > +
> > +	if (pgd_none(READ_ONCE(*src_pgdp)))
> > +		return -EFAULT;
> > +
> > +	if (temp_pgtable_map_p4d(dst_pgdp, src_pgdp, addr, addr, PAGE_KERNEL_EXEC))
> > +		return -ENOMEM;
> > +
> > +	return 0;
> > +}
> 
> 
> Ok so if we fall into a huge mapping, you add the exec permission to the
> whole range, which could easily be 1GB. I think that either we can avoid
> this step by mapping the whole linear mapping as executable, or we
> actually use another pgd entry for this page that is not in the linear
> mapping. The latter seems cleaner, what do you think?
We can map the whole linear range as writable & executable; by doing this we avoid having to remap the linear map again.
We still need to use the same pgd entries for the non-linear (kernel) mapping, just like swapper_pg_dir does (the linear and non-linear addresses are within the range covered by the pgd).
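
Concretely, that would mainly mean building the temporary linear map with
PAGE_KERNEL_EXEC instead of PAGE_KERNEL in temp_pgtable_mapping(), e.g.
(untested sketch):

	if (temp_pgtable_map_p4d(dst_pgdp, src_pgdp, addr, next, PAGE_KERNEL_EXEC))
		return -ENOMEM;

so the separate exec mapping step would only remain for the
__hibernate_cpu_resume() address in the kernel mapping, not for the
relocated restore code page in the linear map.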
> 
> 
> > +
> > +static unsigned long relocate_restore_code(void)
> > +{
> > +	unsigned long ret;
> > +	void *page = (void *)get_safe_page(GFP_ATOMIC);
> > +
> > +	if (!page)
> > +		return -ENOMEM;
> > +
> > +	copy_page(page, hibernate_core_restore_code);
> > +
> > +	/* Make the page containing the relocated code executable. */
> > +	set_memory_x((unsigned long)page, 1);
> > +
> > +	ret = temp_pgtable_text_mapping(resume_pg_dir, (unsigned long)page);
> > +	if (ret)
> > +		return ret;
> > +
> > +	return (unsigned long)page;
> > +}
> > +
> > +int swsusp_arch_resume(void)
> > +{
> > +	unsigned long ret;
> > +
> > +	/*
> > +	 * Memory allocated by get_safe_page() will be dealt with by the hibernation core,
> > +	 * we don't need to free it here.
> > +	 */
> > +	resume_pg_dir = (pgd_t *)get_safe_page(GFP_ATOMIC);
> > +	if (!resume_pg_dir)
> > +		return -ENOMEM;
> > +
> > +	/*
> > +	 * The pages need to be writable when restoring the image.
> > +	 * Create a second copy of page table just for the linear map.
> > +	 * Use this temporary page table to restore the image.
> > +	 */
> > +	ret = temp_pgtable_mapping(resume_pg_dir);
> > +	if (ret)
> > +		return (int)ret;
> 
> 
> The temp_pgtable* functions should return an int to avoid this cast.
> 
> 
> > +
> > +	/* Move the restore code to a new page so that it doesn't get overwritten by itself. */
> > +	relocated_restore_code = relocate_restore_code();
> > +	if (relocated_restore_code == -ENOMEM)
> > +		return -ENOMEM;
> > +
> > +	/*
> > +	 * Map the __hibernate_cpu_resume() address to the temporary page table so that the
> > +	 * restore code can jumps to it after finished restore the image. The next execution
> > +	 * code doesn't find itself in a different address space after switching over to the
> > +	 * original page table used by the hibernated image.
> > +	 */
> > +	ret = temp_pgtable_text_mapping(resume_pg_dir, (unsigned long)resume_hdr.restore_cpu_addr);
> > +	if (ret)
> > +		return ret;
> > +
> > +	hibernate_restore_image(resume_hdr.saved_satp, (PFN_DOWN(__pa(resume_pg_dir)) | satp_mode),
> > +				resume_hdr.restore_cpu_addr);
> > +
> > +	return 0;
> > +}
> > +
> > +#ifdef CONFIG_PM_SLEEP_SMP
> > +int hibernate_resume_nonboot_cpu_disable(void)
> > +{
> > +	if (sleep_cpu < 0) {
> > +		pr_err("Failing to resume from hibernate on an unknown CPU\n");
> > +		return -ENODEV;
> > +	}
> > +
> > +	return freeze_secondary_cpus(sleep_cpu);
> > +}
> > +#endif
> > +
> > +static int __init riscv_hibernate_init(void)
> > +{
> > +	hibernate_cpu_context = kzalloc(sizeof(*hibernate_cpu_context), GFP_KERNEL);
> > +
> > +	if (WARN_ON(!hibernate_cpu_context))
> > +		return -ENOMEM;
> > +
> > +	return 0;
> > +}
> > +
> > +early_initcall(riscv_hibernate_init);
> 
> 
> Overall, it is now nicer with the proper page table walk: but we can
> now see that the code is exactly the same as arm64's, so what prevents
> us from merging both somewhere in mm/?
1. The low-level page table bit definitions are not the same.
2. The code would need to be refactored for both riscv and arm64.
3. The solution would need to be verified on both riscv and arm64 platforms (needs someone with arm64 expertise).
4. The function might need to be extended to support other arches.
5. Overall, it is doable, but the effort to address the above is huge.

>
Andrew Jones Feb. 27, 2023, 7:59 a.m. UTC | #12
On Mon, Feb 27, 2023 at 02:14:27AM +0000, JeeHeng Sia wrote:
> 
> 
> > -----Original Message-----
> > From: Andrew Jones <ajones@ventanamicro.com>
> > Sent: Friday, 24 February, 2023 8:07 PM
> > To: JeeHeng Sia <jeeheng.sia@starfivetech.com>
> > Cc: paul.walmsley@sifive.com; palmer@dabbelt.com; aou@eecs.berkeley.edu; linux-riscv@lists.infradead.org; linux-
> > kernel@vger.kernel.org; Leyfoon Tan <leyfoon.tan@starfivetech.com>; Mason Huo <mason.huo@starfivetech.com>
> > Subject: Re: [PATCH v4 4/4] RISC-V: Add arch functions to support hibernation/suspend-to-disk
> > 
> > On Fri, Feb 24, 2023 at 10:30:19AM +0000, JeeHeng Sia wrote:
> > >
> > >
> > > > -----Original Message-----
> > > > From: Andrew Jones <ajones@ventanamicro.com>
> > > > Sent: Friday, 24 February, 2023 5:55 PM
> > > > To: JeeHeng Sia <jeeheng.sia@starfivetech.com>
> > > > Cc: paul.walmsley@sifive.com; palmer@dabbelt.com; aou@eecs.berkeley.edu; linux-riscv@lists.infradead.org; linux-
> > > > kernel@vger.kernel.org; Leyfoon Tan <leyfoon.tan@starfivetech.com>; Mason Huo <mason.huo@starfivetech.com>
> > > > Subject: Re: [PATCH v4 4/4] RISC-V: Add arch functions to support hibernation/suspend-to-disk
> > > >
> > > > On Fri, Feb 24, 2023 at 09:33:31AM +0000, JeeHeng Sia wrote:
> > > > >
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: Andrew Jones <ajones@ventanamicro.com>
> > > > > > Sent: Friday, 24 February, 2023 5:00 PM
> > > > > > To: JeeHeng Sia <jeeheng.sia@starfivetech.com>
> > > > > > Cc: paul.walmsley@sifive.com; palmer@dabbelt.com; aou@eecs.berkeley.edu; linux-riscv@lists.infradead.org; linux-
> > > > > > kernel@vger.kernel.org; Leyfoon Tan <leyfoon.tan@starfivetech.com>; Mason Huo <mason.huo@starfivetech.com>
> > > > > > Subject: Re: [PATCH v4 4/4] RISC-V: Add arch functions to support hibernation/suspend-to-disk
> > > > > >
> > > > > > On Fri, Feb 24, 2023 at 02:05:43AM +0000, JeeHeng Sia wrote:
> > > > > > >
> > > > > > >
> > > > > > > > -----Original Message-----
> > > > > > > > From: Andrew Jones <ajones@ventanamicro.com>
> > > > > > > > Sent: Friday, 24 February, 2023 2:07 AM
> > > > > > > > To: JeeHeng Sia <jeeheng.sia@starfivetech.com>
> > > > > > > > Cc: paul.walmsley@sifive.com; palmer@dabbelt.com; aou@eecs.berkeley.edu; linux-riscv@lists.infradead.org; linux-
> > > > > > > > kernel@vger.kernel.org; Leyfoon Tan <leyfoon.tan@starfivetech.com>; Mason Huo <mason.huo@starfivetech.com>
> > > > > > > > Subject: Re: [PATCH v4 4/4] RISC-V: Add arch functions to support hibernation/suspend-to-disk
> > > > > > > >
> > > > > > > > On Tue, Feb 21, 2023 at 10:35:23AM +0800, Sia Jee Heng wrote:
> > > > > > > > > Low level Arch functions were created to support hibernation.
> > > > > > > > > swsusp_arch_suspend() relies code from __cpu_suspend_enter() to write
> > > > > > > > > cpu state onto the stack, then calling swsusp_save() to save the memory
> > > > > > > > > image.
> > > > > > > > >
> > > > > > > > > Arch specific hibernation header is implemented and is utilized by the
> > > > > > > > > arch_hibernation_header_restore() and arch_hibernation_header_save()
> > > > > > > > > functions. The arch specific hibernation header consists of satp, hartid,
> > > > > > > > > and the cpu_resume address. The kernel built version is also need to be
> > > > > > > > > saved into the hibernation image header to making sure only the same
> > > > > > > > > kernel is restore when resume.
> > > > > > > > >
> > > > > > > > > swsusp_arch_resume() creates a temporary page table that covering only
> > > > > > > > > the linear map. It copies the restore code to a 'safe' page, then start
> > > > > > > > > to restore the memory image. Once completed, it restores the original
> > > > > > > > > kernel's page table. It then calls into __hibernate_cpu_resume()
> > > > > > > > > to restore the CPU context. Finally, it follows the normal hibernation
> > > > > > > > > path back to the hibernation core.
> > > > > > > > >
> > > > > > > > > To enable hibernation/suspend to disk into RISCV, the below config
> > > > > > > > > need to be enabled:
> > > > > > > > > - CONFIG_ARCH_HIBERNATION_HEADER
> > > > > > > > > - CONFIG_ARCH_HIBERNATION_POSSIBLE
> > > > > > > > >
> > > > > > > > > Signed-off-by: Sia Jee Heng <jeeheng.sia@starfivetech.com>
> > > > > > > > > Reviewed-by: Ley Foon Tan <leyfoon.tan@starfivetech.com>
> > > > > > > > > Reviewed-by: Mason Huo <mason.huo@starfivetech.com>
> > > > > > > > > ---
> > > > > > > > >  arch/riscv/Kconfig                 |   7 +
> > > > > > > > >  arch/riscv/include/asm/assembler.h |  20 ++
> > > > > > > > >  arch/riscv/include/asm/suspend.h   |  19 ++
> > > > > > > > >  arch/riscv/kernel/Makefile         |   1 +
> > > > > > > > >  arch/riscv/kernel/asm-offsets.c    |   5 +
> > > > > > > > >  arch/riscv/kernel/hibernate-asm.S  |  77 +++++
> > > > > > > > >  arch/riscv/kernel/hibernate.c      | 447 +++++++++++++++++++++++++++++
> > > > > > > > >  7 files changed, 576 insertions(+)
> > > > > > > > >  create mode 100644 arch/riscv/kernel/hibernate-asm.S
> > > > > > > > >  create mode 100644 arch/riscv/kernel/hibernate.c
> > > > > > > > >
> > > > > > > > > diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
> > > > > > > > > index e2b656043abf..4555848a817f 100644
> > > > > > > > > --- a/arch/riscv/Kconfig
> > > > > > > > > +++ b/arch/riscv/Kconfig
> > > > > > > > > @@ -690,6 +690,13 @@ menu "Power management options"
> > > > > > > > >
> > > > > > > > >  source "kernel/power/Kconfig"
> > > > > > > > >
> > > > > > > > > +config ARCH_HIBERNATION_POSSIBLE
> > > > > > > > > +	def_bool y
> > > > > > > > > +
> > > > > > > > > +config ARCH_HIBERNATION_HEADER
> > > > > > > > > +	def_bool y
> > > > > > > > > +	depends on HIBERNATION
> > > > > > > >
> > > > > > > > nit: I think this can be simplified as def_bool HIBERNATION
> > > > > > > good suggestion. will change it.
> > > > > > > >
> > > > > > > > > +
> > > > > > > > >  endmenu # "Power management options"
> > > > > > > > >
> > > > > > > > >  menu "CPU Power Management"
> > > > > > > > > diff --git a/arch/riscv/include/asm/assembler.h b/arch/riscv/include/asm/assembler.h
> > > > > > > > > index 727a97735493..68c46c0e0ea8 100644
> > > > > > > > > --- a/arch/riscv/include/asm/assembler.h
> > > > > > > > > +++ b/arch/riscv/include/asm/assembler.h
> > > > > > > > > @@ -59,4 +59,24 @@
> > > > > > > > >  		REG_L	s11, (SUSPEND_CONTEXT_REGS + PT_S11)(a0)
> > > > > > > > >  	.endm
> > > > > > > > >
> > > > > > > > > +/*
> > > > > > > > > + * copy_page - copy 1 page (4KB) of data from source to destination
> > > > > > > > > + * @a0 - destination
> > > > > > > > > + * @a1 - source
> > > > > > > > > + */
> > > > > > > > > +	.macro	copy_page a0, a1
> > > > > > > > > +		lui	a2, 0x1
> > > > > > > > > +		add	a2, a2, a0
> > > > > > > > > +1 :
> > > > > > > >     ^ please remove this space
> > > > > > > can't remove it otherwise checkpatch will throws ERROR: spaces required around that ':'
> > > > > >
> > > > > > Oh, right, labels in macros have this requirement.
> > > > > >
> > > > > > > >
> > > > > > > > > +		REG_L	t0, 0(a1)
> > > > > > > > > +		REG_L	t1, SZREG(a1)
> > > > > > > > > +
> > > > > > > > > +		REG_S	t0, 0(a0)
> > > > > > > > > +		REG_S	t1, SZREG(a0)
> > > > > > > > > +
> > > > > > > > > +		addi	a0, a0, 2 * SZREG
> > > > > > > > > +		addi	a1, a1, 2 * SZREG
> > > > > > > > > +		bne	a2, a0, 1b
> > > > > > > > > +	.endm
> > > > > > > > > +
> > > > > > > > >  #endif	/* __ASM_ASSEMBLER_H */
> > > > > > > > > diff --git a/arch/riscv/include/asm/suspend.h b/arch/riscv/include/asm/suspend.h
> > > > > > > > > index 75419c5ca272..3362da56a9d8 100644
> > > > > > > > > --- a/arch/riscv/include/asm/suspend.h
> > > > > > > > > +++ b/arch/riscv/include/asm/suspend.h
> > > > > > > > > @@ -21,6 +21,11 @@ struct suspend_context {
> > > > > > > > >  #endif
> > > > > > > > >  };
> > > > > > > > >
> > > > > > > > > +/*
> > > > > > > > > + * Used by hibernation core and cleared during resume sequence
> > > > > > > > > + */
> > > > > > > > > +extern int in_suspend;
> > > > > > > > > +
> > > > > > > > >  /* Low-level CPU suspend entry function */
> > > > > > > > >  int __cpu_suspend_enter(struct suspend_context *context);
> > > > > > > > >
> > > > > > > > > @@ -36,4 +41,18 @@ int __cpu_resume_enter(unsigned long hartid, unsigned long context);
> > > > > > > > >  /* Used to save and restore the csr */
> > > > > > > > >  void suspend_save_csrs(struct suspend_context *context);
> > > > > > > > >  void suspend_restore_csrs(struct suspend_context *context);
> > > > > > > > > +
> > > > > > > > > +/* Low-level API to support hibernation */
> > > > > > > > > +int swsusp_arch_suspend(void);
> > > > > > > > > +int swsusp_arch_resume(void);
> > > > > > > > > +int arch_hibernation_header_save(void *addr, unsigned int max_size);
> > > > > > > > > +int arch_hibernation_header_restore(void *addr);
> > > > > > > > > +int __hibernate_cpu_resume(void);
> > > > > > > > > +
> > > > > > > > > +/* Used to resume on the CPU we hibernated on */
> > > > > > > > > +int hibernate_resume_nonboot_cpu_disable(void);
> > > > > > > > > +
> > > > > > > > > +asmlinkage void hibernate_restore_image(unsigned long resume_satp, unsigned long satp_temp,
> > > > > > > > > +					unsigned long cpu_resume);
> > > > > > > > > +asmlinkage int hibernate_core_restore_code(void);
> > > > > > > > >  #endif
> > > > > > > > > diff --git a/arch/riscv/kernel/Makefile b/arch/riscv/kernel/Makefile
> > > > > > > > > index 4cf303a779ab..daab341d55e4 100644
> > > > > > > > > --- a/arch/riscv/kernel/Makefile
> > > > > > > > > +++ b/arch/riscv/kernel/Makefile
> > > > > > > > > @@ -64,6 +64,7 @@ obj-$(CONFIG_MODULES)		+= module.o
> > > > > > > > >  obj-$(CONFIG_MODULE_SECTIONS)	+= module-sections.o
> > > > > > > > >
> > > > > > > > >  obj-$(CONFIG_CPU_PM)		+= suspend_entry.o suspend.o
> > > > > > > > > +obj-$(CONFIG_HIBERNATION)	+= hibernate.o hibernate-asm.o
> > > > > > > > >
> > > > > > > > >  obj-$(CONFIG_FUNCTION_TRACER)	+= mcount.o ftrace.o
> > > > > > > > >  obj-$(CONFIG_DYNAMIC_FTRACE)	+= mcount-dyn.o
> > > > > > > > > diff --git a/arch/riscv/kernel/asm-offsets.c b/arch/riscv/kernel/asm-offsets.c
> > > > > > > > > index df9444397908..d6a75aac1d27 100644
> > > > > > > > > --- a/arch/riscv/kernel/asm-offsets.c
> > > > > > > > > +++ b/arch/riscv/kernel/asm-offsets.c
> > > > > > > > > @@ -9,6 +9,7 @@
> > > > > > > > >  #include <linux/kbuild.h>
> > > > > > > > >  #include <linux/mm.h>
> > > > > > > > >  #include <linux/sched.h>
> > > > > > > > > +#include <linux/suspend.h>
> > > > > > > > >  #include <asm/kvm_host.h>
> > > > > > > > >  #include <asm/thread_info.h>
> > > > > > > > >  #include <asm/ptrace.h>
> > > > > > > > > @@ -116,6 +117,10 @@ void asm_offsets(void)
> > > > > > > > >
> > > > > > > > >  	OFFSET(SUSPEND_CONTEXT_REGS, suspend_context, regs);
> > > > > > > > >
> > > > > > > > > +	OFFSET(HIBERN_PBE_ADDR, pbe, address);
> > > > > > > > > +	OFFSET(HIBERN_PBE_ORIG, pbe, orig_address);
> > > > > > > > > +	OFFSET(HIBERN_PBE_NEXT, pbe, next);
> > > > > > > > > +
> > > > > > > > >  	OFFSET(KVM_ARCH_GUEST_ZERO, kvm_vcpu_arch, guest_context.zero);
> > > > > > > > >  	OFFSET(KVM_ARCH_GUEST_RA, kvm_vcpu_arch, guest_context.ra);
> > > > > > > > >  	OFFSET(KVM_ARCH_GUEST_SP, kvm_vcpu_arch, guest_context.sp);
> > > > > > > > > diff --git a/arch/riscv/kernel/hibernate-asm.S b/arch/riscv/kernel/hibernate-asm.S
> > > > > > > > > new file mode 100644
> > > > > > > > > index 000000000000..846affe4dced
> > > > > > > > > --- /dev/null
> > > > > > > > > +++ b/arch/riscv/kernel/hibernate-asm.S
> > > > > > > > > @@ -0,0 +1,77 @@
> > > > > > > > > +/* SPDX-License-Identifier: GPL-2.0-only */
> > > > > > > > > +/*
> > > > > > > > > + * Hibernation low level support for RISCV.
> > > > > > > > > + *
> > > > > > > > > + * Copyright (C) 2023 StarFive Technology Co., Ltd.
> > > > > > > > > + *
> > > > > > > > > + * Author: Jee Heng Sia <jeeheng.sia@starfivetech.com>
> > > > > > > > > + */
> > > > > > > > > +
> > > > > > > > > +#include <asm/asm.h>
> > > > > > > > > +#include <asm/asm-offsets.h>
> > > > > > > > > +#include <asm/assembler.h>
> > > > > > > > > +#include <asm/csr.h>
> > > > > > > > > +
> > > > > > > > > +#include <linux/linkage.h>
> > > > > > > > > +
> > > > > > > > > +/*
> > > > > > > > > + * int __hibernate_cpu_resume(void)
> > > > > > > > > + * Switch back to the hibernated image's page table prior to restoring the CPU
> > > > > > > > > + * context.
> > > > > > > > > + *
> > > > > > > > > + * Always returns 0
> > > > > > > > > + */
> > > > > > > > > +ENTRY(__hibernate_cpu_resume)
> > > > > > > > > +	/* switch to hibernated image's page table. */
> > > > > > > > > +	csrw CSR_SATP, s0
> > > > > > > > > +	sfence.vma
> > > > > > > > > +
> > > > > > > > > +	REG_L	a0, hibernate_cpu_context
> > > > > > > > > +
> > > > > > > > > +	restore_csr
> > > > > > > > > +	restore_reg
> > > > > > > > > +
> > > > > > > > > +	/* Return zero value. */
> > > > > > > > > +	add	a0, zero, zero
> > > > > > > >
> > > > > > > > nit: mv a0, zero
> > > > > > > sure
> > > > > > > >
> > > > > > > > > +
> > > > > > > > > +	ret
> > > > > > > > > +END(__hibernate_cpu_resume)
> > > > > > > > > +
> > > > > > > > > +/*
> > > > > > > > > + * Prepare to restore the image.
> > > > > > > > > + * a0: satp of saved page tables.
> > > > > > > > > + * a1: satp of temporary page tables.
> > > > > > > > > + * a2: cpu_resume.
> > > > > > > > > + */
> > > > > > > > > +ENTRY(hibernate_restore_image)
> > > > > > > > > +	mv	s0, a0
> > > > > > > > > +	mv	s1, a1
> > > > > > > > > +	mv	s2, a2
> > > > > > > > > +	REG_L	s4, restore_pblist
> > > > > > > > > +	REG_L	a1, relocated_restore_code
> > > > > > > > > +
> > > > > > > > > +	jalr	a1
> > > > > > > > > +END(hibernate_restore_image)
> > > > > > > > > +
> > > > > > > > > +/*
> > > > > > > > > + * The below code will be executed from a 'safe' page.
> > > > > > > > > + * It first switches to the temporary page table, then starts to copy the pages
> > > > > > > > > + * back to the original memory location. Finally, it jumps to __hibernate_cpu_resume()
> > > > > > > > > + * to restore the CPU context.
> > > > > > > > > + */
> > > > > > > > > +ENTRY(hibernate_core_restore_code)
> > > > > > > > > +	/* switch to temp page table. */
> > > > > > > > > +	csrw satp, s1
> > > > > > > > > +	sfence.vma
> > > > > > > > > +.Lcopy:
> > > > > > > > > +	/* The below code will restore the hibernated image. */
> > > > > > > > > +	REG_L	a1, HIBERN_PBE_ADDR(s4)
> > > > > > > > > +	REG_L	a0, HIBERN_PBE_ORIG(s4)
> > > > > > > >
> > > > > > > > Are we sure restore_pblist will never be NULL?
> > > > > > > restore_pblist is a link-list, it will be null during initialization or during page clean up by hibernation core. During the initial
> > > > resume
> > > > > > process, the hibernation core will check the header and load the pages. If everything works correctly, the page will be linked to
> > the
> > > > > > restore_pblist and then invoke swsusp_arch_resume() else hibernation core will throws error and failed to resume from the
> > > > > > hibernated image.
> > > > > >
> > > > > > I know restore_pblist is a linked-list and this doesn't answer the
> > > > > > question. The comment above restore_pblist says
> > > > > >
> > > > > > /*
> > > > > >  * List of PBEs needed for restoring the pages that were allocated before
> > > > > >  * the suspend and included in the suspend image, but have also been
> > > > > >  * allocated by the "resume" kernel, so their contents cannot be written
> > > > > >  * directly to their "original" page frames.
> > > > > >  */
> > > > > >
> > > > > > which implies the pages that end up on this list are "special". My
> > > > > > question is whether or not we're guaranteed to have at least one
> > > > > > of these special pages. If not, we shouldn't assume s4 is non-null.
> > > > > > If so, then a comment stating why that's guaranteed would be nice.
> > > > > The restore_pblist will not be null otherwise swsusp_arch_resume wouldn't get invoked. you can find how the link-list are link
> > and
> > > > how it checks against validity at https://elixir.bootlin.com/linux/v6.2-rc8/source/kernel/power/snapshot.c . " A comment stating
> > why
> > > > that's guaranteed would be nice" ? Hmm, perhaps this is out of my scope but I do believe in the page validity checking in the link I
> > > > shared.
> > > >
> > > > Sorry, but pointing to an entire source file (one that I've obviously
> > > > already looked at, since I quoted a comment from it...) is not helpful.
> > > > I don't see where restore_pblist is being checked before
> > > > swsusp_arch_resume() is issued (from its callsite in hibernate.c).
> > > Sure, below shows the hibernation flow for your reference. The link-list creation and checking found at:
> > https://elixir.bootlin.com/linux/v6.2/source/kernel/power/snapshot.c#L2576
> > > software_resume()
> > > 	load_image_and_restore()
> > > 		swsusp_read()
> > > 			load_image()
> > >  				snapshot_write_next()
> > > 					get_buffer() <-- This is the function checks and links the pages to the restore_pblist
> > 
> > Yup, I've read this path, including get_buffer(), where I saw that
> > get_buffer() can return an address without allocating a PBE. Where is the
> > check that restore_pblist isn't NULL, i.e. we see that at least one PBE
> > has been allocated by get_buffer(), before we call swsusp_arch_resume()?
> > 
> > Or, is known that at least one or more pages match the criteria pointed
> > out in the comment below (copied from get_buffer())?
> > 
> >         /*
> >          * The "original" page frame has not been allocated and we have to
> >          * use a "safe" page frame to store the loaded page.
> >          */
> > 
> > If so, then which ones? And where does it state that?
> Let's look at the pseudocode below and hope it clears your doubt. restore_pblist depends on safe_pages_list and pbe, and both pointers are checked. I couldn't find where restore_pblist could be NULL.
> 	//Pseudocode to illustrate the image loading
> 	initialize restore_pblist to null;
> 	initialize safe_pages_list to null;
> 	Allocate safe page list, return error if failed;
> 	load image;
> loop:	Create pbe chain, return error if failed;

This loop pseudocode is incomplete. It's

loop:
        if (swsusp_page_is_forbidden(page) && swsusp_page_is_free(page))
	   return page_address(page);
	Create pbe chain, return error if failed;
	...

which I pointed out explicitly in my last reply. Also, as I asked in my
last reply (and have been asking four times now, albeit less explicitly
the first two times), how do we know at least one PBE will be linked?
Or, even more specifically this time, where is the proof that for each
hibernation resume, there exists some page such that
!swsusp_page_is_forbidden(page) or !swsusp_page_is_free(page) is true?
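
To make the two paths concrete, here is the relevant part of get_buffer(),
paraphrased from kernel/power/snapshot.c (error handling and highmem
trimmed, so not the exact upstream code):

	if (swsusp_page_is_forbidden(page) && swsusp_page_is_free(page))
		/*
		 * The resume kernel has already allocated the "original"
		 * page frame, so the loaded data is written straight into
		 * it -- no PBE is linked in this case.
		 */
		return page_address(page);

	/*
	 * Otherwise the original page frame cannot be written directly:
	 * allocate a struct pbe, point it at the original address and at
	 * a "safe" page for the copy, and link it onto restore_pblist.
	 */
	pbe = chain_alloc(ca, sizeof(struct pbe));
	pbe->orig_address = page_address(page);
	pbe->address = (void *)get_safe_page(ca->gfp_mask);
	pbe->next = restore_pblist;
	restore_pblist = pbe;
	return pbe->address;

restore_pblist only grows when the second path is taken, which is why I
keep asking which pages are guaranteed to take it.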

Thanks,
drew

> 	assign orig_addr and safe_page to pbe;
> 	link pbe to restore_pblist;
> 	return pbe to handle->buffer;
> 	check handle->buffer;
> 	goto loop if no error else return with error;
> > 
> > Thanks,
> > drew
> > 
> > 
> > > 		hibernation_restore()
> > > 			resume_target_kernel()
> > > 				swsusp_arch_resume()
> > > >
> > > > Thanks,
> > > > drew
Sia Jee Heng Feb. 27, 2023, 10:52 a.m. UTC | #13
> -----Original Message-----
> From: Andrew Jones <ajones@ventanamicro.com>
> Sent: Monday, 27 February, 2023 4:00 PM
> To: JeeHeng Sia <jeeheng.sia@starfivetech.com>
> Cc: paul.walmsley@sifive.com; palmer@dabbelt.com; aou@eecs.berkeley.edu; linux-riscv@lists.infradead.org; linux-
> kernel@vger.kernel.org; Leyfoon Tan <leyfoon.tan@starfivetech.com>; Mason Huo <mason.huo@starfivetech.com>
> Subject: Re: [PATCH v4 4/4] RISC-V: Add arch functions to support hibernation/suspend-to-disk
> 
> On Mon, Feb 27, 2023 at 02:14:27AM +0000, JeeHeng Sia wrote:
> >
> >
> > > -----Original Message-----
> > > From: Andrew Jones <ajones@ventanamicro.com>
> > > Sent: Friday, 24 February, 2023 8:07 PM
> > > To: JeeHeng Sia <jeeheng.sia@starfivetech.com>
> > > Cc: paul.walmsley@sifive.com; palmer@dabbelt.com; aou@eecs.berkeley.edu; linux-riscv@lists.infradead.org; linux-
> > > kernel@vger.kernel.org; Leyfoon Tan <leyfoon.tan@starfivetech.com>; Mason Huo <mason.huo@starfivetech.com>
> > > Subject: Re: [PATCH v4 4/4] RISC-V: Add arch functions to support hibernation/suspend-to-disk
> > >
> > > On Fri, Feb 24, 2023 at 10:30:19AM +0000, JeeHeng Sia wrote:
> > > >
> > > >
> > > > > -----Original Message-----
> > > > > From: Andrew Jones <ajones@ventanamicro.com>
> > > > > Sent: Friday, 24 February, 2023 5:55 PM
> > > > > To: JeeHeng Sia <jeeheng.sia@starfivetech.com>
> > > > > Cc: paul.walmsley@sifive.com; palmer@dabbelt.com; aou@eecs.berkeley.edu; linux-riscv@lists.infradead.org; linux-
> > > > > kernel@vger.kernel.org; Leyfoon Tan <leyfoon.tan@starfivetech.com>; Mason Huo <mason.huo@starfivetech.com>
> > > > > Subject: Re: [PATCH v4 4/4] RISC-V: Add arch functions to support hibernation/suspend-to-disk
> > > > >
> > > > > On Fri, Feb 24, 2023 at 09:33:31AM +0000, JeeHeng Sia wrote:
> > > > > >
> > > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: Andrew Jones <ajones@ventanamicro.com>
> > > > > > > Sent: Friday, 24 February, 2023 5:00 PM
> > > > > > > To: JeeHeng Sia <jeeheng.sia@starfivetech.com>
> > > > > > > Cc: paul.walmsley@sifive.com; palmer@dabbelt.com; aou@eecs.berkeley.edu; linux-riscv@lists.infradead.org; linux-
> > > > > > > kernel@vger.kernel.org; Leyfoon Tan <leyfoon.tan@starfivetech.com>; Mason Huo <mason.huo@starfivetech.com>
> > > > > > > Subject: Re: [PATCH v4 4/4] RISC-V: Add arch functions to support hibernation/suspend-to-disk
> > > > > > >
> > > > > > > On Fri, Feb 24, 2023 at 02:05:43AM +0000, JeeHeng Sia wrote:
> > > > > > > >
> > > > > > > >
> > > > > > > > > -----Original Message-----
> > > > > > > > > From: Andrew Jones <ajones@ventanamicro.com>
> > > > > > > > > Sent: Friday, 24 February, 2023 2:07 AM
> > > > > > > > > To: JeeHeng Sia <jeeheng.sia@starfivetech.com>
> > > > > > > > > Cc: paul.walmsley@sifive.com; palmer@dabbelt.com; aou@eecs.berkeley.edu; linux-riscv@lists.infradead.org; linux-
> > > > > > > > > kernel@vger.kernel.org; Leyfoon Tan <leyfoon.tan@starfivetech.com>; Mason Huo <mason.huo@starfivetech.com>
> > > > > > > > > Subject: Re: [PATCH v4 4/4] RISC-V: Add arch functions to support hibernation/suspend-to-disk
> > > > > > > > >
> > > > > > > > > On Tue, Feb 21, 2023 at 10:35:23AM +0800, Sia Jee Heng wrote:
> > > > > > > > > > Low level Arch functions were created to support hibernation.
> > > > > > > > > > swsusp_arch_suspend() relies code from __cpu_suspend_enter() to write
> > > > > > > > > > cpu state onto the stack, then calling swsusp_save() to save the memory
> > > > > > > > > > image.
> > > > > > > > > >
> > > > > > > > > > Arch specific hibernation header is implemented and is utilized by the
> > > > > > > > > > arch_hibernation_header_restore() and arch_hibernation_header_save()
> > > > > > > > > > functions. The arch specific hibernation header consists of satp, hartid,
> > > > > > > > > > and the cpu_resume address. The kernel built version is also need to be
> > > > > > > > > > saved into the hibernation image header to making sure only the same
> > > > > > > > > > kernel is restore when resume.
> > > > > > > > > >
> > > > > > > > > > swsusp_arch_resume() creates a temporary page table that covering only
> > > > > > > > > > the linear map. It copies the restore code to a 'safe' page, then start
> > > > > > > > > > to restore the memory image. Once completed, it restores the original
> > > > > > > > > > kernel's page table. It then calls into __hibernate_cpu_resume()
> > > > > > > > > > to restore the CPU context. Finally, it follows the normal hibernation
> > > > > > > > > > path back to the hibernation core.
> > > > > > > > > >
> > > > > > > > > > To enable hibernation/suspend to disk into RISCV, the below config
> > > > > > > > > > need to be enabled:
> > > > > > > > > > - CONFIG_ARCH_HIBERNATION_HEADER
> > > > > > > > > > - CONFIG_ARCH_HIBERNATION_POSSIBLE
> > > > > > > > > >
> > > > > > > > > > Signed-off-by: Sia Jee Heng <jeeheng.sia@starfivetech.com>
> > > > > > > > > > Reviewed-by: Ley Foon Tan <leyfoon.tan@starfivetech.com>
> > > > > > > > > > Reviewed-by: Mason Huo <mason.huo@starfivetech.com>
> > > > > > > > > > ---
> > > > > > > > > >  arch/riscv/Kconfig                 |   7 +
> > > > > > > > > >  arch/riscv/include/asm/assembler.h |  20 ++
> > > > > > > > > >  arch/riscv/include/asm/suspend.h   |  19 ++
> > > > > > > > > >  arch/riscv/kernel/Makefile         |   1 +
> > > > > > > > > >  arch/riscv/kernel/asm-offsets.c    |   5 +
> > > > > > > > > >  arch/riscv/kernel/hibernate-asm.S  |  77 +++++
> > > > > > > > > >  arch/riscv/kernel/hibernate.c      | 447 +++++++++++++++++++++++++++++
> > > > > > > > > >  7 files changed, 576 insertions(+)
> > > > > > > > > >  create mode 100644 arch/riscv/kernel/hibernate-asm.S
> > > > > > > > > >  create mode 100644 arch/riscv/kernel/hibernate.c
> > > > > > > > > >
> > > > > > > > > > diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
> > > > > > > > > > index e2b656043abf..4555848a817f 100644
> > > > > > > > > > --- a/arch/riscv/Kconfig
> > > > > > > > > > +++ b/arch/riscv/Kconfig
> > > > > > > > > > @@ -690,6 +690,13 @@ menu "Power management options"
> > > > > > > > > >
> > > > > > > > > >  source "kernel/power/Kconfig"
> > > > > > > > > >
> > > > > > > > > > +config ARCH_HIBERNATION_POSSIBLE
> > > > > > > > > > +	def_bool y
> > > > > > > > > > +
> > > > > > > > > > +config ARCH_HIBERNATION_HEADER
> > > > > > > > > > +	def_bool y
> > > > > > > > > > +	depends on HIBERNATION
> > > > > > > > >
> > > > > > > > > nit: I think this can be simplified as def_bool HIBERNATION
> > > > > > > > good suggestion. will change it.
> > > > > > > > >
> > > > > > > > > > +
> > > > > > > > > >  endmenu # "Power management options"
> > > > > > > > > >
> > > > > > > > > >  menu "CPU Power Management"
> > > > > > > > > > diff --git a/arch/riscv/include/asm/assembler.h b/arch/riscv/include/asm/assembler.h
> > > > > > > > > > index 727a97735493..68c46c0e0ea8 100644
> > > > > > > > > > --- a/arch/riscv/include/asm/assembler.h
> > > > > > > > > > +++ b/arch/riscv/include/asm/assembler.h
> > > > > > > > > > @@ -59,4 +59,24 @@
> > > > > > > > > >  		REG_L	s11, (SUSPEND_CONTEXT_REGS + PT_S11)(a0)
> > > > > > > > > >  	.endm
> > > > > > > > > >
> > > > > > > > > > +/*
> > > > > > > > > > + * copy_page - copy 1 page (4KB) of data from source to destination
> > > > > > > > > > + * @a0 - destination
> > > > > > > > > > + * @a1 - source
> > > > > > > > > > + */
> > > > > > > > > > +	.macro	copy_page a0, a1
> > > > > > > > > > +		lui	a2, 0x1
> > > > > > > > > > +		add	a2, a2, a0
> > > > > > > > > > +1 :
> > > > > > > > >     ^ please remove this space
> > > > > > > > can't remove it otherwise checkpatch will throws ERROR: spaces required around that ':'
> > > > > > >
> > > > > > > Oh, right, labels in macros have this requirement.
> > > > > > >
> > > > > > > > >
> > > > > > > > > > +		REG_L	t0, 0(a1)
> > > > > > > > > > +		REG_L	t1, SZREG(a1)
> > > > > > > > > > +
> > > > > > > > > > +		REG_S	t0, 0(a0)
> > > > > > > > > > +		REG_S	t1, SZREG(a0)
> > > > > > > > > > +
> > > > > > > > > > +		addi	a0, a0, 2 * SZREG
> > > > > > > > > > +		addi	a1, a1, 2 * SZREG
> > > > > > > > > > +		bne	a2, a0, 1b
> > > > > > > > > > +	.endm
> > > > > > > > > > +
> > > > > > > > > >  #endif	/* __ASM_ASSEMBLER_H */
> > > > > > > > > > diff --git a/arch/riscv/include/asm/suspend.h b/arch/riscv/include/asm/suspend.h
> > > > > > > > > > index 75419c5ca272..3362da56a9d8 100644
> > > > > > > > > > --- a/arch/riscv/include/asm/suspend.h
> > > > > > > > > > +++ b/arch/riscv/include/asm/suspend.h
> > > > > > > > > > @@ -21,6 +21,11 @@ struct suspend_context {
> > > > > > > > > >  #endif
> > > > > > > > > >  };
> > > > > > > > > >
> > > > > > > > > > +/*
> > > > > > > > > > + * Used by hibernation core and cleared during resume sequence
> > > > > > > > > > + */
> > > > > > > > > > +extern int in_suspend;
> > > > > > > > > > +
> > > > > > > > > >  /* Low-level CPU suspend entry function */
> > > > > > > > > >  int __cpu_suspend_enter(struct suspend_context *context);
> > > > > > > > > >
> > > > > > > > > > @@ -36,4 +41,18 @@ int __cpu_resume_enter(unsigned long hartid, unsigned long context);
> > > > > > > > > >  /* Used to save and restore the csr */
> > > > > > > > > >  void suspend_save_csrs(struct suspend_context *context);
> > > > > > > > > >  void suspend_restore_csrs(struct suspend_context *context);
> > > > > > > > > > +
> > > > > > > > > > +/* Low-level API to support hibernation */
> > > > > > > > > > +int swsusp_arch_suspend(void);
> > > > > > > > > > +int swsusp_arch_resume(void);
> > > > > > > > > > +int arch_hibernation_header_save(void *addr, unsigned int max_size);
> > > > > > > > > > +int arch_hibernation_header_restore(void *addr);
> > > > > > > > > > +int __hibernate_cpu_resume(void);
> > > > > > > > > > +
> > > > > > > > > > +/* Used to resume on the CPU we hibernated on */
> > > > > > > > > > +int hibernate_resume_nonboot_cpu_disable(void);
> > > > > > > > > > +
> > > > > > > > > > +asmlinkage void hibernate_restore_image(unsigned long resume_satp, unsigned long satp_temp,
> > > > > > > > > > +					unsigned long cpu_resume);
> > > > > > > > > > +asmlinkage int hibernate_core_restore_code(void);
> > > > > > > > > >  #endif
> > > > > > > > > > diff --git a/arch/riscv/kernel/Makefile b/arch/riscv/kernel/Makefile
> > > > > > > > > > index 4cf303a779ab..daab341d55e4 100644
> > > > > > > > > > --- a/arch/riscv/kernel/Makefile
> > > > > > > > > > +++ b/arch/riscv/kernel/Makefile
> > > > > > > > > > @@ -64,6 +64,7 @@ obj-$(CONFIG_MODULES)		+= module.o
> > > > > > > > > >  obj-$(CONFIG_MODULE_SECTIONS)	+= module-sections.o
> > > > > > > > > >
> > > > > > > > > >  obj-$(CONFIG_CPU_PM)		+= suspend_entry.o suspend.o
> > > > > > > > > > +obj-$(CONFIG_HIBERNATION)	+= hibernate.o hibernate-asm.o
> > > > > > > > > >
> > > > > > > > > >  obj-$(CONFIG_FUNCTION_TRACER)	+= mcount.o ftrace.o
> > > > > > > > > >  obj-$(CONFIG_DYNAMIC_FTRACE)	+= mcount-dyn.o
> > > > > > > > > > diff --git a/arch/riscv/kernel/asm-offsets.c b/arch/riscv/kernel/asm-offsets.c
> > > > > > > > > > index df9444397908..d6a75aac1d27 100644
> > > > > > > > > > --- a/arch/riscv/kernel/asm-offsets.c
> > > > > > > > > > +++ b/arch/riscv/kernel/asm-offsets.c
> > > > > > > > > > @@ -9,6 +9,7 @@
> > > > > > > > > >  #include <linux/kbuild.h>
> > > > > > > > > >  #include <linux/mm.h>
> > > > > > > > > >  #include <linux/sched.h>
> > > > > > > > > > +#include <linux/suspend.h>
> > > > > > > > > >  #include <asm/kvm_host.h>
> > > > > > > > > >  #include <asm/thread_info.h>
> > > > > > > > > >  #include <asm/ptrace.h>
> > > > > > > > > > @@ -116,6 +117,10 @@ void asm_offsets(void)
> > > > > > > > > >
> > > > > > > > > >  	OFFSET(SUSPEND_CONTEXT_REGS, suspend_context, regs);
> > > > > > > > > >
> > > > > > > > > > +	OFFSET(HIBERN_PBE_ADDR, pbe, address);
> > > > > > > > > > +	OFFSET(HIBERN_PBE_ORIG, pbe, orig_address);
> > > > > > > > > > +	OFFSET(HIBERN_PBE_NEXT, pbe, next);
> > > > > > > > > > +
> > > > > > > > > >  	OFFSET(KVM_ARCH_GUEST_ZERO, kvm_vcpu_arch, guest_context.zero);
> > > > > > > > > >  	OFFSET(KVM_ARCH_GUEST_RA, kvm_vcpu_arch, guest_context.ra);
> > > > > > > > > >  	OFFSET(KVM_ARCH_GUEST_SP, kvm_vcpu_arch, guest_context.sp);
> > > > > > > > > > diff --git a/arch/riscv/kernel/hibernate-asm.S b/arch/riscv/kernel/hibernate-asm.S
> > > > > > > > > > new file mode 100644
> > > > > > > > > > index 000000000000..846affe4dced
> > > > > > > > > > --- /dev/null
> > > > > > > > > > +++ b/arch/riscv/kernel/hibernate-asm.S
> > > > > > > > > > @@ -0,0 +1,77 @@
> > > > > > > > > > +/* SPDX-License-Identifier: GPL-2.0-only */
> > > > > > > > > > +/*
> > > > > > > > > > + * Hibernation low level support for RISCV.
> > > > > > > > > > + *
> > > > > > > > > > + * Copyright (C) 2023 StarFive Technology Co., Ltd.
> > > > > > > > > > + *
> > > > > > > > > > + * Author: Jee Heng Sia <jeeheng.sia@starfivetech.com>
> > > > > > > > > > + */
> > > > > > > > > > +
> > > > > > > > > > +#include <asm/asm.h>
> > > > > > > > > > +#include <asm/asm-offsets.h>
> > > > > > > > > > +#include <asm/assembler.h>
> > > > > > > > > > +#include <asm/csr.h>
> > > > > > > > > > +
> > > > > > > > > > +#include <linux/linkage.h>
> > > > > > > > > > +
> > > > > > > > > > +/*
> > > > > > > > > > + * int __hibernate_cpu_resume(void)
> > > > > > > > > > + * Switch back to the hibernated image's page table prior to restoring the CPU
> > > > > > > > > > + * context.
> > > > > > > > > > + *
> > > > > > > > > > + * Always returns 0
> > > > > > > > > > + */
> > > > > > > > > > +ENTRY(__hibernate_cpu_resume)
> > > > > > > > > > +	/* switch to hibernated image's page table. */
> > > > > > > > > > +	csrw CSR_SATP, s0
> > > > > > > > > > +	sfence.vma
> > > > > > > > > > +
> > > > > > > > > > +	REG_L	a0, hibernate_cpu_context
> > > > > > > > > > +
> > > > > > > > > > +	restore_csr
> > > > > > > > > > +	restore_reg
> > > > > > > > > > +
> > > > > > > > > > +	/* Return zero value. */
> > > > > > > > > > +	add	a0, zero, zero
> > > > > > > > >
> > > > > > > > > nit: mv a0, zero
> > > > > > > > sure
> > > > > > > > >
> > > > > > > > > > +
> > > > > > > > > > +	ret
> > > > > > > > > > +END(__hibernate_cpu_resume)
> > > > > > > > > > +
> > > > > > > > > > +/*
> > > > > > > > > > + * Prepare to restore the image.
> > > > > > > > > > + * a0: satp of saved page tables.
> > > > > > > > > > + * a1: satp of temporary page tables.
> > > > > > > > > > + * a2: cpu_resume.
> > > > > > > > > > + */
> > > > > > > > > > +ENTRY(hibernate_restore_image)
> > > > > > > > > > +	mv	s0, a0
> > > > > > > > > > +	mv	s1, a1
> > > > > > > > > > +	mv	s2, a2
> > > > > > > > > > +	REG_L	s4, restore_pblist
> > > > > > > > > > +	REG_L	a1, relocated_restore_code
> > > > > > > > > > +
> > > > > > > > > > +	jalr	a1
> > > > > > > > > > +END(hibernate_restore_image)
> > > > > > > > > > +
> > > > > > > > > > +/*
> > > > > > > > > > + * The below code will be executed from a 'safe' page.
> > > > > > > > > > + * It first switches to the temporary page table, then starts to copy the pages
> > > > > > > > > > + * back to the original memory location. Finally, it jumps to __hibernate_cpu_resume()
> > > > > > > > > > + * to restore the CPU context.
> > > > > > > > > > + */
> > > > > > > > > > +ENTRY(hibernate_core_restore_code)
> > > > > > > > > > +	/* switch to temp page table. */
> > > > > > > > > > +	csrw satp, s1
> > > > > > > > > > +	sfence.vma
> > > > > > > > > > +.Lcopy:
> > > > > > > > > > +	/* The below code will restore the hibernated image. */
> > > > > > > > > > +	REG_L	a1, HIBERN_PBE_ADDR(s4)
> > > > > > > > > > +	REG_L	a0, HIBERN_PBE_ORIG(s4)
> > > > > > > > >
> > > > > > > > > Are we sure restore_pblist will never be NULL?
> > > > > > > > restore_pblist is a link-list, it will be null during initialization or during page clean up by hibernation core. During the initial
> > > > > resume
> > > > > > > process, the hibernation core will check the header and load the pages. If everything works correctly, the page will be linked
> to
> > > the
> > > > > > > restore_pblist and then invoke swsusp_arch_resume() else hibernation core will throws error and failed to resume from
> the
> > > > > > > hibernated image.
> > > > > > >
> > > > > > > I know restore_pblist is a linked-list and this doesn't answer the
> > > > > > > question. The comment above restore_pblist says
> > > > > > >
> > > > > > > /*
> > > > > > >  * List of PBEs needed for restoring the pages that were allocated before
> > > > > > >  * the suspend and included in the suspend image, but have also been
> > > > > > >  * allocated by the "resume" kernel, so their contents cannot be written
> > > > > > >  * directly to their "original" page frames.
> > > > > > >  */
> > > > > > >
> > > > > > > which implies the pages that end up on this list are "special". My
> > > > > > > question is whether or not we're guaranteed to have at least one
> > > > > > > of these special pages. If not, we shouldn't assume s4 is non-null.
> > > > > > > If so, then a comment stating why that's guaranteed would be nice.
> > > > > > The restore_pblist will not be null otherwise swsusp_arch_resume wouldn't get invoked. you can find how the link-list are
> link
> > > and
> > > > > how it checks against validity at https://elixir.bootlin.com/linux/v6.2-rc8/source/kernel/power/snapshot.c . " A comment
> stating
> > > why
> > > > > that's guaranteed would be nice" ? Hmm, perhaps this is out of my scope but I do believe in the page validity checking in the
> link I
> > > > > shared.
> > > > >
> > > > > Sorry, but pointing to an entire source file (one that I've obviously
> > > > > already looked at, since I quoted a comment from it...) is not helpful.
> > > > > I don't see where restore_pblist is being checked before
> > > > > swsusp_arch_resume() is issued (from its callsite in hibernate.c).
> > > > Sure, below shows the hibernation flow for your reference. The link-list creation and checking found at:
> > > https://elixir.bootlin.com/linux/v6.2/source/kernel/power/snapshot.c#L2576
> > > > software_resume()
> > > > 	load_image_and_restore()
> > > > 		swsusp_read()
> > > > 			load_image()
> > > >  				snapshot_write_next()
> > > > 					get_buffer() <-- This is the function checks and links the pages to the restore_pblist
> > >
> > > Yup, I've read this path, including get_buffer(), where I saw that
> > > get_buffer() can return an address without allocating a PBE. Where is the
> > > check that restore_pblist isn't NULL, i.e. we see that at least one PBE
> > > has been allocated by get_buffer(), before we call swsusp_arch_resume()?
> > >
> > > Or, is known that at least one or more pages match the criteria pointed
> > > out in the comment below (copied from get_buffer())?
> > >
> > >         /*
> > >          * The "original" page frame has not been allocated and we have to
> > >          * use a "safe" page frame to store the loaded page.
> > >          */
> > >
> > > If so, then which ones? And where does it state that?
> > Let's look at the pseudocode below and hope it clears your doubt. restore_pblist depends on safe_pages_list and pbe, and both
> > pointers are checked. I couldn't find where restore_pblist could be NULL.
> > 	//Pseudocode to illustrate the image loading
> > 	initialize restore_pblist to null;
> > 	initialize safe_pages_list to null;
> > 	Allocate safe page list, return error if failed;
> > 	load image;
> > loop:	Create pbe chain, return error if failed;
> 
> This loop pseudocode is incomplete. It's
> 
> loop:
>         if (swsusp_page_is_forbidden(page) && swsusp_page_is_free(page))
> 	   return page_address(page);
> 	Create pbe chain, return error if failed;
> 	...
> 
> which I pointed out explicitly in my last reply. Also, as I asked in my
> last reply (and have been asking four times now, albeit less explicitly
> the first two times), how do we know at least one PBE will be linked?
1 PBE corresponds to 1 page; you shouldn't expect only 1 page to be saved. The hibernation core will do the calculation. If the PBEs (restore_pblist) are linked successfully, the hibernated image will be restored; otherwise a normal boot will take place.
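
For reference, each entry on that list is just a struct pbe as declared in include/linux/suspend.h, which is what the new HIBERN_PBE_* asm-offsets in this patch refer to:

	struct pbe {
		void *address;		/* address of the copy */
		void *orig_address;	/* original address of a page */
		struct pbe *next;
	};
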
> Or, even more specifically this time, where is the proof that for each
> hibernation resume, there exists some page such that
> !swsusp_page_is_forbidden(page) or !swsusp_page_is_free(page) is true?
forbidden_pages and free_pages do not contribute to the restore_pblist (as you are already aware from the code). In fact, the forbidden_pages and free_pages are not saved to disk at all.
> 
> Thanks,
> drew
> 
> > 	assign orig_addr and safe_page to pbe;
> > 	link pbe to restore_pblist;
> > 	return pbe to handle->buffer;
> > 	check handle->buffer;
> > 	goto loop if no error else return with error;
> > >
> > > Thanks,
> > > drew
> > >
> > >
> > > > 		hibernation_restore()
> > > > 			resume_target_kernel()
> > > > 				swsusp_arch_resume()
> > > > >
> > > > > Thanks,
> > > > > drew
Andrew Jones Feb. 27, 2023, 11:44 a.m. UTC | #14
On Mon, Feb 27, 2023 at 10:52:32AM +0000, JeeHeng Sia wrote:
> 
> 
> > -----Original Message-----
> > From: Andrew Jones <ajones@ventanamicro.com>
> > Sent: Monday, 27 February, 2023 4:00 PM
> > To: JeeHeng Sia <jeeheng.sia@starfivetech.com>
> > Cc: paul.walmsley@sifive.com; palmer@dabbelt.com; aou@eecs.berkeley.edu; linux-riscv@lists.infradead.org; linux-
> > kernel@vger.kernel.org; Leyfoon Tan <leyfoon.tan@starfivetech.com>; Mason Huo <mason.huo@starfivetech.com>
> > Subject: Re: [PATCH v4 4/4] RISC-V: Add arch functions to support hibernation/suspend-to-disk
> > 
> > On Mon, Feb 27, 2023 at 02:14:27AM +0000, JeeHeng Sia wrote:
> > >
> > >
> > > > -----Original Message-----
> > > > From: Andrew Jones <ajones@ventanamicro.com>
> > > > Sent: Friday, 24 February, 2023 8:07 PM
> > > > To: JeeHeng Sia <jeeheng.sia@starfivetech.com>
> > > > Cc: paul.walmsley@sifive.com; palmer@dabbelt.com; aou@eecs.berkeley.edu; linux-riscv@lists.infradead.org; linux-
> > > > kernel@vger.kernel.org; Leyfoon Tan <leyfoon.tan@starfivetech.com>; Mason Huo <mason.huo@starfivetech.com>
> > > > Subject: Re: [PATCH v4 4/4] RISC-V: Add arch functions to support hibernation/suspend-to-disk
> > > >
> > > > On Fri, Feb 24, 2023 at 10:30:19AM +0000, JeeHeng Sia wrote:
> > > > >
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: Andrew Jones <ajones@ventanamicro.com>
> > > > > > Sent: Friday, 24 February, 2023 5:55 PM
> > > > > > To: JeeHeng Sia <jeeheng.sia@starfivetech.com>
> > > > > > Cc: paul.walmsley@sifive.com; palmer@dabbelt.com; aou@eecs.berkeley.edu; linux-riscv@lists.infradead.org; linux-
> > > > > > kernel@vger.kernel.org; Leyfoon Tan <leyfoon.tan@starfivetech.com>; Mason Huo <mason.huo@starfivetech.com>
> > > > > > Subject: Re: [PATCH v4 4/4] RISC-V: Add arch functions to support hibernation/suspend-to-disk
> > > > > >
> > > > > > On Fri, Feb 24, 2023 at 09:33:31AM +0000, JeeHeng Sia wrote:
> > > > > > >
> > > > > > >
> > > > > > > > -----Original Message-----
> > > > > > > > From: Andrew Jones <ajones@ventanamicro.com>
> > > > > > > > Sent: Friday, 24 February, 2023 5:00 PM
> > > > > > > > To: JeeHeng Sia <jeeheng.sia@starfivetech.com>
> > > > > > > > Cc: paul.walmsley@sifive.com; palmer@dabbelt.com; aou@eecs.berkeley.edu; linux-riscv@lists.infradead.org; linux-
> > > > > > > > kernel@vger.kernel.org; Leyfoon Tan <leyfoon.tan@starfivetech.com>; Mason Huo <mason.huo@starfivetech.com>
> > > > > > > > Subject: Re: [PATCH v4 4/4] RISC-V: Add arch functions to support hibernation/suspend-to-disk
> > > > > > > >
> > > > > > > > On Fri, Feb 24, 2023 at 02:05:43AM +0000, JeeHeng Sia wrote:
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > -----Original Message-----
> > > > > > > > > > From: Andrew Jones <ajones@ventanamicro.com>
> > > > > > > > > > Sent: Friday, 24 February, 2023 2:07 AM
> > > > > > > > > > To: JeeHeng Sia <jeeheng.sia@starfivetech.com>
> > > > > > > > > > Cc: paul.walmsley@sifive.com; palmer@dabbelt.com; aou@eecs.berkeley.edu; linux-riscv@lists.infradead.org; linux-
> > > > > > > > > > kernel@vger.kernel.org; Leyfoon Tan <leyfoon.tan@starfivetech.com>; Mason Huo <mason.huo@starfivetech.com>
> > > > > > > > > > Subject: Re: [PATCH v4 4/4] RISC-V: Add arch functions to support hibernation/suspend-to-disk
> > > > > > > > > >
> > > > > > > > > > On Tue, Feb 21, 2023 at 10:35:23AM +0800, Sia Jee Heng wrote:
> > > > > > > > > > > Low level Arch functions were created to support hibernation.
> > > > > > > > > > > swsusp_arch_suspend() relies code from __cpu_suspend_enter() to write
> > > > > > > > > > > cpu state onto the stack, then calling swsusp_save() to save the memory
> > > > > > > > > > > image.
> > > > > > > > > > >
> > > > > > > > > > > Arch specific hibernation header is implemented and is utilized by the
> > > > > > > > > > > arch_hibernation_header_restore() and arch_hibernation_header_save()
> > > > > > > > > > > functions. The arch specific hibernation header consists of satp, hartid,
> > > > > > > > > > > and the cpu_resume address. The kernel built version is also need to be
> > > > > > > > > > > saved into the hibernation image header to making sure only the same
> > > > > > > > > > > kernel is restore when resume.
> > > > > > > > > > >
> > > > > > > > > > > swsusp_arch_resume() creates a temporary page table that covering only
> > > > > > > > > > > the linear map. It copies the restore code to a 'safe' page, then start
> > > > > > > > > > > to restore the memory image. Once completed, it restores the original
> > > > > > > > > > > kernel's page table. It then calls into __hibernate_cpu_resume()
> > > > > > > > > > > to restore the CPU context. Finally, it follows the normal hibernation
> > > > > > > > > > > path back to the hibernation core.
> > > > > > > > > > >
> > > > > > > > > > > To enable hibernation/suspend to disk into RISCV, the below config
> > > > > > > > > > > need to be enabled:
> > > > > > > > > > > - CONFIG_ARCH_HIBERNATION_HEADER
> > > > > > > > > > > - CONFIG_ARCH_HIBERNATION_POSSIBLE
> > > > > > > > > > >
> > > > > > > > > > > Signed-off-by: Sia Jee Heng <jeeheng.sia@starfivetech.com>
> > > > > > > > > > > Reviewed-by: Ley Foon Tan <leyfoon.tan@starfivetech.com>
> > > > > > > > > > > Reviewed-by: Mason Huo <mason.huo@starfivetech.com>
> > > > > > > > > > > ---
> > > > > > > > > > >  arch/riscv/Kconfig                 |   7 +
> > > > > > > > > > >  arch/riscv/include/asm/assembler.h |  20 ++
> > > > > > > > > > >  arch/riscv/include/asm/suspend.h   |  19 ++
> > > > > > > > > > >  arch/riscv/kernel/Makefile         |   1 +
> > > > > > > > > > >  arch/riscv/kernel/asm-offsets.c    |   5 +
> > > > > > > > > > >  arch/riscv/kernel/hibernate-asm.S  |  77 +++++
> > > > > > > > > > >  arch/riscv/kernel/hibernate.c      | 447 +++++++++++++++++++++++++++++
> > > > > > > > > > >  7 files changed, 576 insertions(+)
> > > > > > > > > > >  create mode 100644 arch/riscv/kernel/hibernate-asm.S
> > > > > > > > > > >  create mode 100644 arch/riscv/kernel/hibernate.c
> > > > > > > > > > >
> > > > > > > > > > > diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
> > > > > > > > > > > index e2b656043abf..4555848a817f 100644
> > > > > > > > > > > --- a/arch/riscv/Kconfig
> > > > > > > > > > > +++ b/arch/riscv/Kconfig
> > > > > > > > > > > @@ -690,6 +690,13 @@ menu "Power management options"
> > > > > > > > > > >
> > > > > > > > > > >  source "kernel/power/Kconfig"
> > > > > > > > > > >
> > > > > > > > > > > +config ARCH_HIBERNATION_POSSIBLE
> > > > > > > > > > > +	def_bool y
> > > > > > > > > > > +
> > > > > > > > > > > +config ARCH_HIBERNATION_HEADER
> > > > > > > > > > > +	def_bool y
> > > > > > > > > > > +	depends on HIBERNATION
> > > > > > > > > >
> > > > > > > > > > nit: I think this can be simplified as def_bool HIBERNATION
> > > > > > > > > good suggestion. will change it.
> > > > > > > > > >
> > > > > > > > > > > +
> > > > > > > > > > >  endmenu # "Power management options"
> > > > > > > > > > >
> > > > > > > > > > >  menu "CPU Power Management"
> > > > > > > > > > > diff --git a/arch/riscv/include/asm/assembler.h b/arch/riscv/include/asm/assembler.h
> > > > > > > > > > > index 727a97735493..68c46c0e0ea8 100644
> > > > > > > > > > > --- a/arch/riscv/include/asm/assembler.h
> > > > > > > > > > > +++ b/arch/riscv/include/asm/assembler.h
> > > > > > > > > > > @@ -59,4 +59,24 @@
> > > > > > > > > > >  		REG_L	s11, (SUSPEND_CONTEXT_REGS + PT_S11)(a0)
> > > > > > > > > > >  	.endm
> > > > > > > > > > >
> > > > > > > > > > > +/*
> > > > > > > > > > > + * copy_page - copy 1 page (4KB) of data from source to destination
> > > > > > > > > > > + * @a0 - destination
> > > > > > > > > > > + * @a1 - source
> > > > > > > > > > > + */
> > > > > > > > > > > +	.macro	copy_page a0, a1
> > > > > > > > > > > +		lui	a2, 0x1
> > > > > > > > > > > +		add	a2, a2, a0
> > > > > > > > > > > +1 :
> > > > > > > > > >     ^ please remove this space
> > > > > > > > > can't remove it otherwise checkpatch will throws ERROR: spaces required around that ':'
> > > > > > > >
> > > > > > > > Oh, right, labels in macros have this requirement.
> > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > +		REG_L	t0, 0(a1)
> > > > > > > > > > > +		REG_L	t1, SZREG(a1)
> > > > > > > > > > > +
> > > > > > > > > > > +		REG_S	t0, 0(a0)
> > > > > > > > > > > +		REG_S	t1, SZREG(a0)
> > > > > > > > > > > +
> > > > > > > > > > > +		addi	a0, a0, 2 * SZREG
> > > > > > > > > > > +		addi	a1, a1, 2 * SZREG
> > > > > > > > > > > +		bne	a2, a0, 1b
> > > > > > > > > > > +	.endm
> > > > > > > > > > > +
> > > > > > > > > > >  #endif	/* __ASM_ASSEMBLER_H */
> > > > > > > > > > > diff --git a/arch/riscv/include/asm/suspend.h b/arch/riscv/include/asm/suspend.h
> > > > > > > > > > > index 75419c5ca272..3362da56a9d8 100644
> > > > > > > > > > > --- a/arch/riscv/include/asm/suspend.h
> > > > > > > > > > > +++ b/arch/riscv/include/asm/suspend.h
> > > > > > > > > > > @@ -21,6 +21,11 @@ struct suspend_context {
> > > > > > > > > > >  #endif
> > > > > > > > > > >  };
> > > > > > > > > > >
> > > > > > > > > > > +/*
> > > > > > > > > > > + * Used by hibernation core and cleared during resume sequence
> > > > > > > > > > > + */
> > > > > > > > > > > +extern int in_suspend;
> > > > > > > > > > > +
> > > > > > > > > > >  /* Low-level CPU suspend entry function */
> > > > > > > > > > >  int __cpu_suspend_enter(struct suspend_context *context);
> > > > > > > > > > >
> > > > > > > > > > > @@ -36,4 +41,18 @@ int __cpu_resume_enter(unsigned long hartid, unsigned long context);
> > > > > > > > > > >  /* Used to save and restore the csr */
> > > > > > > > > > >  void suspend_save_csrs(struct suspend_context *context);
> > > > > > > > > > >  void suspend_restore_csrs(struct suspend_context *context);
> > > > > > > > > > > +
> > > > > > > > > > > +/* Low-level API to support hibernation */
> > > > > > > > > > > +int swsusp_arch_suspend(void);
> > > > > > > > > > > +int swsusp_arch_resume(void);
> > > > > > > > > > > +int arch_hibernation_header_save(void *addr, unsigned int max_size);
> > > > > > > > > > > +int arch_hibernation_header_restore(void *addr);
> > > > > > > > > > > +int __hibernate_cpu_resume(void);
> > > > > > > > > > > +
> > > > > > > > > > > +/* Used to resume on the CPU we hibernated on */
> > > > > > > > > > > +int hibernate_resume_nonboot_cpu_disable(void);
> > > > > > > > > > > +
> > > > > > > > > > > +asmlinkage void hibernate_restore_image(unsigned long resume_satp, unsigned long satp_temp,
> > > > > > > > > > > +					unsigned long cpu_resume);
> > > > > > > > > > > +asmlinkage int hibernate_core_restore_code(void);
> > > > > > > > > > >  #endif
> > > > > > > > > > > diff --git a/arch/riscv/kernel/Makefile b/arch/riscv/kernel/Makefile
> > > > > > > > > > > index 4cf303a779ab..daab341d55e4 100644
> > > > > > > > > > > --- a/arch/riscv/kernel/Makefile
> > > > > > > > > > > +++ b/arch/riscv/kernel/Makefile
> > > > > > > > > > > @@ -64,6 +64,7 @@ obj-$(CONFIG_MODULES)		+= module.o
> > > > > > > > > > >  obj-$(CONFIG_MODULE_SECTIONS)	+= module-sections.o
> > > > > > > > > > >
> > > > > > > > > > >  obj-$(CONFIG_CPU_PM)		+= suspend_entry.o suspend.o
> > > > > > > > > > > +obj-$(CONFIG_HIBERNATION)	+= hibernate.o hibernate-asm.o
> > > > > > > > > > >
> > > > > > > > > > >  obj-$(CONFIG_FUNCTION_TRACER)	+= mcount.o ftrace.o
> > > > > > > > > > >  obj-$(CONFIG_DYNAMIC_FTRACE)	+= mcount-dyn.o
> > > > > > > > > > > diff --git a/arch/riscv/kernel/asm-offsets.c b/arch/riscv/kernel/asm-offsets.c
> > > > > > > > > > > index df9444397908..d6a75aac1d27 100644
> > > > > > > > > > > --- a/arch/riscv/kernel/asm-offsets.c
> > > > > > > > > > > +++ b/arch/riscv/kernel/asm-offsets.c
> > > > > > > > > > > @@ -9,6 +9,7 @@
> > > > > > > > > > >  #include <linux/kbuild.h>
> > > > > > > > > > >  #include <linux/mm.h>
> > > > > > > > > > >  #include <linux/sched.h>
> > > > > > > > > > > +#include <linux/suspend.h>
> > > > > > > > > > >  #include <asm/kvm_host.h>
> > > > > > > > > > >  #include <asm/thread_info.h>
> > > > > > > > > > >  #include <asm/ptrace.h>
> > > > > > > > > > > @@ -116,6 +117,10 @@ void asm_offsets(void)
> > > > > > > > > > >
> > > > > > > > > > >  	OFFSET(SUSPEND_CONTEXT_REGS, suspend_context, regs);
> > > > > > > > > > >
> > > > > > > > > > > +	OFFSET(HIBERN_PBE_ADDR, pbe, address);
> > > > > > > > > > > +	OFFSET(HIBERN_PBE_ORIG, pbe, orig_address);
> > > > > > > > > > > +	OFFSET(HIBERN_PBE_NEXT, pbe, next);
> > > > > > > > > > > +
> > > > > > > > > > >  	OFFSET(KVM_ARCH_GUEST_ZERO, kvm_vcpu_arch, guest_context.zero);
> > > > > > > > > > >  	OFFSET(KVM_ARCH_GUEST_RA, kvm_vcpu_arch, guest_context.ra);
> > > > > > > > > > >  	OFFSET(KVM_ARCH_GUEST_SP, kvm_vcpu_arch, guest_context.sp);
> > > > > > > > > > > diff --git a/arch/riscv/kernel/hibernate-asm.S b/arch/riscv/kernel/hibernate-asm.S
> > > > > > > > > > > new file mode 100644
> > > > > > > > > > > index 000000000000..846affe4dced
> > > > > > > > > > > --- /dev/null
> > > > > > > > > > > +++ b/arch/riscv/kernel/hibernate-asm.S
> > > > > > > > > > > @@ -0,0 +1,77 @@
> > > > > > > > > > > +/* SPDX-License-Identifier: GPL-2.0-only */
> > > > > > > > > > > +/*
> > > > > > > > > > > + * Hibernation low level support for RISCV.
> > > > > > > > > > > + *
> > > > > > > > > > > + * Copyright (C) 2023 StarFive Technology Co., Ltd.
> > > > > > > > > > > + *
> > > > > > > > > > > + * Author: Jee Heng Sia <jeeheng.sia@starfivetech.com>
> > > > > > > > > > > + */
> > > > > > > > > > > +
> > > > > > > > > > > +#include <asm/asm.h>
> > > > > > > > > > > +#include <asm/asm-offsets.h>
> > > > > > > > > > > +#include <asm/assembler.h>
> > > > > > > > > > > +#include <asm/csr.h>
> > > > > > > > > > > +
> > > > > > > > > > > +#include <linux/linkage.h>
> > > > > > > > > > > +
> > > > > > > > > > > +/*
> > > > > > > > > > > + * int __hibernate_cpu_resume(void)
> > > > > > > > > > > + * Switch back to the hibernated image's page table prior to restoring the CPU
> > > > > > > > > > > + * context.
> > > > > > > > > > > + *
> > > > > > > > > > > + * Always returns 0
> > > > > > > > > > > + */
> > > > > > > > > > > +ENTRY(__hibernate_cpu_resume)
> > > > > > > > > > > +	/* switch to hibernated image's page table. */
> > > > > > > > > > > +	csrw CSR_SATP, s0
> > > > > > > > > > > +	sfence.vma
> > > > > > > > > > > +
> > > > > > > > > > > +	REG_L	a0, hibernate_cpu_context
> > > > > > > > > > > +
> > > > > > > > > > > +	restore_csr
> > > > > > > > > > > +	restore_reg
> > > > > > > > > > > +
> > > > > > > > > > > +	/* Return zero value. */
> > > > > > > > > > > +	add	a0, zero, zero
> > > > > > > > > >
> > > > > > > > > > nit: mv a0, zero
> > > > > > > > > sure
> > > > > > > > > >
> > > > > > > > > > > +
> > > > > > > > > > > +	ret
> > > > > > > > > > > +END(__hibernate_cpu_resume)
> > > > > > > > > > > +
> > > > > > > > > > > +/*
> > > > > > > > > > > + * Prepare to restore the image.
> > > > > > > > > > > + * a0: satp of saved page tables.
> > > > > > > > > > > + * a1: satp of temporary page tables.
> > > > > > > > > > > + * a2: cpu_resume.
> > > > > > > > > > > + */
> > > > > > > > > > > +ENTRY(hibernate_restore_image)
> > > > > > > > > > > +	mv	s0, a0
> > > > > > > > > > > +	mv	s1, a1
> > > > > > > > > > > +	mv	s2, a2
> > > > > > > > > > > +	REG_L	s4, restore_pblist
> > > > > > > > > > > +	REG_L	a1, relocated_restore_code
> > > > > > > > > > > +
> > > > > > > > > > > +	jalr	a1
> > > > > > > > > > > +END(hibernate_restore_image)
> > > > > > > > > > > +
> > > > > > > > > > > +/*
> > > > > > > > > > > + * The below code will be executed from a 'safe' page.
> > > > > > > > > > > + * It first switches to the temporary page table, then starts to copy the pages
> > > > > > > > > > > + * back to the original memory location. Finally, it jumps to __hibernate_cpu_resume()
> > > > > > > > > > > + * to restore the CPU context.
> > > > > > > > > > > + */
> > > > > > > > > > > +ENTRY(hibernate_core_restore_code)
> > > > > > > > > > > +	/* switch to temp page table. */
> > > > > > > > > > > +	csrw satp, s1
> > > > > > > > > > > +	sfence.vma
> > > > > > > > > > > +.Lcopy:
> > > > > > > > > > > +	/* The below code will restore the hibernated image. */
> > > > > > > > > > > +	REG_L	a1, HIBERN_PBE_ADDR(s4)
> > > > > > > > > > > +	REG_L	a0, HIBERN_PBE_ORIG(s4)
> > > > > > > > > >
> > > > > > > > > > Are we sure restore_pblist will never be NULL?
> > > > > > > > > restore_pblist is a link-list, it will be null during initialization or during page clean up by hibernation core. During the initial
> > > > > > resume
> > > > > > > > process, the hibernation core will check the header and load the pages. If everything works correctly, the page will be linked
> > to
> > > > the
> > > > > > > > restore_pblist and then invoke swsusp_arch_resume() else hibernation core will throws error and failed to resume from
> > the
> > > > > > > > hibernated image.
> > > > > > > >
> > > > > > > > I know restore_pblist is a linked-list and this doesn't answer the
> > > > > > > > question. The comment above restore_pblist says
> > > > > > > >
> > > > > > > > /*
> > > > > > > >  * List of PBEs needed for restoring the pages that were allocated before
> > > > > > > >  * the suspend and included in the suspend image, but have also been
> > > > > > > >  * allocated by the "resume" kernel, so their contents cannot be written
> > > > > > > >  * directly to their "original" page frames.
> > > > > > > >  */
> > > > > > > >
> > > > > > > > which implies the pages that end up on this list are "special". My
> > > > > > > > question is whether or not we're guaranteed to have at least one
> > > > > > > > of these special pages. If not, we shouldn't assume s4 is non-null.
> > > > > > > > If so, then a comment stating why that's guaranteed would be nice.
> > > > > > > The restore_pblist will not be null otherwise swsusp_arch_resume wouldn't get invoked. you can find how the link-list are
> > link
> > > > and
> > > > > > how it checks against validity at https://elixir.bootlin.com/linux/v6.2-rc8/source/kernel/power/snapshot.c . " A comment
> > stating
> > > > why
> > > > > > that's guaranteed would be nice" ? Hmm, perhaps this is out of my scope but I do believe in the page validity checking in the
> > link I
> > > > > > shared.
> > > > > >
> > > > > > Sorry, but pointing to an entire source file (one that I've obviously
> > > > > > already looked at, since I quoted a comment from it...) is not helpful.
> > > > > > I don't see where restore_pblist is being checked before
> > > > > > swsusp_arch_resume() is issued (from its callsite in hibernate.c).
> > > > > Sure, below shows the hibernation flow for your reference. The link-list creation and checking found at:
> > > > https://elixir.bootlin.com/linux/v6.2/source/kernel/power/snapshot.c#L2576
> > > > > software_resume()
> > > > > 	load_image_and_restore()
> > > > > 		swsusp_read()
> > > > > 			load_image()
> > > > >  				snapshot_write_next()
> > > > > 					get_buffer() <-- This is the function checks and links the pages to the restore_pblist
> > > >
> > > > Yup, I've read this path, including get_buffer(), where I saw that
> > > > get_buffer() can return an address without allocating a PBE. Where is the
> > > > check that restore_pblist isn't NULL, i.e. we see that at least one PBE
> > > > has been allocated by get_buffer(), before we call swsusp_arch_resume()?
> > > >
> > > > Or, is known that at least one or more pages match the criteria pointed
> > > > out in the comment below (copied from get_buffer())?
> > > >
> > > >         /*
> > > >          * The "original" page frame has not been allocated and we have to
> > > >          * use a "safe" page frame to store the loaded page.
> > > >          */
> > > >
> > > > If so, then which ones? And where does it state that?
> > > Let's look at the pseudocode below and hope it clears your doubt. restore_pblist depends on safe_pages_list and pbe, and both
> > > pointers are checked. I couldn't find where restore_pblist could be NULL.
> > > 	//Pseudocode to illustrate the image loading
> > > 	initialize restore_pblist to null;
> > > 	initialize safe_pages_list to null;
> > > 	Allocate safe page list, return error if failed;
> > > 	load image;
> > > loop:	Create pbe chain, return error if failed;
> > 
> > This loop pseudocode is incomplete. It's
> > 
> > loop:
> >         if (swsusp_page_is_forbidden(page) && swsusp_page_is_free(page))
> > 	   return page_address(page);
> > 	Create pbe chain, return error if failed;
> > 	...
> > 
> > which I pointed out explicitly in my last reply. Also, as I asked in my
> > last reply (and have been asking four times now, albeit less explicitly
> > the first two times), how do we know at least one PBE will be linked?
> 1 PBE corresponds to 1 page; you shouldn't expect only 1 page to be saved.

I know PBEs correspond to pages. *Why* should I not expect only one page
is saved? Or, more importantly, why should I expect more than zero pages
are saved?

Convincing answers might be because we *always* put the restore code in
pages which get added to the PBE list or that the original page tables
*always* get put in pages which get added to the PBE list. It's not very
convincing to simply *assume* that at least one random page will always
meet the PBE list criteria.

> The hibernation core will do the calculation. If the PBEs (restore_pblist) are linked successfully, the hibernated image will be restored; otherwise a normal boot will take place.
> > Or, even more specifically this time, where is the proof that for each
> > hibernation resume, there exists some page such that
> > !swsusp_page_is_forbidden(page) or !swsusp_page_is_free(page) is true?
> forbidden_pages and free_pages do not contribute to the restore_pblist (as you are already aware from the code). In fact, the forbidden_pages and free_pages are not saved to disk at all.

Exactly, so those pages are *not* going to contribute to the greater than
zero pages. What I've been asking for, from the beginning, is to know
which page(s) are known to *always* contribute to the list. Or, IOW, how
do you know the PBE list isn't empty, a.k.a restore_pblist isn't NULL?
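
If no such guarantee can be stated, then the restore loop should simply
tolerate an empty list. An untested sketch on top of this patch's
hibernate_core_restore_code (the .Ldone label is new) would be something
like:

ENTRY(hibernate_core_restore_code)
	/* switch to temp page table. */
	csrw	satp, s1
	sfence.vma

	/* Nothing to copy if restore_pblist was empty. */
	beqz	s4, .Ldone
.Lcopy:
	REG_L	a1, HIBERN_PBE_ADDR(s4)
	REG_L	a0, HIBERN_PBE_ORIG(s4)

	copy_page a0, a1

	REG_L	s4, HIBERN_PBE_NEXT(s4)
	bnez	s4, .Lcopy

.Ldone:
	jalr	s2
END(hibernate_core_restore_code)

If the guarantee can't be spelled out in a comment, a one-instruction
guard like this seems cheap enough.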

Thanks,
drew

> > 
> > Thanks,
> > drew
> > 
> > > 	assign orig_addr and safe_page to pbe;
> > > 	link pbe to restore_pblist;
> > > 	return pbe to handle->buffer;
> > > 	check handle->buffer;
> > > 	goto loop if no error else return with error;
> > > >
> > > > Thanks,
> > > > drew
> > > >
> > > >
> > > > > 		hibernation_restore()
> > > > > 			resume_target_kernel()
> > > > > 				swsusp_arch_resume()
> > > > > >
> > > > > > Thanks,
> > > > > > drew
Alexandre Ghiti Feb. 27, 2023, 8:31 p.m. UTC | #15
On 2/27/23 04:11, JeeHeng Sia wrote:
>
>> -----Original Message-----
>> From: Alexandre Ghiti <alex@ghiti.fr>
>> Sent: Friday, 24 February, 2023 8:29 PM
>> To: JeeHeng Sia <jeeheng.sia@starfivetech.com>; paul.walmsley@sifive.com; palmer@dabbelt.com; aou@eecs.berkeley.edu
>> Cc: linux-riscv@lists.infradead.org; linux-kernel@vger.kernel.org; Leyfoon Tan <leyfoon.tan@starfivetech.com>; Mason Huo
>> <mason.huo@starfivetech.com>
>> Subject: Re: [PATCH v4 4/4] RISC-V: Add arch functions to support hibernation/suspend-to-disk
>>
>> On 2/21/23 03:35, Sia Jee Heng wrote:
>>> Low level Arch functions were created to support hibernation.
>>> swsusp_arch_suspend() relies code from __cpu_suspend_enter() to write
>>> cpu state onto the stack, then calling swsusp_save() to save the memory
>>> image.
>>>
>>> Arch specific hibernation header is implemented and is utilized by the
>>> arch_hibernation_header_restore() and arch_hibernation_header_save()
>>> functions. The arch specific hibernation header consists of satp, hartid,
>>> and the cpu_resume address. The kernel built version is also need to be
>>> saved into the hibernation image header to making sure only the same
>>> kernel is restore when resume.
>>>
>>> swsusp_arch_resume() creates a temporary page table that covering only
>>> the linear map. It copies the restore code to a 'safe' page, then start
>>> to restore the memory image. Once completed, it restores the original
>>> kernel's page table. It then calls into __hibernate_cpu_resume()
>>> to restore the CPU context. Finally, it follows the normal hibernation
>>> path back to the hibernation core.
>>>
>>> To enable hibernation/suspend to disk into RISCV, the below config
>>> need to be enabled:
>>> - CONFIG_ARCH_HIBERNATION_HEADER
>>> - CONFIG_ARCH_HIBERNATION_POSSIBLE
>>>
>>> Signed-off-by: Sia Jee Heng <jeeheng.sia@starfivetech.com>
>>> Reviewed-by: Ley Foon Tan <leyfoon.tan@starfivetech.com>
>>> Reviewed-by: Mason Huo <mason.huo@starfivetech.com>
>>> ---
>>>    arch/riscv/Kconfig                 |   7 +
>>>    arch/riscv/include/asm/assembler.h |  20 ++
>>>    arch/riscv/include/asm/suspend.h   |  19 ++
>>>    arch/riscv/kernel/Makefile         |   1 +
>>>    arch/riscv/kernel/asm-offsets.c    |   5 +
>>>    arch/riscv/kernel/hibernate-asm.S  |  77 +++++
>>>    arch/riscv/kernel/hibernate.c      | 447 +++++++++++++++++++++++++++++
>>>    7 files changed, 576 insertions(+)
>>>    create mode 100644 arch/riscv/kernel/hibernate-asm.S
>>>    create mode 100644 arch/riscv/kernel/hibernate.c
>>>
>>> diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
>>> index e2b656043abf..4555848a817f 100644
>>> --- a/arch/riscv/Kconfig
>>> +++ b/arch/riscv/Kconfig
>>> @@ -690,6 +690,13 @@ menu "Power management options"
>>>
>>>    source "kernel/power/Kconfig"
>>>
>>> +config ARCH_HIBERNATION_POSSIBLE
>>> +	def_bool y
>>> +
>>> +config ARCH_HIBERNATION_HEADER
>>> +	def_bool y
>>> +	depends on HIBERNATION
>>> +
>>>    endmenu # "Power management options"
>>>
>>>    menu "CPU Power Management"
>>> diff --git a/arch/riscv/include/asm/assembler.h b/arch/riscv/include/asm/assembler.h
>>> index 727a97735493..68c46c0e0ea8 100644
>>> --- a/arch/riscv/include/asm/assembler.h
>>> +++ b/arch/riscv/include/asm/assembler.h
>>> @@ -59,4 +59,24 @@
>>>    		REG_L	s11, (SUSPEND_CONTEXT_REGS + PT_S11)(a0)
>>>    	.endm
>>>
>>> +/*
>>> + * copy_page - copy 1 page (4KB) of data from source to destination
>>> + * @a0 - destination
>>> + * @a1 - source
>>> + */
>>> +	.macro	copy_page a0, a1
>>> +		lui	a2, 0x1
>>> +		add	a2, a2, a0
>>> +1 :
>>> +		REG_L	t0, 0(a1)
>>> +		REG_L	t1, SZREG(a1)
>>> +
>>> +		REG_S	t0, 0(a0)
>>> +		REG_S	t1, SZREG(a0)
>>> +
>>> +		addi	a0, a0, 2 * SZREG
>>> +		addi	a1, a1, 2 * SZREG
>>> +		bne	a2, a0, 1b
>>> +	.endm
>>> +
>>>    #endif	/* __ASM_ASSEMBLER_H */
>>> diff --git a/arch/riscv/include/asm/suspend.h b/arch/riscv/include/asm/suspend.h
>>> index 75419c5ca272..3362da56a9d8 100644
>>> --- a/arch/riscv/include/asm/suspend.h
>>> +++ b/arch/riscv/include/asm/suspend.h
>>> @@ -21,6 +21,11 @@ struct suspend_context {
>>>    #endif
>>>    };
>>>
>>> +/*
>>> + * Used by hibernation core and cleared during resume sequence
>>> + */
>>> +extern int in_suspend;
>>> +
>>>    /* Low-level CPU suspend entry function */
>>>    int __cpu_suspend_enter(struct suspend_context *context);
>>>
>>> @@ -36,4 +41,18 @@ int __cpu_resume_enter(unsigned long hartid, unsigned long context);
>>>    /* Used to save and restore the csr */
>>>    void suspend_save_csrs(struct suspend_context *context);
>>>    void suspend_restore_csrs(struct suspend_context *context);
>>> +
>>> +/* Low-level API to support hibernation */
>>> +int swsusp_arch_suspend(void);
>>> +int swsusp_arch_resume(void);
>>> +int arch_hibernation_header_save(void *addr, unsigned int max_size);
>>> +int arch_hibernation_header_restore(void *addr);
>>> +int __hibernate_cpu_resume(void);
>>> +
>>> +/* Used to resume on the CPU we hibernated on */
>>> +int hibernate_resume_nonboot_cpu_disable(void);
>>> +
>>> +asmlinkage void hibernate_restore_image(unsigned long resume_satp, unsigned long satp_temp,
>>> +					unsigned long cpu_resume);
>>> +asmlinkage int hibernate_core_restore_code(void);
>>>    #endif
>>> diff --git a/arch/riscv/kernel/Makefile b/arch/riscv/kernel/Makefile
>>> index 4cf303a779ab..daab341d55e4 100644
>>> --- a/arch/riscv/kernel/Makefile
>>> +++ b/arch/riscv/kernel/Makefile
>>> @@ -64,6 +64,7 @@ obj-$(CONFIG_MODULES)		+= module.o
>>>    obj-$(CONFIG_MODULE_SECTIONS)	+= module-sections.o
>>>
>>>    obj-$(CONFIG_CPU_PM)		+= suspend_entry.o suspend.o
>>> +obj-$(CONFIG_HIBERNATION)	+= hibernate.o hibernate-asm.o
>>>
>>>    obj-$(CONFIG_FUNCTION_TRACER)	+= mcount.o ftrace.o
>>>    obj-$(CONFIG_DYNAMIC_FTRACE)	+= mcount-dyn.o
>>> diff --git a/arch/riscv/kernel/asm-offsets.c b/arch/riscv/kernel/asm-offsets.c
>>> index df9444397908..d6a75aac1d27 100644
>>> --- a/arch/riscv/kernel/asm-offsets.c
>>> +++ b/arch/riscv/kernel/asm-offsets.c
>>> @@ -9,6 +9,7 @@
>>>    #include <linux/kbuild.h>
>>>    #include <linux/mm.h>
>>>    #include <linux/sched.h>
>>> +#include <linux/suspend.h>
>>>    #include <asm/kvm_host.h>
>>>    #include <asm/thread_info.h>
>>>    #include <asm/ptrace.h>
>>> @@ -116,6 +117,10 @@ void asm_offsets(void)
>>>
>>>    	OFFSET(SUSPEND_CONTEXT_REGS, suspend_context, regs);
>>>
>>> +	OFFSET(HIBERN_PBE_ADDR, pbe, address);
>>> +	OFFSET(HIBERN_PBE_ORIG, pbe, orig_address);
>>> +	OFFSET(HIBERN_PBE_NEXT, pbe, next);
>>> +
>>>    	OFFSET(KVM_ARCH_GUEST_ZERO, kvm_vcpu_arch, guest_context.zero);
>>>    	OFFSET(KVM_ARCH_GUEST_RA, kvm_vcpu_arch, guest_context.ra);
>>>    	OFFSET(KVM_ARCH_GUEST_SP, kvm_vcpu_arch, guest_context.sp);
>>> diff --git a/arch/riscv/kernel/hibernate-asm.S b/arch/riscv/kernel/hibernate-asm.S
>>> new file mode 100644
>>> index 000000000000..846affe4dced
>>> --- /dev/null
>>> +++ b/arch/riscv/kernel/hibernate-asm.S
>>> @@ -0,0 +1,77 @@
>>> +/* SPDX-License-Identifier: GPL-2.0-only */
>>> +/*
>>> + * Hibernation low level support for RISCV.
>>> + *
>>> + * Copyright (C) 2023 StarFive Technology Co., Ltd.
>>> + *
>>> + * Author: Jee Heng Sia <jeeheng.sia@starfivetech.com>
>>> + */
>>> +
>>> +#include <asm/asm.h>
>>> +#include <asm/asm-offsets.h>
>>> +#include <asm/assembler.h>
>>> +#include <asm/csr.h>
>>> +
>>> +#include <linux/linkage.h>
>>> +
>>> +/*
>>> + * int __hibernate_cpu_resume(void)
>>> + * Switch back to the hibernated image's page table prior to restoring the CPU
>>> + * context.
>>> + *
>>> + * Always returns 0
>>> + */
>>> +ENTRY(__hibernate_cpu_resume)
>>> +	/* switch to hibernated image's page table. */
>>> +	csrw CSR_SATP, s0
>>> +	sfence.vma
>>> +
>>> +	REG_L	a0, hibernate_cpu_context
>>> +
>>> +	restore_csr
>>> +	restore_reg
>>> +
>>> +	/* Return zero value. */
>>> +	add	a0, zero, zero
>>> +
>>> +	ret
>>> +END(__hibernate_cpu_resume)
>>> +
>>> +/*
>>> + * Prepare to restore the image.
>>> + * a0: satp of saved page tables.
>>> + * a1: satp of temporary page tables.
>>> + * a2: cpu_resume.
>>> + */
>>> +ENTRY(hibernate_restore_image)
>>> +	mv	s0, a0
>>> +	mv	s1, a1
>>> +	mv	s2, a2
>>> +	REG_L	s4, restore_pblist
>>> +	REG_L	a1, relocated_restore_code
>>> +
>>> +	jalr	a1
>>> +END(hibernate_restore_image)
>>> +
>>> +/*
>>> + * The below code will be executed from a 'safe' page.
>>> + * It first switches to the temporary page table, then starts to copy the pages
>>> + * back to the original memory location. Finally, it jumps to __hibernate_cpu_resume()
>>> + * to restore the CPU context.
>>> + */
>>> +ENTRY(hibernate_core_restore_code)
>>> +	/* switch to temp page table. */
>>> +	csrw satp, s1
>>> +	sfence.vma
>>> +.Lcopy:
>>> +	/* The below code will restore the hibernated image. */
>>> +	REG_L	a1, HIBERN_PBE_ADDR(s4)
>>> +	REG_L	a0, HIBERN_PBE_ORIG(s4)
>>> +
>>> +	copy_page a0, a1
>>> +
>>> +	REG_L	s4, HIBERN_PBE_NEXT(s4)
>>> +	bnez	s4, .Lcopy
>>> +
>>> +	jalr	s2
>>> +END(hibernate_core_restore_code)
>>> diff --git a/arch/riscv/kernel/hibernate.c b/arch/riscv/kernel/hibernate.c
>>> new file mode 100644
>>> index 000000000000..46a2f470db6e
>>> --- /dev/null
>>> +++ b/arch/riscv/kernel/hibernate.c
>>> @@ -0,0 +1,447 @@
>>> +// SPDX-License-Identifier: GPL-2.0-only
>>> +/*
>>> + * Hibernation support for RISCV
>>> + *
>>> + * Copyright (C) 2023 StarFive Technology Co., Ltd.
>>> + *
>>> + * Author: Jee Heng Sia <jeeheng.sia@starfivetech.com>
>>> + */
>>> +
>>> +#include <asm/barrier.h>
>>> +#include <asm/cacheflush.h>
>>> +#include <asm/mmu_context.h>
>>> +#include <asm/page.h>
>>> +#include <asm/pgalloc.h>
>>> +#include <asm/pgtable.h>
>>> +#include <asm/sections.h>
>>> +#include <asm/set_memory.h>
>>> +#include <asm/smp.h>
>>> +#include <asm/suspend.h>
>>> +
>>> +#include <linux/cpu.h>
>>> +#include <linux/memblock.h>
>>> +#include <linux/pm.h>
>>> +#include <linux/sched.h>
>>> +#include <linux/suspend.h>
>>> +#include <linux/utsname.h>
>>> +
>>> +/* The logical cpu number we should resume on, initialised to a non-cpu number. */
>>> +static int sleep_cpu = -EINVAL;
>>> +
>>> +/* Pointer to the temporary resume page table. */
>>> +static pgd_t *resume_pg_dir;
>>> +
>>> +/* CPU context to be saved. */
>>> +struct suspend_context *hibernate_cpu_context;
>>> +EXPORT_SYMBOL_GPL(hibernate_cpu_context);
>>> +
>>> +unsigned long relocated_restore_code;
>>> +EXPORT_SYMBOL_GPL(relocated_restore_code);
>>> +
>>> +/**
>>> + * struct arch_hibernate_hdr_invariants - container to store kernel build version.
>>> + * @uts_version: to save the build number and date so that we do not resume with
>>> + *		a different kernel.
>>> + */
>>> +struct arch_hibernate_hdr_invariants {
>>> +	char		uts_version[__NEW_UTS_LEN + 1];
>>> +};
>>> +
>>> +/**
>>> + * struct arch_hibernate_hdr - helper parameters that help us to restore the image.
>>> + * @invariants: container to store kernel build version.
>>> + * @hartid: to make sure the same boot CPU executes the hibernate/restore code.
>>> + * @saved_satp: original page table used by the hibernated image.
>>> + * @restore_cpu_addr: the kernel's image address to restore the CPU context.
>>> + */
>>> +static struct arch_hibernate_hdr {
>>> +	struct arch_hibernate_hdr_invariants invariants;
>>> +	unsigned long	hartid;
>>> +	unsigned long	saved_satp;
>>> +	unsigned long	restore_cpu_addr;
>>> +} resume_hdr;
>>> +
>>> +static inline void arch_hdr_invariants(struct arch_hibernate_hdr_invariants *i)
>>> +{
>>> +	memset(i, 0, sizeof(*i));
>>> +	memcpy(i->uts_version, init_utsname()->version, sizeof(i->uts_version));
>>> +}
>>> +
>>> +/*
>>> + * Check if the given pfn is in the 'nosave' section.
>>> + */
>>> +int pfn_is_nosave(unsigned long pfn)
>>> +{
>>> +	unsigned long nosave_begin_pfn = sym_to_pfn(&__nosave_begin);
>>> +	unsigned long nosave_end_pfn = sym_to_pfn(&__nosave_end - 1);
>>> +
>>> +	return ((pfn >= nosave_begin_pfn) && (pfn <= nosave_end_pfn));
>>> +}
>>> +
>>> +void notrace save_processor_state(void)
>>> +{
>>> +	WARN_ON(num_online_cpus() != 1);
>>> +}
>>> +
>>> +void notrace restore_processor_state(void)
>>> +{
>>> +}
>>> +
>>> +/*
>>> + * Helper parameters need to be saved to the hibernation image header.
>>> + */
>>> +int arch_hibernation_header_save(void *addr, unsigned int max_size)
>>> +{
>>> +	struct arch_hibernate_hdr *hdr = addr;
>>> +
>>> +	if (max_size < sizeof(*hdr))
>>> +		return -EOVERFLOW;
>>> +
>>> +	arch_hdr_invariants(&hdr->invariants);
>>> +
>>> +	hdr->hartid = cpuid_to_hartid_map(sleep_cpu);
>>> +	hdr->saved_satp = csr_read(CSR_SATP);
>>> +	hdr->restore_cpu_addr = (unsigned long)__hibernate_cpu_resume;
>>> +
>>> +	return 0;
>>> +}
>>> +EXPORT_SYMBOL_GPL(arch_hibernation_header_save);
>>> +
>>> +/*
>>> + * Retrieve the helper parameters from the hibernation image header.
>>> + */
>>> +int arch_hibernation_header_restore(void *addr)
>>> +{
>>> +	struct arch_hibernate_hdr_invariants invariants;
>>> +	struct arch_hibernate_hdr *hdr = addr;
>>> +	int ret = 0;
>>> +
>>> +	arch_hdr_invariants(&invariants);
>>> +
>>> +	if (memcmp(&hdr->invariants, &invariants, sizeof(invariants))) {
>>> +		pr_crit("Hibernate image not generated by this kernel!\n");
>>> +		return -EINVAL;
>>> +	}
>>> +
>>> +	sleep_cpu = riscv_hartid_to_cpuid(hdr->hartid);
>>> +	if (sleep_cpu < 0) {
>>> +		pr_crit("Hibernated on a CPU not known to this kernel!\n");
>>> +		sleep_cpu = -EINVAL;
>>> +		return -EINVAL;
>>> +	}
>>> +
>>> +#ifdef CONFIG_SMP
>>> +	ret = bringup_hibernate_cpu(sleep_cpu);
>>> +	if (ret) {
>>> +		sleep_cpu = -EINVAL;
>>> +		return ret;
>>> +	}
>>> +#endif
>>> +	resume_hdr = *hdr;
>>> +
>>> +	return ret;
>>> +}
>>> +EXPORT_SYMBOL_GPL(arch_hibernation_header_restore);
>>> +
>>> +int swsusp_arch_suspend(void)
>>> +{
>>> +	int ret = 0;
>>> +
>>> +	if (__cpu_suspend_enter(hibernate_cpu_context)) {
>>> +		sleep_cpu = smp_processor_id();
>>> +		suspend_save_csrs(hibernate_cpu_context);
>>> +		ret = swsusp_save();
>>> +	} else {
>>> +		suspend_restore_csrs(hibernate_cpu_context);
>>> +		flush_tlb_all();
>>> +		flush_icache_all();
>>> +
>>> +		/*
>>> +		 * Tell the hibernation core that we've just restored the memory.
>>> +		 */
>>> +		in_suspend = 0;
>>> +		sleep_cpu = -EINVAL;
>>> +	}
>>> +
>>> +	return ret;
>>> +}
>>> +
>>> +static unsigned long _temp_pgtable_map_pte(pte_t *dst_ptep, pte_t *src_ptep,
>>> +					   unsigned long addr, pgprot_t prot)
>>> +{
>>> +	pte_t pte = READ_ONCE(*src_ptep);
>>> +
>>> +	if (pte_present(pte))
>>> +		set_pte(dst_ptep, __pte(pte_val(pte) | pgprot_val(prot)));
>>> +
>>> +	return 0;
>>> +}
>>
>> I don't see the need for this function
> Sure, can remove it.
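
For the record, what I had in mind is just folding the pte copy into the loop, something like this (untested):

	do {
		pte_t pte = READ_ONCE(*src_ptep);

		if (pte_present(pte))
			set_pte(dst_ptep, __pte(pte_val(pte) | pgprot_val(prot)));
	} while (dst_ptep++, src_ptep++, addr += PAGE_SIZE, addr < end);
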
>>
>>> +
>>> +static unsigned long temp_pgtable_map_pte(pmd_t *dst_pmdp, pmd_t *src_pmdp,
>>> +					  unsigned long start, unsigned long end,
>>> +					  pgprot_t prot)
>>> +{
>>> +	unsigned long addr = start;
>>> +	pte_t *src_ptep;
>>> +	pte_t *dst_ptep;
>>> +
>>> +	if (pmd_none(READ_ONCE(*dst_pmdp))) {
>>> +		dst_ptep = (pte_t *)get_safe_page(GFP_ATOMIC);
>>> +		if (!dst_ptep)
>>> +			return -ENOMEM;
>>> +
>>> +		pmd_populate_kernel(NULL, dst_pmdp, dst_ptep);
>>> +	}
>>> +
>>> +	dst_ptep = pte_offset_kernel(dst_pmdp, start);
>>> +	src_ptep = pte_offset_kernel(src_pmdp, start);
>>> +
>>> +	do {
>>> +		_temp_pgtable_map_pte(dst_ptep, src_ptep, addr, prot);
>>> +	} while (dst_ptep++, src_ptep++, addr += PAGE_SIZE, addr < end);
>>> +
>>> +	return 0;
>>> +}
>>> +
>>> +static unsigned long temp_pgtable_map_pmd(pud_t *dst_pudp, pud_t *src_pudp,
>>> +					  unsigned long start, unsigned long end,
>>> +					  pgprot_t prot)
>>> +{
>>> +	unsigned long addr = start;
>>> +	unsigned long next;
>>> +	unsigned long ret;
>>> +	pmd_t *src_pmdp;
>>> +	pmd_t *dst_pmdp;
>>> +
>>> +	if (pud_none(READ_ONCE(*dst_pudp))) {
>>> +		dst_pmdp = (pmd_t *)get_safe_page(GFP_ATOMIC);
>>> +		if (!dst_pmdp)
>>> +			return -ENOMEM;
>>> +
>>> +		pud_populate(NULL, dst_pudp, dst_pmdp);
>>> +	}
>>> +
>>> +	dst_pmdp = pmd_offset(dst_pudp, start);
>>> +	src_pmdp = pmd_offset(src_pudp, start);
>>> +
>>> +	do {
>>> +		pmd_t pmd = READ_ONCE(*src_pmdp);
>>> +
>>> +		next = pmd_addr_end(addr, end);
>>> +
>>> +		if (pmd_none(pmd))
>>> +			continue;
>>> +
>>> +		if (pmd_leaf(pmd)) {
>>> +			set_pmd(dst_pmdp, __pmd(pmd_val(pmd) | pgprot_val(prot)));
>>> +		} else {
>>> +			ret = temp_pgtable_map_pte(dst_pmdp, src_pmdp, addr, next, prot);
>>> +			if (ret)
>>> +				return -ENOMEM;
>>> +		}
>>> +	} while (dst_pmdp++, src_pmdp++, addr = next, addr != end);
>>> +
>>> +	return 0;
>>> +}
>>> +
>>> +static unsigned long temp_pgtable_map_pud(p4d_t *dst_p4dp, p4d_t *src_p4dp,
>>> +					  unsigned long start,
>>> +					  unsigned long end, pgprot_t prot)
>>> +{
>>> +	unsigned long addr = start;
>>> +	unsigned long next;
>>> +	unsigned long ret;
>>> +	pud_t *dst_pudp;
>>> +	pud_t *src_pudp;
>>> +
>>> +	if (p4d_none(READ_ONCE(*dst_p4dp))) {
>>> +		dst_pudp = (pud_t *)get_safe_page(GFP_ATOMIC);
>>> +		if (!dst_pudp)
>>> +			return -ENOMEM;
>>> +
>>> +		p4d_populate(NULL, dst_p4dp, dst_pudp);
>>> +	}
>>> +
>>> +	dst_pudp = pud_offset(dst_p4dp, start);
>>> +	src_pudp = pud_offset(src_p4dp, start);
>>> +
>>> +	do {
>>> +		pud_t pud = READ_ONCE(*src_pudp);
>>> +
>>> +		next = pud_addr_end(addr, end);
>>> +
>>> +		if (pud_none(pud))
>>> +			continue;
>>> +
>>> +		if (pud_leaf(pud)) {
>>> +			set_pud(dst_pudp, __pud(pud_val(pud) | pgprot_val(prot)));
>>> +		} else {
>>> +			ret = temp_pgtable_map_pmd(dst_pudp, src_pudp, addr, next, prot);
>>> +			if (ret)
>>> +				return -ENOMEM;
>>> +		}
>>> +	} while (dst_pudp++, src_pudp++, addr = next, addr != end);
>>> +
>>> +	return 0;
>>> +}
>>> +
>>> +static unsigned long temp_pgtable_map_p4d(pgd_t *dst_pgdp, pgd_t *src_pgdp,
>>> +					  unsigned long start, unsigned long end,
>>> +					  pgprot_t prot)
>>> +{
>>> +	unsigned long addr = start;
>>
>> Nit: you don't need the addr variable, you can rename start into addr
>> and directly work with it.
> sure.
>>
>>> +	unsigned long next;
>>> +	unsigned long ret;
>>> +	p4d_t *dst_p4dp;
>>> +	p4d_t *src_p4dp;
>>> +
>>> +	if (pgd_none(READ_ONCE(*dst_pgdp))) {
>>> +		dst_p4dp = (p4d_t *)get_safe_page(GFP_ATOMIC);
>>> +		if (!dst_p4dp)
>>> +			return -ENOMEM;
>>> +
>>> +		pgd_populate(NULL, dst_pgdp, dst_p4dp);
>>> +	}
>>> +
>>> +	dst_p4dp = p4d_offset(dst_pgdp, start);
>>> +	src_p4dp = p4d_offset(src_pgdp, start);
>>> +
>>> +	do {
>>> +		p4d_t p4d = READ_ONCE(*src_p4dp);
>>> +
>>> +		next = p4d_addr_end(addr, end);
>>> +
>>> +		if (p4d_none(READ_ONCE(*src_p4dp)))
>>
>> You should use p4d here: p4d_none(p4d)
> sure
>>
>>> +			continue;
>>> +
>>> +		if (p4d_leaf(p4d)) {
>>> +			set_p4d(dst_p4dp, __p4d(p4d_val(p4d) | pgprot_val(prot)));
>>
>> The "| pgprot_val(prot)" happens to work because PAGE_KERNEL will add
>> the PAGE_WRITE bit: I'd rather make it clearer by explicitly adding
>> PAGE_WRITE.
> sure, this can be done.
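
For example (untested sketch, assuming the _PAGE_WRITE bit macro is the right way to spell it out here):

	set_p4d(dst_p4dp, __p4d(p4d_val(p4d) | pgprot_val(prot) | _PAGE_WRITE));

and the same pattern at the pud/pmd levels.
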
>>
>>> +		} else {
>>> +			ret = temp_pgtable_map_pud(dst_p4dp, src_p4dp, addr, next, prot);
>>> +			if (ret)
>>> +				return -ENOMEM;
>>> +		}
>>> +	} while (dst_p4dp++, src_p4dp++, addr = next, addr != end);
>>> +
>>> +	return 0;
>>> +}
>>> +
>>> +static unsigned long temp_pgtable_mapping(pgd_t *pgdp)
>>> +{
>>> +	unsigned long end = (unsigned long)pfn_to_virt(max_low_pfn);
>>> +	unsigned long addr = PAGE_OFFSET;
>>> +	pgd_t *dst_pgdp = pgd_offset_pgd(pgdp, addr);
>>> +	pgd_t *src_pgdp = pgd_offset_k(addr);
>>> +	unsigned long next;
>>> +
>>> +	do {
>>> +		next = pgd_addr_end(addr, end);
>>> +		if (pgd_none(READ_ONCE(*src_pgdp)))
>>> +			continue;
>>> +
>>
>> We added the pgd_leaf test in kernel_page_present, let's add it here too.
> sure.
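
Roughly mirroring what the other levels already do, i.e. something like (untested):

	do {
		pgd_t pgd = READ_ONCE(*src_pgdp);

		next = pgd_addr_end(addr, end);

		if (pgd_none(pgd))
			continue;

		if (pgd_leaf(pgd)) {
			set_pgd(dst_pgdp, __pgd(pgd_val(pgd) | pgprot_val(PAGE_KERNEL)));
		} else {
			if (temp_pgtable_map_p4d(dst_pgdp, src_pgdp, addr, next, PAGE_KERNEL))
				return -ENOMEM;
		}
	} while (dst_pgdp++, src_pgdp++, addr = next, addr != end);
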
>>
>>> +		if (temp_pgtable_map_p4d(dst_pgdp, src_pgdp, addr, next, PAGE_KERNEL))
>>> +			return -ENOMEM;
>>> +	} while (dst_pgdp++, src_pgdp++, addr = next, addr != end);
>>> +
>>> +	return 0;
>>> +}
>>> +
>>> +static unsigned long temp_pgtable_text_mapping(pgd_t *pgdp, unsigned long addr)
>>> +{
>>> +	pgd_t *dst_pgdp = pgd_offset_pgd(pgdp, addr);
>>> +	pgd_t *src_pgdp = pgd_offset_k(addr);
>>> +
>>> +	if (pgd_none(READ_ONCE(*src_pgdp)))
>>> +		return -EFAULT;
>>> +
>>> +	if (temp_pgtable_map_p4d(dst_pgdp, src_pgdp, addr, addr, PAGE_KERNEL_EXEC))
>>> +		return -ENOMEM;
>>> +
>>> +	return 0;
>>> +}
>>
>> Ok so if we fall into a huge mapping, you add the exec permission to the
>> whole range, which could easily be 1GB. I think that either we can avoid
>> this step by mapping the whole linear mapping as executable, or we
>> actually use another pgd entry for this page that is not in the linear
>> mapping. The latter seems cleaner, what do you think?
> we can map the whole linear address range as writable & executable; by doing this, we can avoid the remapping at the linear map again.
> we still need to use the same pgd entry for the non-linear mapping just like how it did at the swapper_pg_dir (linear and non-linear addr are within the range supported by the pgd).


Not sure I understand your last sentence


>>
>>> +
>>> +static unsigned long relocate_restore_code(void)
>>> +{
>>> +	unsigned long ret;
>>> +	void *page = (void *)get_safe_page(GFP_ATOMIC);
>>> +
>>> +	if (!page)
>>> +		return -ENOMEM;
>>> +
>>> +	copy_page(page, hibernate_core_restore_code);
>>> +
>>> +	/* Make the page containing the relocated code executable. */
>>> +	set_memory_x((unsigned long)page, 1);
>>> +
>>> +	ret = temp_pgtable_text_mapping(resume_pg_dir, (unsigned long)page);
>>> +	if (ret)
>>> +		return ret;
>>> +
>>> +	return (unsigned long)page;
>>> +}
>>> +
>>> +int swsusp_arch_resume(void)
>>> +{
>>> +	unsigned long ret;
>>> +
>>> +	/*
>>> +	 * Memory allocated by get_safe_page() will be dealt with by the hibernation core, so
>>> +	 * we don't need to free it here.
>>> +	 */
>>> +	resume_pg_dir = (pgd_t *)get_safe_page(GFP_ATOMIC);
>>> +	if (!resume_pg_dir)
>>> +		return -ENOMEM;
>>> +
>>> +	/*
>>> +	 * The pages need to be writable when restoring the image.
>>> +	 * Create a second copy of the page table just for the linear map.
>>> +	 * Use this temporary page table to restore the image.
>>> +	 */
>>> +	ret = temp_pgtable_mapping(resume_pg_dir);
>>> +	if (ret)
>>> +		return (int)ret;
>>
>> The temp_pgtable* functions should return an int to avoid this cast.


Did you note this comment too?
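
To be explicit: I mean simply having the temp_pgtable_* helpers return int, so that swsusp_arch_resume() can do something like:

	int ret;

	ret = temp_pgtable_mapping(resume_pg_dir);
	if (ret)
		return ret;
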


>>
>>
>>> +
>>> +	/* Move the restore code to a new page so that it doesn't get overwritten by itself. */
>>> +	relocated_restore_code = relocate_restore_code();
>>> +	if (relocated_restore_code == -ENOMEM)
>>> +		return -ENOMEM;
>>> +
>>> +	/*
>>> +	 * Map the __hibernate_cpu_resume() address to the temporary page table so that the
>>> +	 * restore code can jump to it after it has finished restoring the image. This way, the
>>> +	 * code that runs next doesn't find itself in a different address space after switching
>>> +	 * over to the original page table used by the hibernated image.
>>> +	 */
>>> +	ret = temp_pgtable_text_mapping(resume_pg_dir, (unsigned long)resume_hdr.restore_cpu_addr);
>>> +	if (ret)
>>> +		return ret;
>>> +
>>> +	hibernate_restore_image(resume_hdr.saved_satp, (PFN_DOWN(__pa(resume_pg_dir)) | satp_mode),
>>> +				resume_hdr.restore_cpu_addr);
>>> +
>>> +	return 0;
>>> +}
>>> +
>>> +#ifdef CONFIG_PM_SLEEP_SMP
>>> +int hibernate_resume_nonboot_cpu_disable(void)
>>> +{
>>> +	if (sleep_cpu < 0) {
>>> +		pr_err("Failing to resume from hibernate on an unknown CPU\n");
>>> +		return -ENODEV;
>>> +	}
>>> +
>>> +	return freeze_secondary_cpus(sleep_cpu);
>>> +}
>>> +#endif
>>> +
>>> +static int __init riscv_hibernate_init(void)
>>> +{
>>> +	hibernate_cpu_context = kzalloc(sizeof(*hibernate_cpu_context), GFP_KERNEL);
>>> +
>>> +	if (WARN_ON(!hibernate_cpu_context))
>>> +		return -ENOMEM;
>>> +
>>> +	return 0;
>>> +}
>>> +
>>> +early_initcall(riscv_hibernate_init);
>>
>> Overall, it is now nicer with the proper page table walk: but we can
>> now see that the code is exactly the same as arm64, what prevents us
>> from merging both somewhere in mm/?
> 1. low level page table bit definition not the same
> 2. Need to refactor code for both riscv and arm64
> 3. Need to verify the solution for both riscv and arm64 platforms (need someone with expertise on arm64)
> 4. Might need to extend the function to support other arch
> 5. Overall, it is do-able but the effort to support the above matters is huge.


Too bad, I really see benefits of avoiding code duplication, but that's 
up to you.

Thanks,

Alex


>
Sia Jee Heng Feb. 28, 2023, 1:20 a.m. UTC | #16
> -----Original Message-----
> From: Alexandre Ghiti <alex@ghiti.fr>
> Sent: Tuesday, 28 February, 2023 4:32 AM
> To: JeeHeng Sia <jeeheng.sia@starfivetech.com>; paul.walmsley@sifive.com; palmer@dabbelt.com; aou@eecs.berkeley.edu
> Cc: linux-riscv@lists.infradead.org; linux-kernel@vger.kernel.org; Leyfoon Tan <leyfoon.tan@starfivetech.com>; Mason Huo
> <mason.huo@starfivetech.com>
> Subject: Re: [PATCH v4 4/4] RISC-V: Add arch functions to support hibernation/suspend-to-disk
> 
> 
> On 2/27/23 04:11, JeeHeng Sia wrote:
> >
> > [...]
> >
> >>> +static unsigned long temp_pgtable_text_mapping(pgd_t *pgdp, unsigned long addr)
> >>> +{
> >>> +	pgd_t *dst_pgdp = pgd_offset_pgd(pgdp, addr);
> >>> +	pgd_t *src_pgdp = pgd_offset_k(addr);
> >>> +
> >>> +	if (pgd_none(READ_ONCE(*src_pgdp)))
> >>> +		return -EFAULT;
> >>> +
> >>> +	if (temp_pgtable_map_p4d(dst_pgdp, src_pgdp, addr, addr, PAGE_KERNEL_EXEC))
> >>> +		return -ENOMEM;
> >>> +
> >>> +	return 0;
> >>> +}
> >>
> >> Ok so if we fall into a huge mapping, you add the exec permission to the
> >> whole range, which could easily be 1GB. I think that either we can avoid
> >> this step by mapping the whole linear mapping as executable, or we
> >> actually use another pgd entry for this page that is not in the linear
> >> mapping. The latter seems cleaner, what do you think?
> > we can map the whole linear address range as writable & executable; by doing this, we can avoid the remapping at the linear map again.
> > we still need to use the same pgd entry for the non-linear mapping just like how it did at the swapper_pg_dir (linear and non-linear
> > addr are within the range supported by the pgd).
> 
> 
> Not sure I understand your last sentence
I mean the same pgd entry can be used for both the linear and non-linear addresses; we don't have to create another pgd entry the way it is done for process handling.
>
> >>> [...]
> >>> +	ret = temp_pgtable_mapping(resume_pg_dir);
> >>> +	if (ret)
> >>> +		return (int)ret;
> >>
> >> The temp_pgtable* functions should return an int to avoid this cast.
> 
> 
> Did you note this comment too?
Oops, missed this comment. Thanks for pointing it out. Sure, it can be done.
>
> >>> [...]
> >>
> >> Overall, it is now nicer with the proper page table walk: but we can
> >> now see that the code is exactly the same as arm64, what prevents us
> >> from merging both somewhere in mm/?
> > 1. low level page table bit definition not the same
> > 2. Need to refactor code for both riscv and arm64
> > 3. Need to verify the solution for both riscv and arm64 platforms (need someone with expertise on arm64)
> > 4. Might need to extend the function to support other arch
> > 5. Overall, it is do-able but the effort to support the above matters is huge.
> 
> 
> Too bad, I really see benefits of avoiding code duplication, but that's
> up to you.
Sure, I do see the benefit, but I also see the effort needed. Perhaps I can put it on my todo list and work it out with you in the near future, but certainly not in this patch series.
> 
> Thanks,
> 
> Alex
> 
> 
> >
Sia Jee Heng Feb. 28, 2023, 1:32 a.m. UTC | #17
> -----Original Message-----
> From: Andrew Jones <ajones@ventanamicro.com>
> Sent: Monday, 27 February, 2023 7:45 PM
> To: JeeHeng Sia <jeeheng.sia@starfivetech.com>
> Cc: paul.walmsley@sifive.com; palmer@dabbelt.com; aou@eecs.berkeley.edu; linux-riscv@lists.infradead.org; linux-
> kernel@vger.kernel.org; Leyfoon Tan <leyfoon.tan@starfivetech.com>; Mason Huo <mason.huo@starfivetech.com>
> Subject: Re: [PATCH v4 4/4] RISC-V: Add arch functions to support hibernation/suspend-to-disk
> 
> On Mon, Feb 27, 2023 at 10:52:32AM +0000, JeeHeng Sia wrote:
> >
> >
> > > -----Original Message-----
> > > From: Andrew Jones <ajones@ventanamicro.com>
> > > Sent: Monday, 27 February, 2023 4:00 PM
> > > To: JeeHeng Sia <jeeheng.sia@starfivetech.com>
> > > Cc: paul.walmsley@sifive.com; palmer@dabbelt.com; aou@eecs.berkeley.edu; linux-riscv@lists.infradead.org; linux-
> > > kernel@vger.kernel.org; Leyfoon Tan <leyfoon.tan@starfivetech.com>; Mason Huo <mason.huo@starfivetech.com>
> > > Subject: Re: [PATCH v4 4/4] RISC-V: Add arch functions to support hibernation/suspend-to-disk
> > >
> > > On Mon, Feb 27, 2023 at 02:14:27AM +0000, JeeHeng Sia wrote:
> > > >
> > > >
> > > > > -----Original Message-----
> > > > > From: Andrew Jones <ajones@ventanamicro.com>
> > > > > Sent: Friday, 24 February, 2023 8:07 PM
> > > > > To: JeeHeng Sia <jeeheng.sia@starfivetech.com>
> > > > > Cc: paul.walmsley@sifive.com; palmer@dabbelt.com; aou@eecs.berkeley.edu; linux-riscv@lists.infradead.org; linux-
> > > > > kernel@vger.kernel.org; Leyfoon Tan <leyfoon.tan@starfivetech.com>; Mason Huo <mason.huo@starfivetech.com>
> > > > > Subject: Re: [PATCH v4 4/4] RISC-V: Add arch functions to support hibernation/suspend-to-disk
> > > > >
> > > > > On Fri, Feb 24, 2023 at 10:30:19AM +0000, JeeHeng Sia wrote:
> > > > > >
> > > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: Andrew Jones <ajones@ventanamicro.com>
> > > > > > > Sent: Friday, 24 February, 2023 5:55 PM
> > > > > > > To: JeeHeng Sia <jeeheng.sia@starfivetech.com>
> > > > > > > Cc: paul.walmsley@sifive.com; palmer@dabbelt.com; aou@eecs.berkeley.edu; linux-riscv@lists.infradead.org; linux-
> > > > > > > kernel@vger.kernel.org; Leyfoon Tan <leyfoon.tan@starfivetech.com>; Mason Huo <mason.huo@starfivetech.com>
> > > > > > > Subject: Re: [PATCH v4 4/4] RISC-V: Add arch functions to support hibernation/suspend-to-disk
> > > > > > >
> > > > > > > On Fri, Feb 24, 2023 at 09:33:31AM +0000, JeeHeng Sia wrote:
> > > > > > > >
> > > > > > > >
> > > > > > > > > -----Original Message-----
> > > > > > > > > From: Andrew Jones <ajones@ventanamicro.com>
> > > > > > > > > Sent: Friday, 24 February, 2023 5:00 PM
> > > > > > > > > To: JeeHeng Sia <jeeheng.sia@starfivetech.com>
> > > > > > > > > Cc: paul.walmsley@sifive.com; palmer@dabbelt.com; aou@eecs.berkeley.edu; linux-riscv@lists.infradead.org; linux-
> > > > > > > > > kernel@vger.kernel.org; Leyfoon Tan <leyfoon.tan@starfivetech.com>; Mason Huo <mason.huo@starfivetech.com>
> > > > > > > > > Subject: Re: [PATCH v4 4/4] RISC-V: Add arch functions to support hibernation/suspend-to-disk
> > > > > > > > >
> > > > > > > > > On Fri, Feb 24, 2023 at 02:05:43AM +0000, JeeHeng Sia wrote:
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > -----Original Message-----
> > > > > > > > > > > From: Andrew Jones <ajones@ventanamicro.com>
> > > > > > > > > > > Sent: Friday, 24 February, 2023 2:07 AM
> > > > > > > > > > > To: JeeHeng Sia <jeeheng.sia@starfivetech.com>
> > > > > > > > > > > Cc: paul.walmsley@sifive.com; palmer@dabbelt.com; aou@eecs.berkeley.edu; linux-riscv@lists.infradead.org;
> linux-
> > > > > > > > > > > kernel@vger.kernel.org; Leyfoon Tan <leyfoon.tan@starfivetech.com>; Mason Huo <mason.huo@starfivetech.com>
> > > > > > > > > > > Subject: Re: [PATCH v4 4/4] RISC-V: Add arch functions to support hibernation/suspend-to-disk
> > > > > > > > > > >
> > > > > > > > > > > On Tue, Feb 21, 2023 at 10:35:23AM +0800, Sia Jee Heng wrote:
> > > > > > > > > > > > [...]
> > > > > > > > > > > >
> > > > > > > > > > > > diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
> > > > > > > > > > > > index e2b656043abf..4555848a817f 100644
> > > > > > > > > > > > --- a/arch/riscv/Kconfig
> > > > > > > > > > > > +++ b/arch/riscv/Kconfig
> > > > > > > > > > > > @@ -690,6 +690,13 @@ menu "Power management options"
> > > > > > > > > > > >
> > > > > > > > > > > >  source "kernel/power/Kconfig"
> > > > > > > > > > > >
> > > > > > > > > > > > +config ARCH_HIBERNATION_POSSIBLE
> > > > > > > > > > > > +	def_bool y
> > > > > > > > > > > > +
> > > > > > > > > > > > +config ARCH_HIBERNATION_HEADER
> > > > > > > > > > > > +	def_bool y
> > > > > > > > > > > > +	depends on HIBERNATION
> > > > > > > > > > >
> > > > > > > > > > > nit: I think this can be simplified as def_bool HIBERNATION
> > > > > > > > > > good suggestion. will change it.
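
(Noted, for v5 this should then read simply:

	config ARCH_HIBERNATION_HEADER
		def_bool HIBERNATION
)
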
> > > > > > > > > > >
> > > > > > > > > > > > +
> > > > > > > > > > > >  endmenu # "Power management options"
> > > > > > > > > > > >
> > > > > > > > > > > >  menu "CPU Power Management"
> > > > > > > > > > > > diff --git a/arch/riscv/include/asm/assembler.h b/arch/riscv/include/asm/assembler.h
> > > > > > > > > > > > index 727a97735493..68c46c0e0ea8 100644
> > > > > > > > > > > > --- a/arch/riscv/include/asm/assembler.h
> > > > > > > > > > > > +++ b/arch/riscv/include/asm/assembler.h
> > > > > > > > > > > > @@ -59,4 +59,24 @@
> > > > > > > > > > > >  		REG_L	s11, (SUSPEND_CONTEXT_REGS + PT_S11)(a0)
> > > > > > > > > > > >  	.endm
> > > > > > > > > > > >
> > > > > > > > > > > > +/*
> > > > > > > > > > > > + * copy_page - copy 1 page (4KB) of data from source to destination
> > > > > > > > > > > > + * @a0 - destination
> > > > > > > > > > > > + * @a1 - source
> > > > > > > > > > > > + */
> > > > > > > > > > > > +	.macro	copy_page a0, a1
> > > > > > > > > > > > +		lui	a2, 0x1
> > > > > > > > > > > > +		add	a2, a2, a0
> > > > > > > > > > > > +1 :
> > > > > > > > > > >     ^ please remove this space
> > > > > > > > > > can't remove it, otherwise checkpatch will throw ERROR: spaces required around that ':'
> > > > > > > > >
> > > > > > > > > Oh, right, labels in macros have this requirement.
> > > > > > > > >
> > > > > > > > > > > > [...]
> > > > > > > > > > > >
> > > > > > > > > > > > +/*
> > > > > > > > > > > > + * int __hibernate_cpu_resume(void)
> > > > > > > > > > > > + * Switch back to the hibernated image's page table prior to restoring the CPU
> > > > > > > > > > > > + * context.
> > > > > > > > > > > > + *
> > > > > > > > > > > > + * Always returns 0
> > > > > > > > > > > > + */
> > > > > > > > > > > > +ENTRY(__hibernate_cpu_resume)
> > > > > > > > > > > > +	/* switch to hibernated image's page table. */
> > > > > > > > > > > > +	csrw CSR_SATP, s0
> > > > > > > > > > > > +	sfence.vma
> > > > > > > > > > > > +
> > > > > > > > > > > > +	REG_L	a0, hibernate_cpu_context
> > > > > > > > > > > > +
> > > > > > > > > > > > +	restore_csr
> > > > > > > > > > > > +	restore_reg
> > > > > > > > > > > > +
> > > > > > > > > > > > +	/* Return zero value. */
> > > > > > > > > > > > +	add	a0, zero, zero
> > > > > > > > > > >
> > > > > > > > > > > nit: mv a0, zero
> > > > > > > > > > sure
> > > > > > > > > > >
> > > > > > > > > > > > +
> > > > > > > > > > > > +	ret
> > > > > > > > > > > > +END(__hibernate_cpu_resume)
> > > > > > > > > > > > +
> > > > > > > > > > > > +/*
> > > > > > > > > > > > + * Prepare to restore the image.
> > > > > > > > > > > > + * a0: satp of saved page tables.
> > > > > > > > > > > > + * a1: satp of temporary page tables.
> > > > > > > > > > > > + * a2: cpu_resume.
> > > > > > > > > > > > + */
> > > > > > > > > > > > +ENTRY(hibernate_restore_image)
> > > > > > > > > > > > +	mv	s0, a0
> > > > > > > > > > > > +	mv	s1, a1
> > > > > > > > > > > > +	mv	s2, a2
> > > > > > > > > > > > +	REG_L	s4, restore_pblist
> > > > > > > > > > > > +	REG_L	a1, relocated_restore_code
> > > > > > > > > > > > +
> > > > > > > > > > > > +	jalr	a1
> > > > > > > > > > > > +END(hibernate_restore_image)
> > > > > > > > > > > > +
> > > > > > > > > > > > +/*
> > > > > > > > > > > > + * The below code will be executed from a 'safe' page.
> > > > > > > > > > > > + * It first switches to the temporary page table, then starts to copy the pages
> > > > > > > > > > > > + * back to the original memory location. Finally, it jumps to __hibernate_cpu_resume()
> > > > > > > > > > > > + * to restore the CPU context.
> > > > > > > > > > > > + */
> > > > > > > > > > > > +ENTRY(hibernate_core_restore_code)
> > > > > > > > > > > > +	/* switch to temp page table. */
> > > > > > > > > > > > +	csrw satp, s1
> > > > > > > > > > > > +	sfence.vma
> > > > > > > > > > > > +.Lcopy:
> > > > > > > > > > > > +	/* The below code will restore the hibernated image. */
> > > > > > > > > > > > +	REG_L	a1, HIBERN_PBE_ADDR(s4)
> > > > > > > > > > > > +	REG_L	a0, HIBERN_PBE_ORIG(s4)
> > > > > > > > > > >
> > > > > > > > > > > Are we sure restore_pblist will never be NULL?
> > > > > > > > > > restore_pblist is a linked list; it will be NULL during initialization or during page clean-up by the hibernation core. During
> > > > > > > > > > the initial resume process, the hibernation core checks the header and loads the pages. If everything works correctly, the pages
> > > > > > > > > > are linked to restore_pblist and swsusp_arch_resume() is then invoked; otherwise the hibernation core throws an error and fails
> > > > > > > > > > to resume from the hibernated image.
> > > > > > > > >
> > > > > > > > > I know restore_pblist is a linked-list and this doesn't answer the
> > > > > > > > > question. The comment above restore_pblist says
> > > > > > > > >
> > > > > > > > > /*
> > > > > > > > >  * List of PBEs needed for restoring the pages that were allocated before
> > > > > > > > >  * the suspend and included in the suspend image, but have also been
> > > > > > > > >  * allocated by the "resume" kernel, so their contents cannot be written
> > > > > > > > >  * directly to their "original" page frames.
> > > > > > > > >  */
> > > > > > > > >
> > > > > > > > > which implies the pages that end up on this list are "special". My
> > > > > > > > > question is whether or not we're guaranteed to have at least one
> > > > > > > > > of these special pages. If not, we shouldn't assume s4 is non-null.
> > > > > > > > > If so, then a comment stating why that's guaranteed would be nice.
> > > > > > > > The restore_pblist will not be NULL, otherwise swsusp_arch_resume() wouldn't get invoked. You can find how the linked list is
> > > > > > > > built and how the pages are checked for validity at https://elixir.bootlin.com/linux/v6.2-rc8/source/kernel/power/snapshot.c .
> > > > > > > > "A comment stating why that's guaranteed would be nice"? Hmm, perhaps this is out of my scope, but I do believe in the page
> > > > > > > > validity checking in the link I shared.
> > > > > > >
> > > > > > > Sorry, but pointing to an entire source file (one that I've obviously
> > > > > > > already looked at, since I quoted a comment from it...) is not helpful.
> > > > > > > I don't see where restore_pblist is being checked before
> > > > > > > swsusp_arch_resume() is issued (from its callsite in hibernate.c).
> > > > > > Sure, below is the hibernation flow for your reference. The linked-list creation and checking can be found at:
> > > > > > https://elixir.bootlin.com/linux/v6.2/source/kernel/power/snapshot.c#L2576
> > > > > > software_resume()
> > > > > > 	load_image_and_restore()
> > > > > > 		swsusp_read()
> > > > > > 			load_image()
> > > > > >  				snapshot_write_next()
> > > > > > 					get_buffer() <-- This is the function that checks and links the pages to restore_pblist
> > > > >
> > > > > Yup, I've read this path, including get_buffer(), where I saw that
> > > > > get_buffer() can return an address without allocating a PBE. Where is the
> > > > > check that restore_pblist isn't NULL, i.e. we see that at least one PBE
> > > > > has been allocated by get_buffer(), before we call swsusp_arch_resume()?
> > > > >
> > > > > Or, is known that at least one or more pages match the criteria pointed
> > > > > out in the comment below (copied from get_buffer())?
> > > > >
> > > > >         /*
> > > > >          * The "original" page frame has not been allocated and we have to
> > > > >          * use a "safe" page frame to store the loaded page.
> > > > >          */
> > > > >
> > > > > If so, then which ones? And where does it state that?
> > > > Let's look at the pseudocode below and hope it clears your doubt. restore_pblist depends on safe_pages_list and pbe, and both
> > > > pointers are checked. I couldn't find where restore_pblist would ever be NULL.
> > > > 	//Pseudocode to illustrate the image loading
> > > > 	initialize restore_pblist to null;
> > > > 	initialize safe_pages_list to null;
> > > > 	Allocate safe page list, return error if failed;
> > > > 	load image;
> > > > loop:	Create pbe chain, return error if failed;
> > >
> > > This loop pseudocode is incomplete. It's
> > >
> > > loop:
> > >         if (swsusp_page_is_forbidden(page) && swsusp_page_is_free(page))
> > > 	   return page_address(page);
> > > 	Create pbe chain, return error if failed;
> > > 	...
> > >
> > > which I pointed out explicitly in my last reply. Also, as I asked in my
> > > last reply (and have been asking four times now, albeit less explicitly
> > > the first two times), how do we know at least one PBE will be linked?
> > 1 PBE corresponds to 1 page; you shouldn't expect only 1 page to be saved.
> 
> I know PBEs correspond to pages. *Why* should I not expect only one page
> is saved? Or, more importantly, why should I expect more than zero pages
> are saved?
> 
> Convincing answers might be because we *always* put the restore code in
> pages which get added to the PBE list or that the original page tables
> *always* get put in pages which get added to the PBE list. It's not very
> convincing to simply *assume* that at least one random page will always
> meet the PBE list criteria.
> 
> > Hibernation core will do the calculation. If the PBEs (restore_pblist) are linked successfully, the hibernated image will be restored;
> > otherwise a normal boot will take place.
> > > Or, even more specifically this time, where is the proof that for each
> > > hibernation resume, there exists some page such that
> > > !swsusp_page_is_forbidden(page) or !swsusp_page_is_free(page) is true?
> > forbidden_pages and free_pages do not contribute to the restore_pblist (as you are already aware from the code). In fact, the
> > forbidden_pages and free_pages are not saved to the disk.
> 
> Exactly, so those pages are *not* going to contribute to the greater than
> zero pages. What I've been asking for, from the beginning, is to know
> which page(s) are known to *always* contribute to the list. Or, IOW, how
> do you know the PBE list isn't empty, a.k.a restore_pblist isn't NULL?
Well, this keeps going around in circles; I thought the answer was in the hibernation code. restore_pblist gets its pointers from the PBEs, and the PBEs are already checked for validity.
Can I suggest you submit a patch to the hibernation core?
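
For reference, the copy loop in hibernate_core_restore_code above corresponds roughly to the C below. This is only an illustrative sketch (it assumes the loop follows pbe->next, as the HIBERN_PBE_NEXT offset added in asm-offsets.c suggests); note that it copies the first entry before testing the pointer, i.e. it assumes restore_pblist is non-NULL, which is exactly the point under discussion:

	/* Illustrative C equivalent of the .Lcopy loop above (sketch, not the patch itself). */
	struct pbe *pbe = restore_pblist;

	do {
		/* Copy the loaded copy back to its original page frame. */
		copy_page(pbe->orig_address, pbe->address);
		pbe = pbe->next;
	} while (pbe);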
> 
> Thanks,
> drew
> 
> > >
> > > Thanks,
> > > drew
> > >
> > > > 	assign orig_addr and safe_page to pbe;
> > > > 	link pbe to restore_pblist;
> > > > 	return pbe to handle->buffer;
> > > > 	check handle->buffer;
> > > > 	goto loop if no error else return with error;
> > > > >
> > > > > Thanks,
> > > > > drew
> > > > >
> > > > >
> > > > > > 		hibernation_restore()
> > > > > > 			resume_target_kernel()
> > > > > > 				swsusp_arch_resume()
> > > > > > >
> > > > > > > Thanks,
> > > > > > > drew
Andrew Jones Feb. 28, 2023, 5:04 a.m. UTC | #18
On Tue, Feb 28, 2023 at 01:32:53AM +0000, JeeHeng Sia wrote:
> > > > > 	load image;
> > > > > loop:	Create pbe chain, return error if failed;
> > > >
> > > > This loop pseudocode is incomplete. It's
> > > >
> > > > loop:
> > > >         if (swsusp_page_is_forbidden(page) && swsusp_page_is_free(page))
> > > > 	   return page_address(page);
> > > > 	Create pbe chain, return error if failed;
> > > > 	...
> > > >
> > > > which I pointed out explicitly in my last reply. Also, as I asked in my
> > > > last reply (and have been asking four times now, albeit less explicitly
> > > > the first two times), how do we know at least one PBE will be linked?
> > > 1 PBE correspond to 1 page, you shouldn't expect only 1 page is saved.
> > 
> > I know PBEs correspond to pages. *Why* should I not expect only one page
> > is saved? Or, more importantly, why should I expect more than zero pages
> > are saved?
> > 
> > Convincing answers might be because we *always* put the restore code in
> > pages which get added to the PBE list or that the original page tables
> > *always* get put in pages which get added to the PBE list. It's not very
> > convincing to simply *assume* that at least one random page will always
> > meet the PBE list criteria.
> > 
> > > Hibernation core will do the calculation. If the PBEs (restore_pblist) linked successfully, the hibernated image will be restore else
> > normal boot will take place.
> > > > Or, even more specifically this time, where is the proof that for each
> > > > hibernation resume, there exists some page such that
> > > > !swsusp_page_is_forbidden(page) or !swsusp_page_is_free(page) is true?
> > > forbidden_pages and free_pages are not contributed to the restore_pblist (as you already aware from the code). Infact, the
> > forbidden_pages and free_pages are not save into the disk.
> > 
> > Exactly, so those pages are *not* going to contribute to the greater than
> > zero pages. What I've been asking for, from the beginning, is to know
> > which page(s) are known to *always* contribute to the list. Or, IOW, how
> > do you know the PBE list isn't empty, a.k.a restore_pblist isn't NULL?
> Well, this is keep going around in a circle, thought the answer is in the hibernation code. restore_pblist get the pointer from the PBE, and the PBE already checked for validity.

It keeps going around in circles because you keep avoiding my question by
pointing out trivial linked list code. I'm not worried about the linked
list code being correct. My concern is that you're using a linked list
with an assumption that it is not empty. My question has been all along,
how do you know it's not empty?

I'll change the way I ask this time. Please take a look at your PBE list
and let me know if there are PBEs on it that must be there on each
hibernation resume, e.g. the resume code page is there or whatever.

> Can I suggest you to submit a patch to the hibernation core?

Why? What's wrong with it?

Thanks,
drew
Sia Jee Heng Feb. 28, 2023, 5:33 a.m. UTC | #19
> -----Original Message-----
> From: Andrew Jones <ajones@ventanamicro.com>
> Sent: Tuesday, 28 February, 2023 1:05 PM
> To: JeeHeng Sia <jeeheng.sia@starfivetech.com>
> Cc: paul.walmsley@sifive.com; palmer@dabbelt.com; aou@eecs.berkeley.edu; linux-riscv@lists.infradead.org; linux-
> kernel@vger.kernel.org; Leyfoon Tan <leyfoon.tan@starfivetech.com>; Mason Huo <mason.huo@starfivetech.com>
> Subject: Re: [PATCH v4 4/4] RISC-V: Add arch functions to support hibernation/suspend-to-disk
> 
> On Tue, Feb 28, 2023 at 01:32:53AM +0000, JeeHeng Sia wrote:
> > > > > > 	load image;
> > > > > > loop:	Create pbe chain, return error if failed;
> > > > >
> > > > > This loop pseudocode is incomplete. It's
> > > > >
> > > > > loop:
> > > > >         if (swsusp_page_is_forbidden(page) && swsusp_page_is_free(page))
> > > > > 	   return page_address(page);
> > > > > 	Create pbe chain, return error if failed;
> > > > > 	...
> > > > >
> > > > > which I pointed out explicitly in my last reply. Also, as I asked in my
> > > > > last reply (and have been asking four times now, albeit less explicitly
> > > > > the first two times), how do we know at least one PBE will be linked?
> > > > 1 PBE correspond to 1 page, you shouldn't expect only 1 page is saved.
> > >
> > > I know PBEs correspond to pages. *Why* should I not expect only one page
> > > is saved? Or, more importantly, why should I expect more than zero pages
> > > are saved?
> > >
> > > Convincing answers might be because we *always* put the restore code in
> > > pages which get added to the PBE list or that the original page tables
> > > *always* get put in pages which get added to the PBE list. It's not very
> > > convincing to simply *assume* that at least one random page will always
> > > meet the PBE list criteria.
> > >
> > > > Hibernation core will do the calculation. If the PBEs (restore_pblist) linked successfully, the hibernated image will be restore else
> > > normal boot will take place.
> > > > > Or, even more specifically this time, where is the proof that for each
> > > > > hibernation resume, there exists some page such that
> > > > > !swsusp_page_is_forbidden(page) or !swsusp_page_is_free(page) is true?
> > > > forbidden_pages and free_pages are not contributed to the restore_pblist (as you already aware from the code). Infact, the
> > > forbidden_pages and free_pages are not save into the disk.
> > >
> > > Exactly, so those pages are *not* going to contribute to the greater than
> > > zero pages. What I've been asking for, from the beginning, is to know
> > > which page(s) are known to *always* contribute to the list. Or, IOW, how
> > > do you know the PBE list isn't empty, a.k.a restore_pblist isn't NULL?
> > Well, this is keep going around in a circle, thought the answer is in the hibernation code. restore_pblist get the pointer from the PBE,
> and the PBE already checked for validity.
> 
> It keeps going around in circles because you keep avoiding my question by
> pointing out trivial linked list code. I'm not worried about the linked
> list code being correct. My concern is that you're using a linked list
> with an assumption that it is not empty. My question has been all along,
> how do you know it's not empty?
> 
> I'll change the way I ask this time. Please take a look at your PBE list
> and let me know if there are PBEs on it that must be there on each
> hibernation resume, e.g. the resume code page is there or whatever.
> 
> > Can I suggest you to submit a patch to the hibernation core?
> 
> Why? What's wrong with it?
Kindly let me draw two scenarios for you. Option 1 adds the restore_pblist check to the hibernation core, and option 2 adds the restore_pblist check to the arch code (a C sketch of option 2 follows the pseudocode below).
Although I really don't think the check is needed, if you do want to add it, I would suggest going with option 1. Again, I really think it is not needed!

	//Option 1
	//Pseudocode to illustrate the image loading
	initialize restore_pblist to null;
	initialize safe_pages_list to null;
	Allocate safe page list, return error if failed;
	load image;
loop:	Create pbe chain, return error if failed;
	assign orig_addr and safe_page to pbe;
	link pbe to restore_pblist;
	/* Add checking here */
	return error if restore_pblist equal to null;
	return pbe to handle->buffer;
	check handle->buffer;
	goto loop if no error else return with error;

	//option 2
	//Pseudocode to illustrate the image loading
	initialize restore_pblist to null;
	initialize safe_pages_list to null;
	Allocate safe page list, return error if failed;
	load image;
loop:	Create pbe chain, return error if failed;
	assign orig_addr and safe_page to pbe;
	link pbe to restore_pblist;
	return pbe to handle->buffer;
	check handle->buffer;
	goto loop if no error else return with error;
	everything works correctly, continue the rest of the operation
	invoke swsusp_arch_resume

	//@swsusp_arch_resume()
loop2: return error if restore_pblist is null
	increment restore_pblist and goto loop2
	create temp_pg_table
	continue the rest of the resume operation
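
In C, the option 2 check would amount to something like the sketch below. This is illustrative only, not the patch's actual swsusp_arch_resume(); the -EINVAL value is a placeholder, the comment marks where the patch's existing temporary page table setup and restore-code relocation would remain, and the sketch only tests the list head once rather than walking the list as in loop2 above:

	int swsusp_arch_resume(void)
	{
		/* Option 2: bail out early if the hibernation core handed over
		 * an empty PBE list, i.e. there is nothing to restore. */
		if (!restore_pblist)
			return -EINVAL;

		/*
		 * ... temporary page table creation and relocation of
		 * hibernate_core_restore_code to a safe page, as in the
		 * patch, would follow here ...
		 */

		return 0;
	}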
> 
> Thanks,
> drew
Sia Jee Heng Feb. 28, 2023, 6:33 a.m. UTC | #20
> -----Original Message-----
> From: Andrew Jones <ajones@ventanamicro.com>
> Sent: Tuesday, 28 February, 2023 1:05 PM
> To: JeeHeng Sia <jeeheng.sia@starfivetech.com>
> Cc: paul.walmsley@sifive.com; palmer@dabbelt.com; aou@eecs.berkeley.edu; linux-riscv@lists.infradead.org; linux-
> kernel@vger.kernel.org; Leyfoon Tan <leyfoon.tan@starfivetech.com>; Mason Huo <mason.huo@starfivetech.com>
> Subject: Re: [PATCH v4 4/4] RISC-V: Add arch functions to support hibernation/suspend-to-disk
> 
> On Tue, Feb 28, 2023 at 01:32:53AM +0000, JeeHeng Sia wrote:
> > > > > > 	load image;
> > > > > > loop:	Create pbe chain, return error if failed;
> > > > >
> > > > > This loop pseudocode is incomplete. It's
> > > > >
> > > > > loop:
> > > > >         if (swsusp_page_is_forbidden(page) && swsusp_page_is_free(page))
> > > > > 	   return page_address(page);
> > > > > 	Create pbe chain, return error if failed;
> > > > > 	...
> > > > >
> > > > > which I pointed out explicitly in my last reply. Also, as I asked in my
> > > > > last reply (and have been asking four times now, albeit less explicitly
> > > > > the first two times), how do we know at least one PBE will be linked?
> > > > 1 PBE correspond to 1 page, you shouldn't expect only 1 page is saved.
> > >
> > > I know PBEs correspond to pages. *Why* should I not expect only one page
> > > is saved? Or, more importantly, why should I expect more than zero pages
> > > are saved?
> > >
> > > Convincing answers might be because we *always* put the restore code in
> > > pages which get added to the PBE list or that the original page tables
> > > *always* get put in pages which get added to the PBE list. It's not very
> > > convincing to simply *assume* that at least one random page will always
> > > meet the PBE list criteria.
> > >
> > > > Hibernation core will do the calculation. If the PBEs (restore_pblist) linked successfully, the hibernated image will be restore else
> > > normal boot will take place.
> > > > > Or, even more specifically this time, where is the proof that for each
> > > > > hibernation resume, there exists some page such that
> > > > > !swsusp_page_is_forbidden(page) or !swsusp_page_is_free(page) is true?
> > > > forbidden_pages and free_pages are not contributed to the restore_pblist (as you already aware from the code). Infact, the
> > > forbidden_pages and free_pages are not save into the disk.
> > >
> > > Exactly, so those pages are *not* going to contribute to the greater than
> > > zero pages. What I've been asking for, from the beginning, is to know
> > > which page(s) are known to *always* contribute to the list. Or, IOW, how
> > > do you know the PBE list isn't empty, a.k.a restore_pblist isn't NULL?
> > Well, this is keep going around in a circle, thought the answer is in the hibernation code. restore_pblist get the pointer from the PBE,
> and the PBE already checked for validity.
> 
> It keeps going around in circles because you keep avoiding my question by
> pointing out trivial linked list code. I'm not worried about the linked
> list code being correct. My concern is that you're using a linked list
> with an assumption that it is not empty. My question has been all along,
> how do you know it's not empty?
> 
> I'll change the way I ask this time. Please take a look at your PBE list
> and let me know if there are PBEs on it that must be there on each
> hibernation resume, e.g. the resume code page is there or whatever.
Just to add on, it is not "my" PBE list; the list comes from the hibernation core. As I already drew out in the scenarios for you, the check should be done at the initialization phase.
> 
> > Can I suggest you to submit a patch to the hibernation core?
> 
> Why? What's wrong with it?
> 
> Thanks,
> drew
Andrew Jones Feb. 28, 2023, 7:18 a.m. UTC | #21
On Tue, Feb 28, 2023 at 05:33:32AM +0000, JeeHeng Sia wrote:
> 
> 
> > -----Original Message-----
> > From: Andrew Jones <ajones@ventanamicro.com>
> > Sent: Tuesday, 28 February, 2023 1:05 PM
> > To: JeeHeng Sia <jeeheng.sia@starfivetech.com>
> > Cc: paul.walmsley@sifive.com; palmer@dabbelt.com; aou@eecs.berkeley.edu; linux-riscv@lists.infradead.org; linux-
> > kernel@vger.kernel.org; Leyfoon Tan <leyfoon.tan@starfivetech.com>; Mason Huo <mason.huo@starfivetech.com>
> > Subject: Re: [PATCH v4 4/4] RISC-V: Add arch functions to support hibernation/suspend-to-disk
> > 
> > On Tue, Feb 28, 2023 at 01:32:53AM +0000, JeeHeng Sia wrote:
> > > > > > > 	load image;
> > > > > > > loop:	Create pbe chain, return error if failed;
> > > > > >
> > > > > > This loop pseudocode is incomplete. It's
> > > > > >
> > > > > > loop:
> > > > > >         if (swsusp_page_is_forbidden(page) && swsusp_page_is_free(page))
> > > > > > 	   return page_address(page);
> > > > > > 	Create pbe chain, return error if failed;
> > > > > > 	...
> > > > > >
> > > > > > which I pointed out explicitly in my last reply. Also, as I asked in my
> > > > > > last reply (and have been asking four times now, albeit less explicitly
> > > > > > the first two times), how do we know at least one PBE will be linked?
> > > > > 1 PBE correspond to 1 page, you shouldn't expect only 1 page is saved.
> > > >
> > > > I know PBEs correspond to pages. *Why* should I not expect only one page
> > > > is saved? Or, more importantly, why should I expect more than zero pages
> > > > are saved?
> > > >
> > > > Convincing answers might be because we *always* put the restore code in
> > > > pages which get added to the PBE list or that the original page tables
> > > > *always* get put in pages which get added to the PBE list. It's not very
> > > > convincing to simply *assume* that at least one random page will always
> > > > meet the PBE list criteria.
> > > >
> > > > > Hibernation core will do the calculation. If the PBEs (restore_pblist) linked successfully, the hibernated image will be restore else
> > > > normal boot will take place.
> > > > > > Or, even more specifically this time, where is the proof that for each
> > > > > > hibernation resume, there exists some page such that
> > > > > > !swsusp_page_is_forbidden(page) or !swsusp_page_is_free(page) is true?
> > > > > forbidden_pages and free_pages are not contributed to the restore_pblist (as you already aware from the code). Infact, the
> > > > forbidden_pages and free_pages are not save into the disk.
> > > >
> > > > Exactly, so those pages are *not* going to contribute to the greater than
> > > > zero pages. What I've been asking for, from the beginning, is to know
> > > > which page(s) are known to *always* contribute to the list. Or, IOW, how
> > > > do you know the PBE list isn't empty, a.k.a restore_pblist isn't NULL?
> > > Well, this is keep going around in a circle, thought the answer is in the hibernation code. restore_pblist get the pointer from the PBE,
> > and the PBE already checked for validity.
> > 
> > It keeps going around in circles because you keep avoiding my question by
> > pointing out trivial linked list code. I'm not worried about the linked
> > list code being correct. My concern is that you're using a linked list
> > with an assumption that it is not empty. My question has been all along,
> > how do you know it's not empty?
> > 
> > I'll change the way I ask this time. Please take a look at your PBE list
> > and let me know if there are PBEs on it that must be there on each
> > hibernation resume, e.g. the resume code page is there or whatever.
> > 
> > > Can I suggest you to submit a patch to the hibernation core?
> > 
> > Why? What's wrong with it?
> Kindly let me draw 2 scenarios for you. Option 1 is to add the restore_pblist checking to the hibernation core and option 2 is to add restore_pblist checking to the arch solution
> Although I really don't think it is needed. But if you really wanted to add the checking, I would suggest to go with option 1. again, I really think that it is not needed!

This entire email thread is because you've first coded, and now stated,
that you don't think the PBE list will ever be empty. And now, below, I
see you're proposing to return an error when the PBE list is empty, why?
If there's nothing in the PBE list, then there's nothing to do for it.
Why is that an error condition?

Please explain to me why you think the PBE list *must* not be empty
(which is what I've been asking for over and over). OIOW, are there
any pages you have in mind which the resume kernel always uses and
are also always going to end up in the suspend image? I don't know,
but I assume clean, file-backed pages do not get added to the suspend
image, which would rule out most kernel code pages. Also, many pages
written during boot (which is where the resume kernel is at resume time)
were no longer resident at hibernate time, so they won't be in the
suspend image either. While it's quite likely I'm missing something
obvious, I'd rather be told what that is than to assume the PBE list
will never be empty. Which is why I keep asking about it...

Thanks,
drew

> 
> 	//Option 1
> 	//Pseudocode to illustrate the image loading
> 	initialize restore_pblist to null;
> 	initialize safe_pages_list to null;
> 	Allocate safe page list, return error if failed;
> 	load image;
> loop:	Create pbe chain, return error if failed;
> 	assign orig_addr and safe_page to pbe;
> 	link pbe to restore_pblist;
> 	/* Add checking here */
> 	return error if restore_pblist equal to null;
> 	return pbe to handle->buffer;
> 	check handle->buffer;
> 	goto loop if no error else return with error;
> 
> 	//option 2
> 	//Pseudocode to illustrate the image loading
> 	initialize restore_pblist to null;
> 	initialize safe_pages_list to null;
> 	Allocate safe page list, return error if failed;
> 	load image;
> loop:	Create pbe chain, return error if failed;
> 	assign orig_addr and safe_page to pbe;
> 	link pbe to restore_pblist;
> 	return pbe to handle->buffer;
> 	check handle->buffer;
> 	goto loop if no error else return with error;
> 	everything works correctly, continue the rest of the operation
> 	invoke swsusp_arch_resume
> 
> 	//@swsusp_arch_resume()
> loop2: return error if restore_pblist is null
> 	increment restore_pblist and goto loop2
> 	create temp_pg_table
> 	continue the rest of the resume operation
> > 
> > Thanks,
> > drew
Andrew Jones Feb. 28, 2023, 7:29 a.m. UTC | #22
On Tue, Feb 28, 2023 at 06:33:59AM +0000, JeeHeng Sia wrote:
> 
> 
> > -----Original Message-----
> > From: Andrew Jones <ajones@ventanamicro.com>
> > Sent: Tuesday, 28 February, 2023 1:05 PM
> > To: JeeHeng Sia <jeeheng.sia@starfivetech.com>
> > Cc: paul.walmsley@sifive.com; palmer@dabbelt.com; aou@eecs.berkeley.edu; linux-riscv@lists.infradead.org; linux-
> > kernel@vger.kernel.org; Leyfoon Tan <leyfoon.tan@starfivetech.com>; Mason Huo <mason.huo@starfivetech.com>
> > Subject: Re: [PATCH v4 4/4] RISC-V: Add arch functions to support hibernation/suspend-to-disk
> > 
> > On Tue, Feb 28, 2023 at 01:32:53AM +0000, JeeHeng Sia wrote:
> > > > > > > 	load image;
> > > > > > > loop:	Create pbe chain, return error if failed;
> > > > > >
> > > > > > This loop pseudocode is incomplete. It's
> > > > > >
> > > > > > loop:
> > > > > >         if (swsusp_page_is_forbidden(page) && swsusp_page_is_free(page))
> > > > > > 	   return page_address(page);
> > > > > > 	Create pbe chain, return error if failed;
> > > > > > 	...
> > > > > >
> > > > > > which I pointed out explicitly in my last reply. Also, as I asked in my
> > > > > > last reply (and have been asking four times now, albeit less explicitly
> > > > > > the first two times), how do we know at least one PBE will be linked?
> > > > > 1 PBE correspond to 1 page, you shouldn't expect only 1 page is saved.
> > > >
> > > > I know PBEs correspond to pages. *Why* should I not expect only one page
> > > > is saved? Or, more importantly, why should I expect more than zero pages
> > > > are saved?
> > > >
> > > > Convincing answers might be because we *always* put the restore code in
> > > > pages which get added to the PBE list or that the original page tables
> > > > *always* get put in pages which get added to the PBE list. It's not very
> > > > convincing to simply *assume* that at least one random page will always
> > > > meet the PBE list criteria.
> > > >
> > > > > Hibernation core will do the calculation. If the PBEs (restore_pblist) linked successfully, the hibernated image will be restore else
> > > > normal boot will take place.
> > > > > > Or, even more specifically this time, where is the proof that for each
> > > > > > hibernation resume, there exists some page such that
> > > > > > !swsusp_page_is_forbidden(page) or !swsusp_page_is_free(page) is true?
> > > > > forbidden_pages and free_pages are not contributed to the restore_pblist (as you already aware from the code). Infact, the
> > > > forbidden_pages and free_pages are not save into the disk.
> > > >
> > > > Exactly, so those pages are *not* going to contribute to the greater than
> > > > zero pages. What I've been asking for, from the beginning, is to know
> > > > which page(s) are known to *always* contribute to the list. Or, IOW, how
> > > > do you know the PBE list isn't empty, a.k.a restore_pblist isn't NULL?
> > > Well, this is keep going around in a circle, thought the answer is in the hibernation code. restore_pblist get the pointer from the PBE,
> > and the PBE already checked for validity.
> > 
> > It keeps going around in circles because you keep avoiding my question by
> > pointing out trivial linked list code. I'm not worried about the linked
> > list code being correct. My concern is that you're using a linked list
> > with an assumption that it is not empty. My question has been all along,
> > how do you know it's not empty?
> > 
> > I'll change the way I ask this time. Please take a look at your PBE list
> > and let me know if there are PBEs on it that must be there on each
> > hibernation resume, e.g. the resume code page is there or whatever.
> Just to add on, it is not "my" PBE list but the list is from the hibernation core. As already draw out the scenarios for you, checking should be done at the initialization phase. 

Your PBE list is your instance of the PBE list when you resume your
hibernation test. I'm simply asking you to dump the PBE list while
you resume a hibernation, and then tell me what's there.
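
Something like the throwaway helper below would do. It is only a debugging sketch (dump_restore_pblist is a made-up name; the struct pbe fields are the ones the hibernation core uses), called once from swsusp_arch_resume() before anything is copied:

	static void dump_restore_pblist(void)
	{
		struct pbe *pbe;
		unsigned int n = 0;

		for (pbe = restore_pblist; pbe; pbe = pbe->next, n++)
			pr_info("pbe[%u]: copy=%px orig=%px\n",
				n, pbe->address, pbe->orig_address);

		pr_info("restore_pblist: %u entries\n", n);
	}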

Please stop thinking about the trivial details of the code, like which
file a variable is in, and start thinking about how the code is being
used. A PBE list is a concept, your PBE list is an instance of that
concept, the code, which is the least interesting part, is just an
implementation of that concept. First, I want to understand the concept,
then we can worry about the code.

drew

> > 
> > > Can I suggest you to submit a patch to the hibernation core?
> > 
> > Why? What's wrong with it?
> > 
> > Thanks,
> > drew
Sia Jee Heng Feb. 28, 2023, 7:29 a.m. UTC | #23
> -----Original Message-----
> From: Andrew Jones <ajones@ventanamicro.com>
> Sent: Tuesday, 28 February, 2023 3:19 PM
> To: JeeHeng Sia <jeeheng.sia@starfivetech.com>
> Cc: paul.walmsley@sifive.com; palmer@dabbelt.com; aou@eecs.berkeley.edu; linux-riscv@lists.infradead.org; linux-
> kernel@vger.kernel.org; Leyfoon Tan <leyfoon.tan@starfivetech.com>; Mason Huo <mason.huo@starfivetech.com>
> Subject: Re: [PATCH v4 4/4] RISC-V: Add arch functions to support hibernation/suspend-to-disk
> 
> On Tue, Feb 28, 2023 at 05:33:32AM +0000, JeeHeng Sia wrote:
> >
> >
> > > -----Original Message-----
> > > From: Andrew Jones <ajones@ventanamicro.com>
> > > Sent: Tuesday, 28 February, 2023 1:05 PM
> > > To: JeeHeng Sia <jeeheng.sia@starfivetech.com>
> > > Cc: paul.walmsley@sifive.com; palmer@dabbelt.com; aou@eecs.berkeley.edu; linux-riscv@lists.infradead.org; linux-
> > > kernel@vger.kernel.org; Leyfoon Tan <leyfoon.tan@starfivetech.com>; Mason Huo <mason.huo@starfivetech.com>
> > > Subject: Re: [PATCH v4 4/4] RISC-V: Add arch functions to support hibernation/suspend-to-disk
> > >
> > > On Tue, Feb 28, 2023 at 01:32:53AM +0000, JeeHeng Sia wrote:
> > > > > > > > 	load image;
> > > > > > > > loop:	Create pbe chain, return error if failed;
> > > > > > >
> > > > > > > This loop pseudocode is incomplete. It's
> > > > > > >
> > > > > > > loop:
> > > > > > >         if (swsusp_page_is_forbidden(page) && swsusp_page_is_free(page))
> > > > > > > 	   return page_address(page);
> > > > > > > 	Create pbe chain, return error if failed;
> > > > > > > 	...
> > > > > > >
> > > > > > > which I pointed out explicitly in my last reply. Also, as I asked in my
> > > > > > > last reply (and have been asking four times now, albeit less explicitly
> > > > > > > the first two times), how do we know at least one PBE will be linked?
> > > > > > 1 PBE correspond to 1 page, you shouldn't expect only 1 page is saved.
> > > > >
> > > > > I know PBEs correspond to pages. *Why* should I not expect only one page
> > > > > is saved? Or, more importantly, why should I expect more than zero pages
> > > > > are saved?
> > > > >
> > > > > Convincing answers might be because we *always* put the restore code in
> > > > > pages which get added to the PBE list or that the original page tables
> > > > > *always* get put in pages which get added to the PBE list. It's not very
> > > > > convincing to simply *assume* that at least one random page will always
> > > > > meet the PBE list criteria.
> > > > >
> > > > > > Hibernation core will do the calculation. If the PBEs (restore_pblist) linked successfully, the hibernated image will be restore
> else
> > > > > normal boot will take place.
> > > > > > > Or, even more specifically this time, where is the proof that for each
> > > > > > > hibernation resume, there exists some page such that
> > > > > > > !swsusp_page_is_forbidden(page) or !swsusp_page_is_free(page) is true?
> > > > > > forbidden_pages and free_pages are not contributed to the restore_pblist (as you already aware from the code). Infact, the
> > > > > forbidden_pages and free_pages are not save into the disk.
> > > > >
> > > > > Exactly, so those pages are *not* going to contribute to the greater than
> > > > > zero pages. What I've been asking for, from the beginning, is to know
> > > > > which page(s) are known to *always* contribute to the list. Or, IOW, how
> > > > > do you know the PBE list isn't empty, a.k.a restore_pblist isn't NULL?
> > > > Well, this is keep going around in a circle, thought the answer is in the hibernation code. restore_pblist get the pointer from the
> PBE,
> > > and the PBE already checked for validity.
> > >
> > > It keeps going around in circles because you keep avoiding my question by
> > > pointing out trivial linked list code. I'm not worried about the linked
> > > list code being correct. My concern is that you're using a linked list
> > > with an assumption that it is not empty. My question has been all along,
> > > how do you know it's not empty?
> > >
> > > I'll change the way I ask this time. Please take a look at your PBE list
> > > and let me know if there are PBEs on it that must be there on each
> > > hibernation resume, e.g. the resume code page is there or whatever.
> > >
> > > > Can I suggest you to submit a patch to the hibernation core?
> > >
> > > Why? What's wrong with it?
> > Kindly let me draw 2 scenarios for you. Option 1 is to add the restore_pblist checking to the hibernation core and option 2 is to add
> restore_pblist checking to the arch solution
> > Although I really don't think it is needed. But if you really wanted to add the checking, I would suggest to go with option 1. again, I
> really think that it is not needed!
> 
> This entire email thread is because you've first coded, and now stated,
> that you don't think the PBE list will ever be empty. And now, below, I
> see you're proposing to return an error when the PBE list is empty, why?
> If there's nothing in the PBE list, then there's nothing to do for it.
> Why is that an error condition?
> 
> Please explain to me why you think the PBE list *must* not be empty
> (which is what I've been asking for over and over). OIOW, are there
> any pages you have in mind which the resume kernel always uses and
> are also always going to end up in the suspend image? I don't know,
> but I assume clean, file-backed pages do not get added to the suspend
> image, which would rule out most kernel code pages. Also, many pages
> written during boot (which is where the resume kernel is at resume time)
> were no longer resident at hibernate time, so they won't be in the
> suspend image either. While it's quite likely I'm missing something
> obvious, I'd rather be told what that is than to assume the PBE list
> will never be empty. Which is why I keep asking about it...
The answer is already in the Linux kernel hibernation core. Do you need me to write a white paper explaining it in detail, or would you prefer a conference call?
> 
> Thanks,
> drew
> 
> >
> > 	//Option 1
> > 	//Pseudocode to illustrate the image loading
> > 	initialize restore_pblist to null;
> > 	initialize safe_pages_list to null;
> > 	Allocate safe page list, return error if failed;
> > 	load image;
> > loop:	Create pbe chain, return error if failed;
> > 	assign orig_addr and safe_page to pbe;
> > 	link pbe to restore_pblist;
> > 	/* Add checking here */
> > 	return error if restore_pblist equal to null;
> > 	return pbe to handle->buffer;
> > 	check handle->buffer;
> > 	goto loop if no error else return with error;
> >
> > 	//option 2
> > 	//Pseudocode to illustrate the image loading
> > 	initialize restore_pblist to null;
> > 	initialize safe_pages_list to null;
> > 	Allocate safe page list, return error if failed;
> > 	load image;
> > loop:	Create pbe chain, return error if failed;
> > 	assign orig_addr and safe_page to pbe;
> > 	link pbe to restore_pblist;
> > 	return pbe to handle->buffer;
> > 	check handle->buffer;
> > 	goto loop if no error else return with error;
> > 	everything works correctly, continue the rest of the operation
> > 	invoke swsusp_arch_resume
> >
> > 	//@swsusp_arch_resume()
> > loop2: return error if restore_pblist is null
> > 	increment restore_pblist and goto loop2
> > 	create temp_pg_table
> > 	continue the rest of the resume operation
> > >
> > > Thanks,
> > > drew
Sia Jee Heng Feb. 28, 2023, 7:34 a.m. UTC | #24
> -----Original Message-----
> From: Andrew Jones <ajones@ventanamicro.com>
> Sent: Tuesday, 28 February, 2023 3:30 PM
> To: JeeHeng Sia <jeeheng.sia@starfivetech.com>
> Cc: paul.walmsley@sifive.com; palmer@dabbelt.com; aou@eecs.berkeley.edu; linux-riscv@lists.infradead.org; linux-
> kernel@vger.kernel.org; Leyfoon Tan <leyfoon.tan@starfivetech.com>; Mason Huo <mason.huo@starfivetech.com>
> Subject: Re: [PATCH v4 4/4] RISC-V: Add arch functions to support hibernation/suspend-to-disk
> 
> On Tue, Feb 28, 2023 at 06:33:59AM +0000, JeeHeng Sia wrote:
> >
> >
> > > -----Original Message-----
> > > From: Andrew Jones <ajones@ventanamicro.com>
> > > Sent: Tuesday, 28 February, 2023 1:05 PM
> > > To: JeeHeng Sia <jeeheng.sia@starfivetech.com>
> > > Cc: paul.walmsley@sifive.com; palmer@dabbelt.com; aou@eecs.berkeley.edu; linux-riscv@lists.infradead.org; linux-
> > > kernel@vger.kernel.org; Leyfoon Tan <leyfoon.tan@starfivetech.com>; Mason Huo <mason.huo@starfivetech.com>
> > > Subject: Re: [PATCH v4 4/4] RISC-V: Add arch functions to support hibernation/suspend-to-disk
> > >
> > > On Tue, Feb 28, 2023 at 01:32:53AM +0000, JeeHeng Sia wrote:
> > > > > > > > 	load image;
> > > > > > > > loop:	Create pbe chain, return error if failed;
> > > > > > >
> > > > > > > This loop pseudocode is incomplete. It's
> > > > > > >
> > > > > > > loop:
> > > > > > >         if (swsusp_page_is_forbidden(page) && swsusp_page_is_free(page))
> > > > > > > 	   return page_address(page);
> > > > > > > 	Create pbe chain, return error if failed;
> > > > > > > 	...
> > > > > > >
> > > > > > > which I pointed out explicitly in my last reply. Also, as I asked in my
> > > > > > > last reply (and have been asking four times now, albeit less explicitly
> > > > > > > the first two times), how do we know at least one PBE will be linked?
> > > > > > 1 PBE correspond to 1 page, you shouldn't expect only 1 page is saved.
> > > > >
> > > > > I know PBEs correspond to pages. *Why* should I not expect only one page
> > > > > is saved? Or, more importantly, why should I expect more than zero pages
> > > > > are saved?
> > > > >
> > > > > Convincing answers might be because we *always* put the restore code in
> > > > > pages which get added to the PBE list or that the original page tables
> > > > > *always* get put in pages which get added to the PBE list. It's not very
> > > > > convincing to simply *assume* that at least one random page will always
> > > > > meet the PBE list criteria.
> > > > >
> > > > > > Hibernation core will do the calculation. If the PBEs (restore_pblist) linked successfully, the hibernated image will be restore
> else
> > > > > normal boot will take place.
> > > > > > > Or, even more specifically this time, where is the proof that for each
> > > > > > > hibernation resume, there exists some page such that
> > > > > > > !swsusp_page_is_forbidden(page) or !swsusp_page_is_free(page) is true?
> > > > > > forbidden_pages and free_pages are not contributed to the restore_pblist (as you already aware from the code). Infact, the
> > > > > forbidden_pages and free_pages are not save into the disk.
> > > > >
> > > > > Exactly, so those pages are *not* going to contribute to the greater than
> > > > > zero pages. What I've been asking for, from the beginning, is to know
> > > > > which page(s) are known to *always* contribute to the list. Or, IOW, how
> > > > > do you know the PBE list isn't empty, a.k.a restore_pblist isn't NULL?
> > > > Well, this is keep going around in a circle, thought the answer is in the hibernation code. restore_pblist get the pointer from the
> PBE,
> > > and the PBE already checked for validity.
> > >
> > > It keeps going around in circles because you keep avoiding my question by
> > > pointing out trivial linked list code. I'm not worried about the linked
> > > list code being correct. My concern is that you're using a linked list
> > > with an assumption that it is not empty. My question has been all along,
> > > how do you know it's not empty?
> > >
> > > I'll change the way I ask this time. Please take a look at your PBE list
> > > and let me know if there are PBEs on it that must be there on each
> > > hibernation resume, e.g. the resume code page is there or whatever.
> > Just to add on, it is not "my" PBE list but the list is from the hibernation core. As already draw out the scenarios for you, checking
> should be done at the initialization phase.
> 
> Your PBE list is your instance of the PBE list when you resume your
> hibernation test. I'm simply asking you to dump the PBE list while
> you resume a hibernation, and then tell me what's there.
> 
> Please stop thinking about the trivial details of the code, like which
> file a variable is in, and start thinking about how the code is being
> used. A PBE list is a concept, your PBE list is an instance of that
> concept, the code, which is the least interesting part, is just an
> implementation of that concept. First, I want to understand the concept,
> then we can worry about the code.
> 
Dear Andrew, perhaps a conference call would be better? Otherwise we are just going to waste time typing... Let me know how to set up the call with you. Thank you.
> drew
> 
> > >
> > > > Can I suggest you to submit a patch to the hibernation core?
> > >
> > > Why? What's wrong with it?
> > >
> > > Thanks,
> > > drew
Andrew Jones Feb. 28, 2023, 7:37 a.m. UTC | #25
On Tue, Feb 28, 2023 at 07:29:40AM +0000, JeeHeng Sia wrote:
> 
> 
> > -----Original Message-----
> > From: Andrew Jones <ajones@ventanamicro.com>
> > Sent: Tuesday, 28 February, 2023 3:19 PM
> > To: JeeHeng Sia <jeeheng.sia@starfivetech.com>
> > Cc: paul.walmsley@sifive.com; palmer@dabbelt.com; aou@eecs.berkeley.edu; linux-riscv@lists.infradead.org; linux-
> > kernel@vger.kernel.org; Leyfoon Tan <leyfoon.tan@starfivetech.com>; Mason Huo <mason.huo@starfivetech.com>
> > Subject: Re: [PATCH v4 4/4] RISC-V: Add arch functions to support hibernation/suspend-to-disk
> > 
> > On Tue, Feb 28, 2023 at 05:33:32AM +0000, JeeHeng Sia wrote:
> > >
> > >
> > > > -----Original Message-----
> > > > From: Andrew Jones <ajones@ventanamicro.com>
> > > > Sent: Tuesday, 28 February, 2023 1:05 PM
> > > > To: JeeHeng Sia <jeeheng.sia@starfivetech.com>
> > > > Cc: paul.walmsley@sifive.com; palmer@dabbelt.com; aou@eecs.berkeley.edu; linux-riscv@lists.infradead.org; linux-
> > > > kernel@vger.kernel.org; Leyfoon Tan <leyfoon.tan@starfivetech.com>; Mason Huo <mason.huo@starfivetech.com>
> > > > Subject: Re: [PATCH v4 4/4] RISC-V: Add arch functions to support hibernation/suspend-to-disk
> > > >
> > > > On Tue, Feb 28, 2023 at 01:32:53AM +0000, JeeHeng Sia wrote:
> > > > > > > > > 	load image;
> > > > > > > > > loop:	Create pbe chain, return error if failed;
> > > > > > > >
> > > > > > > > This loop pseudocode is incomplete. It's
> > > > > > > >
> > > > > > > > loop:
> > > > > > > >         if (swsusp_page_is_forbidden(page) && swsusp_page_is_free(page))
> > > > > > > > 	   return page_address(page);
> > > > > > > > 	Create pbe chain, return error if failed;
> > > > > > > > 	...
> > > > > > > >
> > > > > > > > which I pointed out explicitly in my last reply. Also, as I asked in my
> > > > > > > > last reply (and have been asking four times now, albeit less explicitly
> > > > > > > > the first two times), how do we know at least one PBE will be linked?
> > > > > > > 1 PBE correspond to 1 page, you shouldn't expect only 1 page is saved.
> > > > > >
> > > > > > I know PBEs correspond to pages. *Why* should I not expect only one page
> > > > > > is saved? Or, more importantly, why should I expect more than zero pages
> > > > > > are saved?
> > > > > >
> > > > > > Convincing answers might be because we *always* put the restore code in
> > > > > > pages which get added to the PBE list or that the original page tables
> > > > > > *always* get put in pages which get added to the PBE list. It's not very
> > > > > > convincing to simply *assume* that at least one random page will always
> > > > > > meet the PBE list criteria.
> > > > > >
> > > > > > > Hibernation core will do the calculation. If the PBEs (restore_pblist) linked successfully, the hibernated image will be restore
> > else
> > > > > > normal boot will take place.
> > > > > > > > Or, even more specifically this time, where is the proof that for each
> > > > > > > > hibernation resume, there exists some page such that
> > > > > > > > !swsusp_page_is_forbidden(page) or !swsusp_page_is_free(page) is true?
> > > > > > > forbidden_pages and free_pages are not contributed to the restore_pblist (as you already aware from the code). Infact, the
> > > > > > forbidden_pages and free_pages are not save into the disk.
> > > > > >
> > > > > > Exactly, so those pages are *not* going to contribute to the greater than
> > > > > > zero pages. What I've been asking for, from the beginning, is to know
> > > > > > which page(s) are known to *always* contribute to the list. Or, IOW, how
> > > > > > do you know the PBE list isn't empty, a.k.a restore_pblist isn't NULL?
> > > > > Well, this is keep going around in a circle, thought the answer is in the hibernation code. restore_pblist get the pointer from the
> > PBE,
> > > > and the PBE already checked for validity.
> > > >
> > > > It keeps going around in circles because you keep avoiding my question by
> > > > pointing out trivial linked list code. I'm not worried about the linked
> > > > list code being correct. My concern is that you're using a linked list
> > > > with an assumption that it is not empty. My question has been all along,
> > > > how do you know it's not empty?
> > > >
> > > > I'll change the way I ask this time. Please take a look at your PBE list
> > > > and let me know if there are PBEs on it that must be there on each
> > > > hibernation resume, e.g. the resume code page is there or whatever.
> > > >
> > > > > Can I suggest you to submit a patch to the hibernation core?
> > > >
> > > > Why? What's wrong with it?
> > > Kindly let me draw 2 scenarios for you. Option 1 is to add the restore_pblist checking to the hibernation core and option 2 is to add
> > restore_pblist checking to the arch solution
> > > Although I really don't think it is needed. But if you really wanted to add the checking, I would suggest to go with option 1. again, I
> > really think that it is not needed!
> > 
> > This entire email thread is because you've first coded, and now stated,
> > that you don't think the PBE list will ever be empty. And now, below, I
> > see you're proposing to return an error when the PBE list is empty, why?
> > If there's nothing in the PBE list, then there's nothing to do for it.
> > Why is that an error condition?
> > 
> > Please explain to me why you think the PBE list *must* not be empty
> > (which is what I've been asking for over and over). OIOW, are there
> > any pages you have in mind which the resume kernel always uses and
> > are also always going to end up in the suspend image? I don't know,
> > but I assume clean, file-backed pages do not get added to the suspend
> > image, which would rule out most kernel code pages. Also, many pages
> > written during boot (which is where the resume kernel is at resume time)
> > were no longer resident at hibernate time, so they won't be in the
> > suspend image either. While it's quite likely I'm missing something
> > obvious, I'd rather be told what that is than to assume the PBE list
> > will never be empty. Which is why I keep asking about it...
> The answer already in the Linux kernel hibernation core, do you need me to write a white paper to explain in detail or you need a conference call? 

I'm not sure why you don't just write a paragraph or two here in this
email thread explaining what "the answer" is. Anyway, feel free to
invite me to a call if you think it'd be easier to hash out that way.

Thanks,
drew

> > 
> > Thanks,
> > drew
> > 
> > >
> > > 	//Option 1
> > > 	//Pseudocode to illustrate the image loading
> > > 	initialize restore_pblist to null;
> > > 	initialize safe_pages_list to null;
> > > 	Allocate safe page list, return error if failed;
> > > 	load image;
> > > loop:	Create pbe chain, return error if failed;
> > > 	assign orig_addr and safe_page to pbe;
> > > 	link pbe to restore_pblist;
> > > 	/* Add checking here */
> > > 	return error if restore_pblist equal to null;
> > > 	return pbe to handle->buffer;
> > > 	check handle->buffer;
> > > 	goto loop if no error else return with error;
> > >
> > > 	//option 2
> > > 	//Pseudocode to illustrate the image loading
> > > 	initialize restore_pblist to null;
> > > 	initialize safe_pages_list to null;
> > > 	Allocate safe page list, return error if failed;
> > > 	load image;
> > > loop:	Create pbe chain, return error if failed;
> > > 	assign orig_addr and safe_page to pbe;
> > > 	link pbe to restore_pblist;
> > > 	return pbe to handle->buffer;
> > > 	check handle->buffer;
> > > 	goto loop if no error else return with error;
> > > 	everything works correctly, continue the rest of the operation
> > > 	invoke swsusp_arch_resume
> > >
> > > 	//@swsusp_arch_resume()
> > > loop2: return error if restore_pblist is null
> > > 	increment restore_pblist and goto loop2
> > > 	create temp_pg_table
> > > 	continue the rest of the resume operation
> > > >
> > > > Thanks,
> > > > drew
Sia Jee Heng March 3, 2023, 1:53 a.m. UTC | #26
Hi Andrew,


> -----Original Message-----
> From: Andrew Jones <ajones@ventanamicro.com>
> Sent: Tuesday, February 28, 2023 3:37 PM
> To: JeeHeng Sia <jeeheng.sia@starfivetech.com>
> Cc: paul.walmsley@sifive.com; palmer@dabbelt.com; aou@eecs.berkeley.edu; linux-riscv@lists.infradead.org; linux-
> kernel@vger.kernel.org; Leyfoon Tan <leyfoon.tan@starfivetech.com>; Mason Huo <mason.huo@starfivetech.com>
> Subject: Re: [PATCH v4 4/4] RISC-V: Add arch functions to support hibernation/suspend-to-disk
> 
> On Tue, Feb 28, 2023 at 07:29:40AM +0000, JeeHeng Sia wrote:
> >
> >
> > > -----Original Message-----
> > > From: Andrew Jones <ajones@ventanamicro.com>
> > > Sent: Tuesday, 28 February, 2023 3:19 PM
> > > To: JeeHeng Sia <jeeheng.sia@starfivetech.com>
> > > Cc: paul.walmsley@sifive.com; palmer@dabbelt.com; aou@eecs.berkeley.edu; linux-riscv@lists.infradead.org; linux-
> > > kernel@vger.kernel.org; Leyfoon Tan <leyfoon.tan@starfivetech.com>; Mason Huo <mason.huo@starfivetech.com>
> > > Subject: Re: [PATCH v4 4/4] RISC-V: Add arch functions to support hibernation/suspend-to-disk
> > >
> > > On Tue, Feb 28, 2023 at 05:33:32AM +0000, JeeHeng Sia wrote:
> > > >
> > > >
> > > > > -----Original Message-----
> > > > > From: Andrew Jones <ajones@ventanamicro.com>
> > > > > Sent: Tuesday, 28 February, 2023 1:05 PM
> > > > > To: JeeHeng Sia <jeeheng.sia@starfivetech.com>
> > > > > Cc: paul.walmsley@sifive.com; palmer@dabbelt.com; aou@eecs.berkeley.edu; linux-riscv@lists.infradead.org; linux-
> > > > > kernel@vger.kernel.org; Leyfoon Tan <leyfoon.tan@starfivetech.com>; Mason Huo <mason.huo@starfivetech.com>
> > > > > Subject: Re: [PATCH v4 4/4] RISC-V: Add arch functions to support hibernation/suspend-to-disk
> > > > >
> > > > > On Tue, Feb 28, 2023 at 01:32:53AM +0000, JeeHeng Sia wrote:
> > > > > > > > > > 	load image;
> > > > > > > > > > loop:	Create pbe chain, return error if failed;
> > > > > > > > >
> > > > > > > > > This loop pseudocode is incomplete. It's
> > > > > > > > >
> > > > > > > > > loop:
> > > > > > > > >         if (swsusp_page_is_forbidden(page) && swsusp_page_is_free(page))
> > > > > > > > > 	   return page_address(page);
> > > > > > > > > 	Create pbe chain, return error if failed;
> > > > > > > > > 	...
> > > > > > > > >
> > > > > > > > > which I pointed out explicitly in my last reply. Also, as I asked in my
> > > > > > > > > last reply (and have been asking four times now, albeit less explicitly
> > > > > > > > > the first two times), how do we know at least one PBE will be linked?
> > > > > > > > 1 PBE correspond to 1 page, you shouldn't expect only 1 page is saved.
> > > > > > >
> > > > > > > I know PBEs correspond to pages. *Why* should I not expect only one page
> > > > > > > is saved? Or, more importantly, why should I expect more than zero pages
> > > > > > > are saved?
> > > > > > >
> > > > > > > Convincing answers might be because we *always* put the restore code in
> > > > > > > pages which get added to the PBE list or that the original page tables
> > > > > > > *always* get put in pages which get added to the PBE list. It's not very
> > > > > > > convincing to simply *assume* that at least one random page will always
> > > > > > > meet the PBE list criteria.
> > > > > > >
> > > > > > > > Hibernation core will do the calculation. If the PBEs (restore_pblist) linked successfully, the hibernated image will be
> restore
> > > else
> > > > > > > normal boot will take place.
> > > > > > > > > Or, even more specifically this time, where is the proof that for each
> > > > > > > > > hibernation resume, there exists some page such that
> > > > > > > > > !swsusp_page_is_forbidden(page) or !swsusp_page_is_free(page) is true?
> > > > > > > > forbidden_pages and free_pages are not contributed to the restore_pblist (as you already aware from the code). Infact,
> the
> > > > > > > forbidden_pages and free_pages are not save into the disk.
> > > > > > >
> > > > > > > Exactly, so those pages are *not* going to contribute to the greater than
> > > > > > > zero pages. What I've been asking for, from the beginning, is to know
> > > > > > > which page(s) are known to *always* contribute to the list. Or, IOW, how
> > > > > > > do you know the PBE list isn't empty, a.k.a restore_pblist isn't NULL?
> > > > > > Well, this is keep going around in a circle, thought the answer is in the hibernation code. restore_pblist get the pointer from
> the
> > > PBE,
> > > > > and the PBE already checked for validity.
> > > > >
> > > > > It keeps going around in circles because you keep avoiding my question by
> > > > > pointing out trivial linked list code. I'm not worried about the linked
> > > > > list code being correct. My concern is that you're using a linked list
> > > > > with an assumption that it is not empty. My question has been all along,
> > > > > how do you know it's not empty?
> > > > >
> > > > > I'll change the way I ask this time. Please take a look at your PBE list
> > > > > and let me know if there are PBEs on it that must be there on each
> > > > > hibernation resume, e.g. the resume code page is there or whatever.
> > > > >
> > > > > > Can I suggest you to submit a patch to the hibernation core?
> > > > >
> > > > > Why? What's wrong with it?
> > > > Kindly let me draw 2 scenarios for you. Option 1 is to add the restore_pblist checking to the hibernation core and option 2 is to
> add
> > > restore_pblist checking to the arch solution
> > > > Although I really don't think it is needed. But if you really wanted to add the checking, I would suggest to go with option 1. again,
> I
> > > really think that it is not needed!
> > >
> > > This entire email thread is because you've first coded, and now stated,
> > > that you don't think the PBE list will ever be empty. And now, below, I
> > > see you're proposing to return an error when the PBE list is empty, why?
> > > If there's nothing in the PBE list, then there's nothing to do for it.
> > > Why is that an error condition?
> > >
> > > Please explain to me why you think the PBE list *must* not be empty
> > > (which is what I've been asking for over and over). OIOW, are there
> > > any pages you have in mind which the resume kernel always uses and
> > > are also always going to end up in the suspend image? I don't know,
> > > but I assume clean, file-backed pages do not get added to the suspend
> > > image, which would rule out most kernel code pages. Also, many pages
> > > written during boot (which is where the resume kernel is at resume time)
> > > were no longer resident at hibernate time, so they won't be in the
> > > suspend image either. While it's quite likely I'm missing something
> > > obvious, I'd rather be told what that is than to assume the PBE list
> > > will never be empty. Which is why I keep asking about it...
> > The answer already in the Linux kernel hibernation core, do you need me to write a white paper to explain in detail or you need a
> conference call?
> 
> I'm not sure why you don't just write a paragraph or two here in this
> email thread explaining what "the answer" is. Anyway, feel free to
> invite me to a call if you think it'd be easier to hash out that way.
Thank you very much for freeing up time to join the call. It was very nice to talk to you over the conference call, and I learned a lot from you.
Below is a summary of the experiment, for everyone's benefit:
To avoid inspecting a huge log, the experiment was carried out on QEMU with 512MB of memory (128000 pages).
During hibernation, 22770 pages (out of 128000) were identified as needing to be stored to disk. Those pages consist of kernel text, rodata, page tables, stack/heap/kmalloc/vmalloc memory, user-space applications, the rootfs, etc. The number of pages that need to be stored to disk depends on the "workload" on the system.
On resume, only 21651 pages were assigned to restore_pblist. The remaining pages consist of metadata pages and forbidden pages, which are handled by the "resume kernel". The arch code handles the pages assigned to restore_pblist.
From the experiment, we also know that a game started before hibernation is still "alive" after resuming from hibernation and can continue to be played without problems.
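For reference, the 21651 figure above is simply the length of restore_pblist. A minimal, hypothetical debug helper (not part of this patch, assuming struct pbe and restore_pblist as declared in include/linux/suspend.h) that reproduces the counting would look like this:

	#include <linux/suspend.h>	/* struct pbe, restore_pblist */

	/* Hypothetical debug helper: count how many pages the arch resume
	 * code will have to copy back from their safe pages. */
	static unsigned long count_restore_pbes(void)
	{
		struct pbe *pbe;
		unsigned long n = 0;

		for (pbe = restore_pblist; pbe; pbe = pbe->next)
			n++;

		return n;
	}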

Thanks
Jee Heng
Andrew Jones March 3, 2023, 8:09 a.m. UTC | #27
On Fri, Mar 03, 2023 at 01:53:19AM +0000, JeeHeng Sia wrote:
> Hi Andrew,
> 
> 
> > -----Original Message-----
> > From: Andrew Jones <ajones@ventanamicro.com>
...
> > I'm not sure why you don't just write a paragraph or two here in this
> > email thread explaining what "the answer" is. Anyway, feel free to
> > invite me to a call if you think it'd be easier to hash out that way.
> Thank you very much for freeing up time to join the call. It was very nice to talk to you over the conference call, and I learned a lot from you.
> Below is a summary of the experiment, for everyone's benefit:
> To avoid inspecting a huge log, the experiment was carried out on QEMU with 512MB of memory (128000 pages).
> During hibernation, 22770 pages (out of 128000) were identified as needing to be stored to disk. Those pages consist of kernel text, rodata, page tables, stack/heap/kmalloc/vmalloc memory, user-space applications, the rootfs, etc. The number of pages that need to be stored to disk depends on the "workload" on the system.
> On resume, only 21651 pages were assigned to restore_pblist. The remaining pages consist of metadata pages and forbidden pages, which are handled by the "resume kernel". The arch code handles the pages assigned to restore_pblist.
> From the experiment, we also know that a game started before hibernation is still "alive" after resuming from hibernation and can continue to be played without problems.
>

Thank you, Jee Heng. Indeed it looks like the majority of the pages that
are selected for the suspend image end up on the PBE list. While we don't
have definitive "this page must be on the PBE list" type of result, I
agree that we shouldn't need to worry about the PBE list ever being empty.

Thanks,
drew
diff mbox series

Patch

diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index e2b656043abf..4555848a817f 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -690,6 +690,13 @@  menu "Power management options"
 
 source "kernel/power/Kconfig"
 
+config ARCH_HIBERNATION_POSSIBLE
+	def_bool y
+
+config ARCH_HIBERNATION_HEADER
+	def_bool y
+	depends on HIBERNATION
+
 endmenu # "Power management options"
 
 menu "CPU Power Management"
diff --git a/arch/riscv/include/asm/assembler.h b/arch/riscv/include/asm/assembler.h
index 727a97735493..68c46c0e0ea8 100644
--- a/arch/riscv/include/asm/assembler.h
+++ b/arch/riscv/include/asm/assembler.h
@@ -59,4 +59,24 @@ 
 		REG_L	s11, (SUSPEND_CONTEXT_REGS + PT_S11)(a0)
 	.endm
 
+/*
+ * copy_page - copy 1 page (4KB) of data from source to destination
+ * @a0 - destination
+ * @a1 - source
+ */
+	.macro	copy_page a0, a1
+		lui	a2, 0x1
+		add	a2, a2, a0
+1 :
+		REG_L	t0, 0(a1)
+		REG_L	t1, SZREG(a1)
+
+		REG_S	t0, 0(a0)
+		REG_S	t1, SZREG(a0)
+
+		addi	a0, a0, 2 * SZREG
+		addi	a1, a1, 2 * SZREG
+		bne	a2, a0, 1b
+	.endm
+
 #endif	/* __ASM_ASSEMBLER_H */
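For clarity, here is an illustrative C rendering (not part of the patch) of what the copy_page macro above does: copy one 4KB page, two register-sized words per iteration, stopping when the destination pointer reaches the end of the page.

	/* Illustrative C equivalent of the copy_page assembler macro:
	 * 'end' plays the role of a2 (a0 + 4096), and each iteration
	 * moves two XLEN-sized words, mirroring the REG_L/REG_S pairs. */
	static void copy_page_sketch(unsigned long *dst, const unsigned long *src)
	{
		unsigned long *end = dst + (4096 / sizeof(unsigned long));

		do {
			dst[0] = src[0];
			dst[1] = src[1];
			dst += 2;
			src += 2;
		} while (dst != end);
	}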
diff --git a/arch/riscv/include/asm/suspend.h b/arch/riscv/include/asm/suspend.h
index 75419c5ca272..3362da56a9d8 100644
--- a/arch/riscv/include/asm/suspend.h
+++ b/arch/riscv/include/asm/suspend.h
@@ -21,6 +21,11 @@  struct suspend_context {
 #endif
 };
 
+/*
+ * Used by hibernation core and cleared during resume sequence
+ */
+extern int in_suspend;
+
 /* Low-level CPU suspend entry function */
 int __cpu_suspend_enter(struct suspend_context *context);
 
@@ -36,4 +41,18 @@  int __cpu_resume_enter(unsigned long hartid, unsigned long context);
 /* Used to save and restore the csr */
 void suspend_save_csrs(struct suspend_context *context);
 void suspend_restore_csrs(struct suspend_context *context);
+
+/* Low-level API to support hibernation */
+int swsusp_arch_suspend(void);
+int swsusp_arch_resume(void);
+int arch_hibernation_header_save(void *addr, unsigned int max_size);
+int arch_hibernation_header_restore(void *addr);
+int __hibernate_cpu_resume(void);
+
+/* Used to resume on the CPU we hibernated on */
+int hibernate_resume_nonboot_cpu_disable(void);
+
+asmlinkage void hibernate_restore_image(unsigned long resume_satp, unsigned long satp_temp,
+					unsigned long cpu_resume);
+asmlinkage int hibernate_core_restore_code(void);
 #endif
diff --git a/arch/riscv/kernel/Makefile b/arch/riscv/kernel/Makefile
index 4cf303a779ab..daab341d55e4 100644
--- a/arch/riscv/kernel/Makefile
+++ b/arch/riscv/kernel/Makefile
@@ -64,6 +64,7 @@  obj-$(CONFIG_MODULES)		+= module.o
 obj-$(CONFIG_MODULE_SECTIONS)	+= module-sections.o
 
 obj-$(CONFIG_CPU_PM)		+= suspend_entry.o suspend.o
+obj-$(CONFIG_HIBERNATION)	+= hibernate.o hibernate-asm.o
 
 obj-$(CONFIG_FUNCTION_TRACER)	+= mcount.o ftrace.o
 obj-$(CONFIG_DYNAMIC_FTRACE)	+= mcount-dyn.o
diff --git a/arch/riscv/kernel/asm-offsets.c b/arch/riscv/kernel/asm-offsets.c
index df9444397908..d6a75aac1d27 100644
--- a/arch/riscv/kernel/asm-offsets.c
+++ b/arch/riscv/kernel/asm-offsets.c
@@ -9,6 +9,7 @@ 
 #include <linux/kbuild.h>
 #include <linux/mm.h>
 #include <linux/sched.h>
+#include <linux/suspend.h>
 #include <asm/kvm_host.h>
 #include <asm/thread_info.h>
 #include <asm/ptrace.h>
@@ -116,6 +117,10 @@  void asm_offsets(void)
 
 	OFFSET(SUSPEND_CONTEXT_REGS, suspend_context, regs);
 
+	OFFSET(HIBERN_PBE_ADDR, pbe, address);
+	OFFSET(HIBERN_PBE_ORIG, pbe, orig_address);
+	OFFSET(HIBERN_PBE_NEXT, pbe, next);
+
 	OFFSET(KVM_ARCH_GUEST_ZERO, kvm_vcpu_arch, guest_context.zero);
 	OFFSET(KVM_ARCH_GUEST_RA, kvm_vcpu_arch, guest_context.ra);
 	OFFSET(KVM_ARCH_GUEST_SP, kvm_vcpu_arch, guest_context.sp);
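For reference, the three HIBERN_PBE_* offsets added above index into struct pbe, which (as defined in include/linux/suspend.h) looks roughly like the sketch below; having the offsets lets the assembly restore loop walk the list without calling back into C.

	struct pbe {
		void *address;		/* address of the safe copy */
		void *orig_address;	/* original address of the page */
		struct pbe *next;
	};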
diff --git a/arch/riscv/kernel/hibernate-asm.S b/arch/riscv/kernel/hibernate-asm.S
new file mode 100644
index 000000000000..846affe4dced
--- /dev/null
+++ b/arch/riscv/kernel/hibernate-asm.S
@@ -0,0 +1,77 @@ 
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Hibernation low level support for RISCV.
+ *
+ * Copyright (C) 2023 StarFive Technology Co., Ltd.
+ *
+ * Author: Jee Heng Sia <jeeheng.sia@starfivetech.com>
+ */
+
+#include <asm/asm.h>
+#include <asm/asm-offsets.h>
+#include <asm/assembler.h>
+#include <asm/csr.h>
+
+#include <linux/linkage.h>
+
+/*
+ * int __hibernate_cpu_resume(void)
+ * Switch back to the hibernated image's page table prior to restoring the CPU
+ * context.
+ *
+ * Always returns 0
+ */
+ENTRY(__hibernate_cpu_resume)
+	/* switch to hibernated image's page table. */
+	csrw CSR_SATP, s0
+	sfence.vma
+
+	REG_L	a0, hibernate_cpu_context
+
+	restore_csr
+	restore_reg
+
+	/* Return zero value. */
+	add	a0, zero, zero
+
+	ret
+END(__hibernate_cpu_resume)
+
+/*
+ * Prepare to restore the image.
+ * a0: satp of saved page tables.
+ * a1: satp of temporary page tables.
+ * a2: cpu_resume.
+ */
+ENTRY(hibernate_restore_image)
+	mv	s0, a0
+	mv	s1, a1
+	mv	s2, a2
+	REG_L	s4, restore_pblist
+	REG_L	a1, relocated_restore_code
+
+	jalr	a1
+END(hibernate_restore_image)
+
+/*
+ * The below code will be executed from a 'safe' page.
+ * It first switches to the temporary page table, then starts to copy the pages
+ * back to the original memory location. Finally, it jumps to __hibernate_cpu_resume()
+ * to restore the CPU context.
+ */
+ENTRY(hibernate_core_restore_code)
+	/* switch to temp page table. */
+	csrw satp, s1
+	sfence.vma
+.Lcopy:
+	/* The below code will restore the hibernated image. */
+	REG_L	a1, HIBERN_PBE_ADDR(s4)
+	REG_L	a0, HIBERN_PBE_ORIG(s4)
+
+	copy_page a0, a1
+
+	REG_L	s4, HIBERN_PBE_NEXT(s4)
+	bnez	s4, .Lcopy
+
+	jalr	s2
+END(hibernate_core_restore_code)
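For readers following the assembly, here is an illustrative C sketch (not part of the patch) of the data flow in hibernate_core_restore_code(): walk restore_pblist, copy every safe page back to its original location, then jump to __hibernate_cpu_resume(). The real code has to execute from the relocated 'safe' page with satp pointing at the temporary page table, which is why it is written in assembly.

	#include <linux/suspend.h>	/* struct pbe */
	#include <linux/string.h>	/* memcpy */
	#include <asm/page.h>		/* PAGE_SIZE */

	/* Illustrative sketch only; the real loop runs from a relocated safe
	 * page under the temporary page table. */
	static void core_restore_sketch(struct pbe *pblist, int (*cpu_resume)(void))
	{
		struct pbe *pbe;

		for (pbe = pblist; pbe; pbe = pbe->next)
			memcpy(pbe->orig_address, pbe->address, PAGE_SIZE);

		cpu_resume();	/* __hibernate_cpu_resume(): restore satp and the CPU context */
	}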
diff --git a/arch/riscv/kernel/hibernate.c b/arch/riscv/kernel/hibernate.c
new file mode 100644
index 000000000000..46a2f470db6e
--- /dev/null
+++ b/arch/riscv/kernel/hibernate.c
@@ -0,0 +1,447 @@ 
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Hibernation support for RISCV
+ *
+ * Copyright (C) 2023 StarFive Technology Co., Ltd.
+ *
+ * Author: Jee Heng Sia <jeeheng.sia@starfivetech.com>
+ */
+
+#include <asm/barrier.h>
+#include <asm/cacheflush.h>
+#include <asm/mmu_context.h>
+#include <asm/page.h>
+#include <asm/pgalloc.h>
+#include <asm/pgtable.h>
+#include <asm/sections.h>
+#include <asm/set_memory.h>
+#include <asm/smp.h>
+#include <asm/suspend.h>
+
+#include <linux/cpu.h>
+#include <linux/memblock.h>
+#include <linux/pm.h>
+#include <linux/sched.h>
+#include <linux/suspend.h>
+#include <linux/utsname.h>
+
+/* The logical cpu number we should resume on, initialised to a non-cpu number. */
+static int sleep_cpu = -EINVAL;
+
+/* Pointer to the temporary resume page table. */
+static pgd_t *resume_pg_dir;
+
+/* CPU context to be saved. */
+struct suspend_context *hibernate_cpu_context;
+EXPORT_SYMBOL_GPL(hibernate_cpu_context);
+
+unsigned long relocated_restore_code;
+EXPORT_SYMBOL_GPL(relocated_restore_code);
+
+/**
+ * struct arch_hibernate_hdr_invariants - container to store kernel build version.
+ * @uts_version: to save the build number and date so that we do not resume with
+ *		a different kernel.
+ */
+struct arch_hibernate_hdr_invariants {
+	char		uts_version[__NEW_UTS_LEN + 1];
+};
+
+/**
+ * struct arch_hibernate_hdr - helper parameters that help us to restore the image.
+ * @invariants: container to store kernel build version.
+ * @hartid: to make sure the same boot_cpu executes the hibernate/restore code.
+ * @saved_satp: original page table used by the hibernated image.
+ * @restore_cpu_addr: the kernel's image address to restore the CPU context.
+ */
+static struct arch_hibernate_hdr {
+	struct arch_hibernate_hdr_invariants invariants;
+	unsigned long	hartid;
+	unsigned long	saved_satp;
+	unsigned long	restore_cpu_addr;
+} resume_hdr;
+
+static inline void arch_hdr_invariants(struct arch_hibernate_hdr_invariants *i)
+{
+	memset(i, 0, sizeof(*i));
+	memcpy(i->uts_version, init_utsname()->version, sizeof(i->uts_version));
+}
+
+/*
+ * Check if the given pfn is in the 'nosave' section.
+ */
+int pfn_is_nosave(unsigned long pfn)
+{
+	unsigned long nosave_begin_pfn = sym_to_pfn(&__nosave_begin);
+	unsigned long nosave_end_pfn = sym_to_pfn(&__nosave_end - 1);
+
+	return ((pfn >= nosave_begin_pfn) && (pfn <= nosave_end_pfn));
+}
+
+void notrace save_processor_state(void)
+{
+	WARN_ON(num_online_cpus() != 1);
+}
+
+void notrace restore_processor_state(void)
+{
+}
+
+/*
+ * Helper parameters need to be saved to the hibernation image header.
+ */
+int arch_hibernation_header_save(void *addr, unsigned int max_size)
+{
+	struct arch_hibernate_hdr *hdr = addr;
+
+	if (max_size < sizeof(*hdr))
+		return -EOVERFLOW;
+
+	arch_hdr_invariants(&hdr->invariants);
+
+	hdr->hartid = cpuid_to_hartid_map(sleep_cpu);
+	hdr->saved_satp = csr_read(CSR_SATP);
+	hdr->restore_cpu_addr = (unsigned long)__hibernate_cpu_resume;
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(arch_hibernation_header_save);
+
+/*
+ * Retrieve the helper parameters from the hibernation image header.
+ */
+int arch_hibernation_header_restore(void *addr)
+{
+	struct arch_hibernate_hdr_invariants invariants;
+	struct arch_hibernate_hdr *hdr = addr;
+	int ret = 0;
+
+	arch_hdr_invariants(&invariants);
+
+	if (memcmp(&hdr->invariants, &invariants, sizeof(invariants))) {
+		pr_crit("Hibernate image not generated by this kernel!\n");
+		return -EINVAL;
+	}
+
+	sleep_cpu = riscv_hartid_to_cpuid(hdr->hartid);
+	if (sleep_cpu < 0) {
+		pr_crit("Hibernated on a CPU not known to this kernel!\n");
+		sleep_cpu = -EINVAL;
+		return -EINVAL;
+	}
+
+#ifdef CONFIG_SMP
+	ret = bringup_hibernate_cpu(sleep_cpu);
+	if (ret) {
+		sleep_cpu = -EINVAL;
+		return ret;
+	}
+#endif
+	resume_hdr = *hdr;
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(arch_hibernation_header_restore);
+
+int swsusp_arch_suspend(void)
+{
+	int ret = 0;
+
+	if (__cpu_suspend_enter(hibernate_cpu_context)) {
+		sleep_cpu = smp_processor_id();
+		suspend_save_csrs(hibernate_cpu_context);
+		ret = swsusp_save();
+	} else {
+		suspend_restore_csrs(hibernate_cpu_context);
+		flush_tlb_all();
+		flush_icache_all();
+
+		/*
+		 * Tell the hibernation core that we've just restored the memory.
+		 */
+		in_suspend = 0;
+		sleep_cpu = -EINVAL;
+	}
+
+	return ret;
+}
+
+static unsigned long _temp_pgtable_map_pte(pte_t *dst_ptep, pte_t *src_ptep,
+					   unsigned long addr, pgprot_t prot)
+{
+	pte_t pte = READ_ONCE(*src_ptep);
+
+	if (pte_present(pte))
+		set_pte(dst_ptep, __pte(pte_val(pte) | pgprot_val(prot)));
+
+	return 0;
+}
+
+static unsigned long temp_pgtable_map_pte(pmd_t *dst_pmdp, pmd_t *src_pmdp,
+					  unsigned long start, unsigned long end,
+					  pgprot_t prot)
+{
+	unsigned long addr = start;
+	pte_t *src_ptep;
+	pte_t *dst_ptep;
+
+	if (pmd_none(READ_ONCE(*dst_pmdp))) {
+		dst_ptep = (pte_t *)get_safe_page(GFP_ATOMIC);
+		if (!dst_ptep)
+			return -ENOMEM;
+
+		pmd_populate_kernel(NULL, dst_pmdp, dst_ptep);
+	}
+
+	dst_ptep = pte_offset_kernel(dst_pmdp, start);
+	src_ptep = pte_offset_kernel(src_pmdp, start);
+
+	do {
+		_temp_pgtable_map_pte(dst_ptep, src_ptep, addr, prot);
+	} while (dst_ptep++, src_ptep++, addr += PAGE_SIZE, addr < end);
+
+	return 0;
+}
+
+static unsigned long temp_pgtable_map_pmd(pud_t *dst_pudp, pud_t *src_pudp,
+					  unsigned long start, unsigned long end,
+					  pgprot_t prot)
+{
+	unsigned long addr = start;
+	unsigned long next;
+	unsigned long ret;
+	pmd_t *src_pmdp;
+	pmd_t *dst_pmdp;
+
+	if (pud_none(READ_ONCE(*dst_pudp))) {
+		dst_pmdp = (pmd_t *)get_safe_page(GFP_ATOMIC);
+		if (!dst_pmdp)
+			return -ENOMEM;
+
+		pud_populate(NULL, dst_pudp, dst_pmdp);
+	}
+
+	dst_pmdp = pmd_offset(dst_pudp, start);
+	src_pmdp = pmd_offset(src_pudp, start);
+
+	do {
+		pmd_t pmd = READ_ONCE(*src_pmdp);
+
+		next = pmd_addr_end(addr, end);
+
+		if (pmd_none(pmd))
+			continue;
+
+		if (pmd_leaf(pmd)) {
+			set_pmd(dst_pmdp, __pmd(pmd_val(pmd) | pgprot_val(prot)));
+		} else {
+			ret = temp_pgtable_map_pte(dst_pmdp, src_pmdp, addr, next, prot);
+			if (ret)
+				return -ENOMEM;
+		}
+	} while (dst_pmdp++, src_pmdp++, addr = next, addr != end);
+
+	return 0;
+}
+
+static unsigned long temp_pgtable_map_pud(p4d_t *dst_p4dp, p4d_t *src_p4dp,
+					  unsigned long start,
+					  unsigned long end, pgprot_t prot)
+{
+	unsigned long addr = start;
+	unsigned long next;
+	unsigned long ret;
+	pud_t *dst_pudp;
+	pud_t *src_pudp;
+
+	if (p4d_none(READ_ONCE(*dst_p4dp))) {
+		dst_pudp = (pud_t *)get_safe_page(GFP_ATOMIC);
+		if (!dst_pudp)
+			return -ENOMEM;
+
+		p4d_populate(NULL, dst_p4dp, dst_pudp);
+	}
+
+	dst_pudp = pud_offset(dst_p4dp, start);
+	src_pudp = pud_offset(src_p4dp, start);
+
+	do {
+		pud_t pud = READ_ONCE(*src_pudp);
+
+		next = pud_addr_end(addr, end);
+
+		if (pud_none(pud))
+			continue;
+
+		if (pud_leaf(pud)) {
+			set_pud(dst_pudp, __pud(pud_val(pud) | pgprot_val(prot)));
+		} else {
+			ret = temp_pgtable_map_pmd(dst_pudp, src_pudp, addr, next, prot);
+			if (ret)
+				return -ENOMEM;
+		}
+	} while (dst_pudp++, src_pudp++, addr = next, addr != end);
+
+	return 0;
+}
+
+static unsigned long temp_pgtable_map_p4d(pgd_t *dst_pgdp, pgd_t *src_pgdp,
+					  unsigned long start, unsigned long end,
+					  pgprot_t prot)
+{
+	unsigned long addr = start;
+	unsigned long next;
+	unsigned long ret;
+	p4d_t *dst_p4dp;
+	p4d_t *src_p4dp;
+
+	if (pgd_none(READ_ONCE(*dst_pgdp))) {
+		dst_p4dp = (p4d_t *)get_safe_page(GFP_ATOMIC);
+		if (!dst_p4dp)
+			return -ENOMEM;
+
+		pgd_populate(NULL, dst_pgdp, dst_p4dp);
+	}
+
+	dst_p4dp = p4d_offset(dst_pgdp, start);
+	src_p4dp = p4d_offset(src_pgdp, start);
+
+	do {
+		p4d_t p4d = READ_ONCE(*src_p4dp);
+
+		next = p4d_addr_end(addr, end);
+
+		if (p4d_none(READ_ONCE(*src_p4dp)))
+			continue;
+
+		if (p4d_leaf(p4d)) {
+			set_p4d(dst_p4dp, __p4d(p4d_val(p4d) | pgprot_val(prot)));
+		} else {
+			ret = temp_pgtable_map_pud(dst_p4dp, src_p4dp, addr, next, prot);
+			if (ret)
+				return -ENOMEM;
+		}
+	} while (dst_p4dp++, src_p4dp++, addr = next, addr != end);
+
+	return 0;
+}
+
+static unsigned long temp_pgtable_mapping(pgd_t *pgdp)
+{
+	unsigned long end = (unsigned long)pfn_to_virt(max_low_pfn);
+	unsigned long addr = PAGE_OFFSET;
+	pgd_t *dst_pgdp = pgd_offset_pgd(pgdp, addr);
+	pgd_t *src_pgdp = pgd_offset_k(addr);
+	unsigned long next;
+
+	do {
+		next = pgd_addr_end(addr, end);
+		if (pgd_none(READ_ONCE(*src_pgdp)))
+			continue;
+
+		if (temp_pgtable_map_p4d(dst_pgdp, src_pgdp, addr, next, PAGE_KERNEL))
+			return -ENOMEM;
+	} while (dst_pgdp++, src_pgdp++, addr = next, addr != end);
+
+	return 0;
+}
+
+static unsigned long temp_pgtable_text_mapping(pgd_t *pgdp, unsigned long addr)
+{
+	pgd_t *dst_pgdp = pgd_offset_pgd(pgdp, addr);
+	pgd_t *src_pgdp = pgd_offset_k(addr);
+
+	if (pgd_none(READ_ONCE(*src_pgdp)))
+		return -EFAULT;
+
+	if (temp_pgtable_map_p4d(dst_pgdp, src_pgdp, addr, addr, PAGE_KERNEL_EXEC))
+		return -ENOMEM;
+
+	return 0;
+}
+
+static unsigned long relocate_restore_code(void)
+{
+	unsigned long ret;
+	void *page = (void *)get_safe_page(GFP_ATOMIC);
+
+	if (!page)
+		return -ENOMEM;
+
+	copy_page(page, hibernate_core_restore_code);
+
+	/* Make the page containing the relocated code executable. */
+	set_memory_x((unsigned long)page, 1);
+
+	ret = temp_pgtable_text_mapping(resume_pg_dir, (unsigned long)page);
+	if (ret)
+		return ret;
+
+	return (unsigned long)page;
+}
+
+int swsusp_arch_resume(void)
+{
+	unsigned long ret;
+
+	/*
+	 * Memory allocated by get_safe_page() will be dealt with by the hibernation core,
+	 * we don't need to free it here.
+	 */
+	resume_pg_dir = (pgd_t *)get_safe_page(GFP_ATOMIC);
+	if (!resume_pg_dir)
+		return -ENOMEM;
+
+	/*
+	 * The pages need to be writable when restoring the image.
+	 * Create a second copy of page table just for the linear map.
+	 * Use this temporary page table to restore the image.
+	 */
+	ret = temp_pgtable_mapping(resume_pg_dir);
+	if (ret)
+		return (int)ret;
+
+	/* Move the restore code to a new page so that it doesn't get overwritten by itself. */
+	relocated_restore_code = relocate_restore_code();
+	if (relocated_restore_code == -ENOMEM)
+		return -ENOMEM;
+
+	/*
+	 * Map the __hibernate_cpu_resume() address into the temporary page table so that the
+	 * restore code can jump to it once it has finished restoring the image. That way, the
+	 * code that executes next does not find itself in a different address space after
+	 * switching over to the original page table used by the hibernated image.
+	 */
+	ret = temp_pgtable_text_mapping(resume_pg_dir, (unsigned long)resume_hdr.restore_cpu_addr);
+	if (ret)
+		return ret;
+
+	hibernate_restore_image(resume_hdr.saved_satp, (PFN_DOWN(__pa(resume_pg_dir)) | satp_mode),
+				resume_hdr.restore_cpu_addr);
+
+	return 0;
+}
+
+#ifdef CONFIG_PM_SLEEP_SMP
+int hibernate_resume_nonboot_cpu_disable(void)
+{
+	if (sleep_cpu < 0) {
+		pr_err("Failing to resume from hibernate on an unknown CPU\n");
+		return -ENODEV;
+	}
+
+	return freeze_secondary_cpus(sleep_cpu);
+}
+#endif
+
+static int __init riscv_hibernate_init(void)
+{
+	hibernate_cpu_context = kzalloc(sizeof(*hibernate_cpu_context), GFP_KERNEL);
+
+	if (WARN_ON(!hibernate_cpu_context))
+		return -ENOMEM;
+
+	return 0;
+}
+
+early_initcall(riscv_hibernate_init);