
[v6,2/4] xen/riscv: introduce setup_initial_pages

Message ID d1a6fb6112b61000645eb1a4ab9ade8a208d4204.1683131359.git.oleksii.kurochko@gmail.com (mailing list archive)
State Superseded
Series enable MMU for RISC-V

Commit Message

Oleksii Kurochko May 3, 2023, 4:31 p.m. UTC
The idea was taken from xvisor but the following changes
were done:
* Use only a minimal part of the code, enough to enable the MMU
* Rename the {_}setup_initial_pagetables functions
* Add an argument to setup_initial_mapping() to allow setting
  PTE flags.
* update setup_initial_pagetables function to map sections
  with correct PTE flags.
* Rewrite enable_mmu() to C.
* Map the linker address range to the load address range without
  a 1:1 mapping. The mapping will be 1:1 only when
  load_start_addr is equal to linker_start_addr.
* add safety checks such as:
  * Xen size is less than page size
  * linker addresses range doesn't overlap load addresses
    range
* Rework macros {THIRD,SECOND,FIRST,ZEROETH}_{SHIFT,MASK}
* change PTE_LEAF_DEFAULT to RW instead of RWX.
* Remove phys_offset as it is not used now
* Remove alignment of {map, pa}_start &= XEN_PT_LEVEL_MAP_MASK(0);
  in setup_initial_mapping() as they should already be aligned.
  Add a check that {map, pa}_start are aligned.
* Remove clear_pagetables() as initial pagetables will be
  zeroed during bss initialization
* Remove __attribute__((section(".entry")) for setup_initial_pagetables()
  as there is no such section in xen.lds.S
* Update the argument of pte_is_valid() to "const pte_t *p"
* Add check that Xen's load address is aligned at 4k boundary
* Refactor setup_initial_pagetables() so it maps the linker
  address range to the load address range, and afterwards sets the
  needed permissions for specific sections (such as .text, .rodata, etc.);
  otherwise RW permissions are set by default.
* Add function to check that requested SATP_MODE is supported

Origin: git@github.com:xvisor/xvisor.git 9be2fdd7
Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
---
Changes in V6:
 	- move PAGE_SHIFT, PADDR_BITS to the top of page-bits.h
 	- cast argument x of pte_to_addr() macros to paddr_t to avoid risk of overflow for RV32
 	- update type of num_levels from 'unsigned long' to 'unsigned int'
 	- define PGTBL_INITIAL_COUNT as ((CONFIG_PAGING_LEVELS - 1) + 1)
 	- update the type of permission arguments, changing it from 'unsigned long' to 'unsigned int'
 	- fix code style
 	- switch the 'while' loop to a 'for' loop
 	- undef HANDLE_PGTBL
 	- clean the root page table after the MMU is disabled in check_pgtbl_mode_support()
 	- align __bss_start properly
 	- remove unnecessary const for the paddr_to_pte, pte_to_paddr, pte_is_valid functions
 	- add a switch_stack_and_jump macro and use it inside enable_mmu() before jumping to
 	  the cont_after_mmu_is_enabled() function
---
Changes in V5:
	* Indent fields of pte_t struct
	* Rename addr_to_pte() and ppn_to_paddr() to match their content
---
Changes in V4:
  * use GB() macros instead of defining SZ_1G
  * hardcode XEN_VIRT_START and add comment (ADDRESS_SPACE_END + 1 - GB(1))
  * remove unnecessary 'asm' word at the end of #error
  * encapsulate pte_t definition in a struct
  * rename addr_to_ppn() to ppn_to_paddr().
  * change type of paddr argument from const unsigned long to paddr_t
  * pte_to_paddr() update prototype.
  * calculate size of Xen binary based on an amount of page tables
  * use unsigned int instead of uint32_t as
    its use isn't warranted.
  * remove extern of bss_{start,end} as they aren't used in mm.c anymore
  * fix code style
  * add argument for HANDLE_PGTBL macros instead of curr_lvl_num variable
  * make enable_mmu() noinline to prevent it from being inlined under
    link-time optimization, because of the nature of enable_mmu()
  * add function to check that SATP_MODE is supported.
  * update the commit message
  * update setup_initial_pagetables to set correct PTE flags in one pass
    instead of calling setup_pte_permissions after setup_initial_pagetables()
    as setup_initial_pagetables() isn't used to change permission flags.
---
Changes in V3:
 - update definition of pte_t structure to have a proper size of pte_t
   in case of RV32.
 - update asm/mm.h with new functions and remove unnecessary 'extern'.
 - remove LEVEL_* macros as only XEN_PT_LEVEL_* are enough.
 - update paddr_to_pte() to receive permissions as an argument.
 - add check that map_start & pa_start is properly aligned.
 - move  defines PAGETABLE_ORDER, PAGETABLE_ENTRIES, PTE_PPN_SHIFT to
   <asm/page-bits.h>
 - Rename PTE_SHIFT to PTE_PPN_SHIFT
 - refactor setup_initial_pagetables: map all LINK addresses to LOAD addresses
   and after setup PTEs permission for sections; update check that linker
   and load addresses don't overlap.
 - refactor setup_initial_mapping: allocate pagetable 'dynamically' if it is
   necessary.
 - rewrite enable_mmu in C; add the check that map_start and pa_start are
   aligned on 4k boundary.
 - update the comment for the setup_initial_pagetables function
 - Add RV_STAGE1_MODE to support different MMU modes
 - set XEN_VIRT_START very high to not overlap with load address range
 - align bss section
---
Changes in V2:
 * update the commit message:
 * Remove {ZEROETH,FIRST,...}_{SHIFT,MASK, SIZE,...} and
   introduce XEN_PT_LEVEL_*() and LEVEL_* instead
 * Rework pt_linear_offset() and pt_index based on  XEN_PT_LEVEL_*()
 * Remove clear_pagetables() functions as pagetables were zeroed during
   .bss initialization
 * Rename _setup_initial_pagetables() to setup_initial_mapping()
 * Make PTE_DEFAULT equal to RX.
 * Update prototype of setup_initial_mapping(..., bool writable) -> 
   setup_initial_mapping(..., UL flags)  
 * Update calls of setup_initial_mapping according to new prototype
 * Remove unnecessary call of:
   _setup_initial_pagetables(..., load_addr_start, load_addr_end, load_addr_start, ...)
 * Define index* in the loop of setup_initial_mapping
 * Remove attribute "__attribute__((section(".entry")))" for setup_initial_pagetables()
   as we don't have such section
 * make arguments of paddr_to_pte() and pte_is_valid() as const.
 * make xen_second_pagetable static.
 * use <xen/kernel.h> instead of declaring extern unsigned long _stext, _etext, _srodata, _erodata
 * update 'extern unsigned long __init_begin' to 'extern unsigned long __init_begin[]'
 * use aligned() instead of "__attribute__((__aligned__(PAGE_SIZE)))"
 * set __section(".bss.page_aligned") for page tables arrays
 * fix indentations
 * Change '__attribute__((section(".entry")))' to '__init'
 * Remove phys_offset as it isn't used now.
 * Remove alignment of {map, pa}_start &= XEN_PT_LEVEL_MAP_MASK(0); in
   setup_initial_mapping() as they should already be aligned.
 * Remove clear_pagetables() as initial pagetables will be
   zeroed during bss initialization
 * Remove __attribute__((section(".entry")) for setup_initial_pagetables()
   as there is no such section in xen.lds.S
 * Update the argument of pte_is_valid() to "const pte_t *p"
---
 xen/arch/riscv/Makefile                |   1 +
 xen/arch/riscv/include/asm/config.h    |  13 +-
 xen/arch/riscv/include/asm/current.h   |  10 +
 xen/arch/riscv/include/asm/mm.h        |   9 +
 xen/arch/riscv/include/asm/page-bits.h |  10 +
 xen/arch/riscv/include/asm/page.h      |  62 +++++
 xen/arch/riscv/mm.c                    | 315 +++++++++++++++++++++++++
 xen/arch/riscv/riscv64/head.S          |   1 +
 xen/arch/riscv/setup.c                 |  11 +
 xen/arch/riscv/xen.lds.S               |   4 +
 10 files changed, 435 insertions(+), 1 deletion(-)
 create mode 100644 xen/arch/riscv/include/asm/current.h
 create mode 100644 xen/arch/riscv/include/asm/mm.h
 create mode 100644 xen/arch/riscv/include/asm/page.h
 create mode 100644 xen/arch/riscv/mm.c

Comments

Jan Beulich May 8, 2023, 8:58 a.m. UTC | #1
On 03.05.2023 18:31, Oleksii Kurochko wrote:
> --- a/xen/arch/riscv/include/asm/config.h
> +++ b/xen/arch/riscv/include/asm/config.h
> @@ -70,12 +70,23 @@
>    name:
>  #endif
>  
> -#define XEN_VIRT_START  _AT(UL, 0x80200000)
> +#ifdef CONFIG_RISCV_64
> +#define XEN_VIRT_START 0xFFFFFFFFC0000000 /* (_AC(-1, UL) + 1 - GB(1)) */
> +#else
> +#error "RV32 isn't supported"
> +#endif
>  
>  #define SMP_CACHE_BYTES (1 << 6)
>  
>  #define STACK_SIZE PAGE_SIZE
>  
> +#ifdef CONFIG_RISCV_64
> +#define CONFIG_PAGING_LEVELS 3
> +#define RV_STAGE1_MODE SATP_MODE_SV39
> +#else

#define CONFIG_PAGING_LEVELS 2

(or else I would think ...

> +#define RV_STAGE1_MODE SATP_MODE_SV32

... this and hence the entire "#else" should also be omitted)
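
For completeness, the whole block would then presumably read (a sketch of
the suggested shape, not a final form):

#ifdef CONFIG_RISCV_64
#define CONFIG_PAGING_LEVELS 3
#define RV_STAGE1_MODE SATP_MODE_SV39
#else
#define CONFIG_PAGING_LEVELS 2
#define RV_STAGE1_MODE SATP_MODE_SV32
#endif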

> --- /dev/null
> +++ b/xen/arch/riscv/include/asm/current.h
> @@ -0,0 +1,10 @@
> +#ifndef __ASM_CURRENT_H
> +#define __ASM_CURRENT_H
> +
> +#define switch_stack_and_jump(stack, fn)    \
> +    asm volatile (                          \
> +            "mv sp, %0 \n"                  \
> +            "j " #fn :: "r" (stack) :       \
> +    )

Nit: Style. Our normal way of writing this would be

#define switch_stack_and_jump(stack, fn)    \
    asm volatile (                          \
            "mv sp, %0\n"                   \
            "j " #fn :: "r" (stack) )

i.e. unnecessary colon omitted, no trailing blank on any generated
assembly line, and closing paren not placed on its own line (only
closing curly braces would normally live on their own lines).

However, as to the 3rd colon: Can you really get away here without a
memory clobber (i.e. the construct being a full compiler barrier)?

Further I think you want to make the use of "fn" visible to the
compiler, by using an "X" constraint just like Arm does.

Finally I think you want to add unreachable(), like both Arm and x86
have it. Plus the "noreturn" on relevant functions.
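
Taken together, a minimal sketch of what that might look like (mirroring
Arm's use of the "X" constraint; whether "memory" is the only clobber
needed is left to be confirmed):

#define switch_stack_and_jump(stack, fn)                            \
    do {                                                            \
        asm volatile (                                              \
                "mv sp, %0\n"                                       \
                "j " #fn :: "r" (stack), "X" (fn) : "memory" );     \
        unreachable();                                              \
    } while ( false )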

> --- /dev/null
> +++ b/xen/arch/riscv/include/asm/page.h
> @@ -0,0 +1,62 @@
> +#ifndef _ASM_RISCV_PAGE_H
> +#define _ASM_RISCV_PAGE_H
> +
> +#include <xen/const.h>
> +#include <xen/types.h>
> +
> +#define VPN_MASK                    ((unsigned long)(PAGETABLE_ENTRIES - 1))
> +
> +#define XEN_PT_LEVEL_ORDER(lvl)     ((lvl) * PAGETABLE_ORDER)
> +#define XEN_PT_LEVEL_SHIFT(lvl)     (XEN_PT_LEVEL_ORDER(lvl) + PAGE_SHIFT)
> +#define XEN_PT_LEVEL_SIZE(lvl)      (_AT(paddr_t, 1) << XEN_PT_LEVEL_SHIFT(lvl))
> +#define XEN_PT_LEVEL_MAP_MASK(lvl)  (~(XEN_PT_LEVEL_SIZE(lvl) - 1))
> +#define XEN_PT_LEVEL_MASK(lvl)      (VPN_MASK << XEN_PT_LEVEL_SHIFT(lvl))
> +
> +#define PTE_VALID                   BIT(0, UL)
> +#define PTE_READABLE                BIT(1, UL)
> +#define PTE_WRITABLE                BIT(2, UL)
> +#define PTE_EXECUTABLE              BIT(3, UL)
> +#define PTE_USER                    BIT(4, UL)
> +#define PTE_GLOBAL                  BIT(5, UL)
> +#define PTE_ACCESSED                BIT(6, UL)
> +#define PTE_DIRTY                   BIT(7, UL)
> +#define PTE_RSW                     (BIT(8, UL) | BIT(9, UL))
> +
> +#define PTE_LEAF_DEFAULT            (PTE_VALID | PTE_READABLE | PTE_WRITABLE)
> +#define PTE_TABLE                   (PTE_VALID)
> +
> +/* Calculate the offsets into the pagetables for a given VA */
> +#define pt_linear_offset(lvl, va)   ((va) >> XEN_PT_LEVEL_SHIFT(lvl))
> +
> +#define pt_index(lvl, va) pt_linear_offset(lvl, (va) & XEN_PT_LEVEL_MASK(lvl))

Maybe better

#define pt_index(lvl, va) (pt_linear_offset(lvl, va) & VPN_MASK)

as the involved constant will be easier to use for the compiler?
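
(For comparison, both forms reduce to the same index; the suggested one
expands to

    ((va) >> XEN_PT_LEVEL_SHIFT(lvl)) & VPN_MASK

i.e. shift first and then mask with the small constant, instead of masking
with the shifted constant and then shifting.)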

> +/* Page Table entry */
> +typedef struct {
> +#ifdef CONFIG_RISCV_64
> +    uint64_t pte;
> +#else
> +    uint32_t pte;
> +#endif
> +} pte_t;
> +
> +#define pte_to_addr(x) (((paddr_t)(x) >> PTE_PPN_SHIFT) << PAGE_SHIFT)
> +
> +#define addr_to_pte(x) (((x) >> PAGE_SHIFT) << PTE_PPN_SHIFT)

Are these two macros useful on their own? I ask because I still consider
them somewhat misnamed, as they don't produce / consume a PTE (but the
raw value). Yet generally you want to avoid any code operating on raw
values, using pte_t instead. IOW I would hope to be able to convince
you to ...

> +static inline pte_t paddr_to_pte(paddr_t paddr,
> +                                 unsigned int permissions)
> +{
> +    return (pte_t) { .pte = addr_to_pte(paddr) | permissions };
> +}
> +
> +static inline paddr_t pte_to_paddr(pte_t pte)
> +{
> +    return pte_to_addr(pte.pte);
> +}

... expand them in the two inline functions and then drop the macros.
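
I.e. roughly (just a sketch of the suggested folding, using the existing
macro bodies verbatim):

static inline pte_t paddr_to_pte(paddr_t paddr,
                                 unsigned int permissions)
{
    return (pte_t) { .pte = ((paddr >> PAGE_SHIFT) << PTE_PPN_SHIFT) |
                            permissions };
}

static inline paddr_t pte_to_paddr(pte_t pte)
{
    return ((paddr_t)pte.pte >> PTE_PPN_SHIFT) << PAGE_SHIFT;
}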

> --- /dev/null
> +++ b/xen/arch/riscv/mm.c
> @@ -0,0 +1,315 @@
> +#include <xen/compiler.h>
> +#include <xen/init.h>
> +#include <xen/kernel.h>
> +#include <xen/pfn.h>
> +
> +#include <asm/early_printk.h>
> +#include <asm/config.h>

No inclusion of asm/config.h (or xen/config.h) in any new code please.
For quite some time xen/config.h has been forcefully included everywhere
by the build system.

> +/*
> + * It is expected that Xen won't be more then 2 MB.
> + * The check in xen.lds.S guarantees that.
> + * At least 3 page (in case of Sv39 )
> + * tables are needed to cover 2 MB.

I guess this reads a little better as "At least 3 page tables (in case of
Sv39) ..." or "At least 3 (in case of Sv39) page tables ..." Also could I
talk you into using the full 80 columns instead of wrapping early in the
middle of a sentence?

> One for each page level
> + * table with PAGE_SIZE = 4 Kb
> + *
> + * One L0 page table can cover 2 MB
> + * (512 entries of one page table * PAGE_SIZE).
> + *
> + */
> +#define PGTBL_INITIAL_COUNT ((CONFIG_PAGING_LEVELS - 1) + 1)

Hmm, did you lose the part of the comment explaining the "+ 1"? Or did
the need for that actually go away (and you should drop it here)?

> +#define PGTBL_ENTRY_AMOUNT  (PAGE_SIZE / sizeof(pte_t))

Isn't this PAGETABLE_ENTRIES (and the constant hence unnecessary)?

> +static void __init setup_initial_mapping(struct mmu_desc *mmu_desc,
> +                                         unsigned long map_start,
> +                                         unsigned long map_end,
> +                                         unsigned long pa_start,
> +                                         unsigned int permissions)

What use is the last parameter, when the sole caller passes
PTE_LEAF_DEFAULT (i.e. a build-time constant)? Then ...

> +{
> +    unsigned int index;
> +    pte_t *pgtbl;
> +    unsigned long page_addr;
> +    pte_t pte_to_be_written;
> +    unsigned long paddr;
> +    unsigned int tmp_permissions;

... this variable (which would better be of more narrow scope anyway, like
perhaps several more of the above) could also have its tmp_ prefix dropped
afaict.

> +    if ( (unsigned long)_start % XEN_PT_LEVEL_SIZE(0) )
> +    {
> +        early_printk("(XEN) Xen should be loaded at 4k boundary\n");
> +        die();
> +    }
> +
> +    if ( map_start & ~XEN_PT_LEVEL_MAP_MASK(0) ||
> +         pa_start & ~XEN_PT_LEVEL_MAP_MASK(0) )
> +    {
> +        early_printk("(XEN) map and pa start addresses should be aligned\n");
> +        /* panic(), BUG() or ASSERT() aren't ready now. */
> +        die();
> +    }
> +
> +    for ( page_addr = map_start; page_addr < map_end; page_addr += XEN_PT_LEVEL_SIZE(0) )

Nit: Style (line looks to be too long now).

    for ( page_addr = map_start; page_addr < map_end;
          page_addr += XEN_PT_LEVEL_SIZE(0) )

is the way we would typically wrap it, but

    for ( page_addr = map_start;
          page_addr < map_end;
          page_addr += XEN_PT_LEVEL_SIZE(0) )

would of course also be okay if you preferred that.

> +    {
> +        pgtbl = mmu_desc->pgtbl_base;
> +
> +        switch ( mmu_desc->num_levels )
> +        {
> +        case 4: /* Level 3 */
> +            HANDLE_PGTBL(3);
> +        case 3: /* Level 2 */
> +            HANDLE_PGTBL(2);
> +        case 2: /* Level 1 */
> +            HANDLE_PGTBL(1);
> +        case 1: /* Level 0 */
> +            index = pt_index(0, page_addr);
> +            paddr = (page_addr - map_start) + pa_start;
> +
> +            tmp_permissions = permissions;
> +
> +            if ( is_kernel_text(LINK_TO_LOAD(page_addr)) ||
> +                    is_kernel_inittext(LINK_TO_LOAD(page_addr)) )

Nit: Style (and I'm pretty sure I did comment on this kind of too deep
indentation already).

> +                tmp_permissions =
> +                    PTE_EXECUTABLE | PTE_READABLE | PTE_VALID;
> +
> +            if ( is_kernel_rodata(LINK_TO_LOAD(page_addr)) )
> +                tmp_permissions = PTE_READABLE | PTE_VALID;
> +
> +            pte_to_be_written = paddr_to_pte(paddr, tmp_permissions);
> +
> +            if ( !pte_is_valid(pgtbl[index]) )
> +                pgtbl[index] = pte_to_be_written;
> +            else
> +            {
> +                /*
> +                * get an adresses of current pte and that one to
> +                * be written
> +                */

Nit: Style (one missing blank each in the last three lines, and comment
text wants to start with a capital letter).

> +                unsigned long curr_pte =
> +                    pgtbl[index].pte & ~(PTE_DIRTY | PTE_ACCESSED);

While technically "unsigned long" is okay to use here (afaict), I'd still
recommend ...

> +                pte_to_be_written.pte &= ~(PTE_DIRTY | PTE_ACCESSED);
> +
> +                if ( curr_pte != pte_to_be_written.pte )

... doing everything in terms of pte_t:

                pte_t curr_pte = pgtbl[index];

                curr_pte.pte &= ~(PTE_DIRTY | PTE_ACCESSED);
                pte_to_be_written.pte &= ~(PTE_DIRTY | PTE_ACCESSED);

                if ( curr_pte.pte != pte_to_be_written.pte )

or perhaps even simply

                if ( (pgtbl[index].pte ^ pte_to_be_written.pte) &
                      ~(PTE_DIRTY | PTE_ACCESSED) )

> +                {
> +                    early_printk("PTE overridden has occurred\n");
> +                    /* panic(), <asm/bug.h> aren't ready now. */
> +                    die();
> +                }
> +            }
> +        }
> +    }
> +    #undef HANDLE_PGTBL

Nit: Such an #undef, even if technically okay either way, would imo
better live in the same scope (and have the same indentation) as the
corresponding #define. Since your #define is ahead of the function, it
would ...

> +}

... then live here.

> +static void __init calc_pgtbl_lvls_num(struct  mmu_desc *mmu_desc)

Nit: Double blank after "struct".

> +{
> +    unsigned long satp_mode = RV_STAGE1_MODE;
> +
> +    /* Number of page table levels */
> +    switch (satp_mode)
> +    {
> +    case SATP_MODE_SV32:
> +        mmu_desc->num_levels = 2;
> +        break;
> +    case SATP_MODE_SV39:
> +        mmu_desc->num_levels = 3;
> +        break;
> +    case SATP_MODE_SV48:
> +        mmu_desc->num_levels = 4;
> +        break;
> +    default:
> +        early_printk("(XEN) Unsupported SATP_MODE\n");
> +        die();
> +    }
> +}

Do you really need this function anymore? Isn't it now simply

    mmu_desc.num_levels = CONFIG_PAGING_LEVELS?

in the caller? Which could then even be the variable's initializer
there?
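
(I.e., given the field order of struct mmu_desc, the sketch in
setup_initial_pagetables() would be simply:

    struct mmu_desc mmu_desc = { CONFIG_PAGING_LEVELS, 0, NULL, NULL };
)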

> +static bool __init check_pgtbl_mode_support(struct mmu_desc *mmu_desc,
> +                                            unsigned long load_start,
> +                                            unsigned long satp_mode)
> +{
> +    bool is_mode_supported = false;
> +    unsigned int index;
> +    unsigned int page_table_level = (mmu_desc->num_levels - 1);
> +    unsigned level_map_mask = XEN_PT_LEVEL_MAP_MASK(page_table_level);
> +
> +    unsigned long aligned_load_start = load_start & level_map_mask;
> +    unsigned long aligned_page_size = XEN_PT_LEVEL_SIZE(page_table_level);
> +    unsigned long xen_size = (unsigned long)(_end - start);
> +
> +    if ( (load_start + xen_size) > (aligned_load_start + aligned_page_size) )
> +    {
> +        early_printk("please place Xen to be in range of PAGE_SIZE "
> +                     "where PAGE_SIZE is XEN_PT_LEVEL_SIZE( {L3 | L2 | L1} ) "
> +                     "depending on expected SATP_MODE \n"
> +                     "XEN_PT_LEVEL_SIZE is defined in <asm/page.h>\n");
> +        die();
> +    }
> +
> +    index = pt_index(page_table_level, aligned_load_start);
> +    stage1_pgtbl_root[index] = paddr_to_pte(aligned_load_start,
> +                                            PTE_LEAF_DEFAULT | PTE_EXECUTABLE);
> +
> +    asm volatile ( "sfence.vma" );
> +    csr_write(CSR_SATP,
> +              PFN_DOWN((unsigned long)stage1_pgtbl_root) |
> +              satp_mode << SATP_MODE_SHIFT);
> +
> +    if ( (csr_read(CSR_SATP) >> SATP_MODE_SHIFT) == satp_mode )
> +        is_mode_supported = true;

Just as a remark: While plain shift is kind of okay here, because the field
is the top-most one in the register, generally you will want to get used to
MASK_EXTR() (and MASK_INSR()) in cases like this one.
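
For illustration, with a suitable mask constant (SATP_MODE_MASK is used
here only as a placeholder name for whatever constant covers the mode
field) that would read:

    if ( MASK_EXTR(csr_read(CSR_SATP), SATP_MODE_MASK) == satp_mode )
        is_mode_supported = true;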

> +    csr_write(CSR_SATP, 0);
> +
> +    /* Clean MMU root page table */
> +    stage1_pgtbl_root[index] = paddr_to_pte(0x0, 0x0);
> +
> +    asm volatile ( "sfence.vma" );

Doesn't this want to move between the SATP write and the clearing of the
root table slot? Also here and elsewhere - don't these asm()s need memory
clobbers? And anyway - could I talk you into introducing an inline wrapper
(e.g. named sfence_vma()) so all uses end up consistent?
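
Such a wrapper could be as small as (a sketch, with the memory clobber
included per the question above):

static inline void sfence_vma(void)
{
    asm volatile ( "sfence.vma" ::: "memory" );
}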

> +void __init setup_initial_pagetables(void)
> +{
> +    struct mmu_desc mmu_desc = { 0, 0, NULL, NULL };
> +
> +    /*
> +     * Access to _stard, _end is always PC-relative

Nit: Typo-ed symbol name. Also ...

> +     * thereby when access them we will get load adresses
> +     * of start and end of Xen
> +     * To get linker addresses LOAD_TO_LINK() is required
> +     * to use
> +     */

see the earlier line wrapping remark again. Finally in multi-sentence
comments full stops are required.

> +    unsigned long load_start    = (unsigned long)_start;
> +    unsigned long load_end      = (unsigned long)_end;
> +    unsigned long linker_start  = LOAD_TO_LINK(load_start);
> +    unsigned long linker_end    = LOAD_TO_LINK(load_end);
> +
> +    if ( (linker_start != load_start) &&
> +         (linker_start <= load_end) && (load_start <= linker_end) )
> +    {
> +        early_printk("(XEN) linker and load address ranges overlap\n");
> +        die();
> +    }
> +
> +    calc_pgtbl_lvls_num(&mmu_desc);
> +
> +    if ( !check_pgtbl_mode_support(&mmu_desc, load_start, RV_STAGE1_MODE) )

It is again questionable here whether passing a constant to a function
at its sole call site is really useful.

> +void __init noinline enable_mmu()
> +{
> +    /*
> +     * Calculate a linker time address of the mmu_is_enabled
> +     * label and update CSR_STVEC with it.
> +     * MMU is configured in a way where linker addresses are mapped
> +     * on load addresses so in a case when linker addresses are not equal
> +     * to load addresses, after MMU is enabled, it will cause
> +     * an exception and jump to linker time addresses.
> +     * Otherwise if load addresses are equal to linker addresses the code
> +     * after mmu_is_enabled label will be executed without exception.
> +     */
> +    csr_write(CSR_STVEC, LOAD_TO_LINK((unsigned long)&&mmu_is_enabled));
> +
> +    /* Ensure page table writes precede loading the SATP */
> +    asm volatile ( "sfence.vma" );
> +
> +    /* Enable the MMU and load the new pagetable for Xen */
> +    csr_write(CSR_SATP,
> +              PFN_DOWN((unsigned long)stage1_pgtbl_root) |
> +              RV_STAGE1_MODE << SATP_MODE_SHIFT);
> +
> +    asm volatile ( ".align 2" );

May I suggest to avoid .align, and to prefer .balign and .p2align instead?
This helps people coming from different architectures, as which of the two
.align resolves to is arch-dependent.
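
E.g. either of these expresses the same 4-byte alignment unambiguously:

    asm volatile ( ".p2align 2" );  /* 1 << 2 bytes */
    asm volatile ( ".balign 4" );   /* 4 bytes */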

> --- a/xen/arch/riscv/riscv64/head.S
> +++ b/xen/arch/riscv/riscv64/head.S
> @@ -1,4 +1,5 @@
>  #include <asm/asm.h>
> +#include <asm/asm-offsets.h>
>  #include <asm/riscv_encoding.h>
>  
>          .section .text.header, "ax", %progbits

How does a need for this arise all of the sudden, without other changes
to this file? Is this maybe a leftover which wants dropping?

> --- a/xen/arch/riscv/xen.lds.S
> +++ b/xen/arch/riscv/xen.lds.S
> @@ -137,6 +137,7 @@ SECTIONS
>      __init_end = .;
>  
>      .bss : {                     /* BSS */
> +        . = ALIGN(POINTER_ALIGN);
>          __bss_start = .;
>          *(.bss.stack_aligned)
>          . = ALIGN(PAGE_SIZE);

Doesn't this want to be a separate change?

Jan
Oleksii Kurochko May 9, 2023, 12:59 p.m. UTC | #2
On Mon, 2023-05-08 at 10:58 +0200, Jan Beulich wrote:
> On 03.05.2023 18:31, Oleksii Kurochko wrote:
> > --- a/xen/arch/riscv/include/asm/config.h
> > +++ b/xen/arch/riscv/include/asm/config.h
> > @@ -70,12 +70,23 @@
> >    name:
> >  #endif
> >  
> > -#define XEN_VIRT_START  _AT(UL, 0x80200000)
> > +#ifdef CONFIG_RISCV_64
> > +#define XEN_VIRT_START 0xFFFFFFFFC0000000 /* (_AC(-1, UL) + 1 -
> > GB(1)) */
> > +#else
> > +#error "RV32 isn't supported"
> > +#endif
> >  
> >  #define SMP_CACHE_BYTES (1 << 6)
> >  
> >  #define STACK_SIZE PAGE_SIZE
> >  
> > +#ifdef CONFIG_RISCV_64
> > +#define CONFIG_PAGING_LEVELS 3
> > +#define RV_STAGE1_MODE SATP_MODE_SV39
> > +#else
> 
> #define CONFIG_PAGING_LEVELS 2
> 
> (or else I would think ...
> 
> > +#define RV_STAGE1_MODE SATP_MODE_SV32
> 
> ... this and hence the entire "#else" should also be omitted)
Agree. It will be better to set CONFIG_PAGING_LEVELS for RV32 too.

> 
> > --- /dev/null
> > +++ b/xen/arch/riscv/include/asm/current.h
> > @@ -0,0 +1,10 @@
> > +#ifndef __ASM_CURRENT_H
> > +#define __ASM_CURRENT_H
> > +
> > +#define switch_stack_and_jump(stack, fn)    \
> > +    asm volatile (                          \
> > +            "mv sp, %0 \n"                  \
> > +            "j " #fn :: "r" (stack) :       \
> > +    )
> 
> Nit: Style. Our normal way of writing this would be
> 
> #define switch_stack_and_jump(stack, fn)    \
>     asm volatile (                          \
>             "mv sp, %0\n"                   \
>             "j " #fn :: "r" (stack) )
> 
> i.e. unnecessary colon omitted, no trailing blank on any generated
> assembly line, and closing paren not placed on its own line (only
> closing curly braces would normally live on their own lines).
Thanks for clarification

> 
> However, as to the 3rd colon: Can you really get away here without a
> memory clobber (i.e. the construct being a full compiler barrier)?
I am not 100% sure so to be safe I would add memory clobber. Thanks.

> 
> Further I think you want to make the use of "fn" visible to the
> compiler, by using an "X" constraint just like Arm does.
> 
> Finally I think you want to add unreachable(), like both Arm and x86
> have it. Plus the "noreturn" on relevant functions.
It makes sense. I'll take into account.
> 
> > --- /dev/null
> > +++ b/xen/arch/riscv/include/asm/page.h
> > @@ -0,0 +1,62 @@
> > +#ifndef _ASM_RISCV_PAGE_H
> > +#define _ASM_RISCV_PAGE_H
> > +
> > +#include <xen/const.h>
> > +#include <xen/types.h>
> > +
> > +#define VPN_MASK                    ((unsigned
> > long)(PAGETABLE_ENTRIES - 1))
> > +
> > +#define XEN_PT_LEVEL_ORDER(lvl)     ((lvl) * PAGETABLE_ORDER)
> > +#define XEN_PT_LEVEL_SHIFT(lvl)     (XEN_PT_LEVEL_ORDER(lvl) +
> > PAGE_SHIFT)
> > +#define XEN_PT_LEVEL_SIZE(lvl)      (_AT(paddr_t, 1) <<
> > XEN_PT_LEVEL_SHIFT(lvl))
> > +#define XEN_PT_LEVEL_MAP_MASK(lvl)  (~(XEN_PT_LEVEL_SIZE(lvl) -
> > 1))
> > +#define XEN_PT_LEVEL_MASK(lvl)      (VPN_MASK <<
> > XEN_PT_LEVEL_SHIFT(lvl))
> > +
> > +#define PTE_VALID                   BIT(0, UL)
> > +#define PTE_READABLE                BIT(1, UL)
> > +#define PTE_WRITABLE                BIT(2, UL)
> > +#define PTE_EXECUTABLE              BIT(3, UL)
> > +#define PTE_USER                    BIT(4, UL)
> > +#define PTE_GLOBAL                  BIT(5, UL)
> > +#define PTE_ACCESSED                BIT(6, UL)
> > +#define PTE_DIRTY                   BIT(7, UL)
> > +#define PTE_RSW                     (BIT(8, UL) | BIT(9, UL))
> > +
> > +#define PTE_LEAF_DEFAULT            (PTE_VALID | PTE_READABLE |
> > PTE_WRITABLE)
> > +#define PTE_TABLE                   (PTE_VALID)
> > +
> > +/* Calculate the offsets into the pagetables for a given VA */
> > +#define pt_linear_offset(lvl, va)   ((va) >>
> > XEN_PT_LEVEL_SHIFT(lvl))
> > +
> > +#define pt_index(lvl, va) pt_linear_offset(lvl, (va) &
> > XEN_PT_LEVEL_MASK(lvl))
> 
> Maybe better
> 
> #define pt_index(lvl, va) (pt_linear_offset(lvl, va) & VPN_MASK)
> 
> as the involved constant will be easier to use for the compiler?
But VPN_MASK should be shifted by the level shift value.

Or did you mean that it would be better to pre-calculate the MASK for each
level instead of defining MASK as (VPN_MASK <<
XEN_PT_LEVEL_SHIFT(lvl))?

Probably I have to re-check that, but taking into account that all
values are defined at compile time they will be pre-calculated,
so it shouldn't make a big difference for the compiler.
> 
> > +/* Page Table entry */
> > +typedef struct {
> > +#ifdef CONFIG_RISCV_64
> > +    uint64_t pte;
> > +#else
> > +    uint32_t pte;
> > +#endif
> > +} pte_t;
> > +
> > +#define pte_to_addr(x) (((paddr_t)(x) >> PTE_PPN_SHIFT) <<
> > PAGE_SHIFT)
> > +
> > +#define addr_to_pte(x) (((x) >> PAGE_SHIFT) << PTE_PPN_SHIFT)
> 
> Are these two macros useful on their own? I ask because I still
> consider
> them somewhat misnamed, as they don't produce / consume a PTE (but
> the
> raw value). Yet generally you want to avoid any code operating on raw
> values, using pte_t instead. IOW I would hope to be able to convince
> you to ...
> 
> > +static inline pte_t paddr_to_pte(paddr_t paddr,
> > +                                 unsigned int permissions)
> > +{
> > +    return (pte_t) { .pte = addr_to_pte(paddr) | permissions };
> > +}
> > +
> > +static inline paddr_t pte_to_paddr(pte_t pte)
> > +{
> > +    return pte_to_addr(pte.pte);
> > +}
> 
> ... expand them in the two inline functions and then drop the macros.
I thought about dropping the macros as they are now used only in the
functions mentioned above (initially the macros were introduced for
other code, but that code was re-written and the macros aren't needed
anymore).
So yes, let's remove them.

> 
> > --- /dev/null
> > +++ b/xen/arch/riscv/mm.c
> > @@ -0,0 +1,315 @@
> > +#include <xen/compiler.h>
> > +#include <xen/init.h>
> > +#include <xen/kernel.h>
> > +#include <xen/pfn.h>
> > +
> > +#include <asm/early_printk.h>
> > +#include <asm/config.h>
> 
> No inclusion of asm/config.h (or xen/config.h) in any new code
> please.
> For quite some time xen/config.h has been forcefully included
> everywhere
> by the build system.
Sure. I'll remove the inclusion then.

> 
> > +/*
> > + * It is expected that Xen won't be more then 2 MB.
> > + * The check in xen.lds.S guarantees that.
> > + * At least 3 page (in case of Sv39 )
> > + * tables are needed to cover 2 MB.
> 
> I guess this reads a little better as "At least 3 page tables (in
> case of
> Sv39) ..." or "At least 3 (in case of Sv39) page tables ..." Also
> could I
> talk you into using the full 80 columns instead of wrapping early in
> the
> middle of a sentence?
Sure, I'll use full 80 columns.

> 
> > One for each page level
> > + * table with PAGE_SIZE = 4 Kb
> > + *
> > + * One L0 page table can cover 2 MB
> > + * (512 entries of one page table * PAGE_SIZE).
> > + *
> > + */
> > +#define PGTBL_INITIAL_COUNT ((CONFIG_PAGING_LEVELS - 1) + 1)
> 
> Hmm, did you lose the part of the comment explaining the "+ 1"? Or
> did
> the need for that actually go away (and you should drop it here)?
I lost it, so it should be added back. Thanks.

> 
> > +#define PGTBL_ENTRY_AMOUNT  (PAGE_SIZE / sizeof(pte_t))
> 
> Isn't this PAGETABLE_ENTRIES (and the constant hence unnecessary)?
It is. I'll re-use PAGETABLE_ENTRIES
> 
> > +static void __init setup_initial_mapping(struct mmu_desc
> > *mmu_desc,
> > +                                         unsigned long map_start,
> > +                                         unsigned long map_end,
> > +                                         unsigned long pa_start,
> > +                                         unsigned int permissions)
> 
> What use is the last parameter, when the sole caller passes
> PTE_LEAF_DEFAULT (i.e. a build-time constant)? Then ...
It's not used anymore, and taking into account that I boot Dom0
successfully while always passing PTE_LEAF_DEFAULT to
setup_initial_mapping(), I'll drop it.
> 
> > +{
> > +    unsigned int index;
> > +    pte_t *pgtbl;
> > +    unsigned long page_addr;
> > +    pte_t pte_to_be_written;
> > +    unsigned long paddr;
> > +    unsigned int tmp_permissions;
> 
> ... this variable (which would better be of more narrow scope anyway,
> like
> perhaps several more of the above) could also have its tmp_ prefix
> dropped
> afaict.
> 
> > +    if ( (unsigned long)_start % XEN_PT_LEVEL_SIZE(0) )
> > +    {
> > +        early_printk("(XEN) Xen should be loaded at 4k
> > boundary\n");
> > +        die();
> > +    }
> > +
> > +    if ( map_start & ~XEN_PT_LEVEL_MAP_MASK(0) ||
> > +         pa_start & ~XEN_PT_LEVEL_MAP_MASK(0) )
> > +    {
> > +        early_printk("(XEN) map and pa start addresses should be
> > aligned\n");
> > +        /* panic(), BUG() or ASSERT() aren't ready now. */
> > +        die();
> > +    }
> > +
> > +    for ( page_addr = map_start; page_addr < map_end; page_addr +=
> > XEN_PT_LEVEL_SIZE(0) )
> 
> Nit: Style (line looks to be too long now).
> 
>     for ( page_addr = map_start; page_addr < map_end;
>           page_addr += XEN_PT_LEVEL_SIZE(0) )
> 
> is the way we would typically wrap it, but
> 
>     for ( page_addr = map_start;
>           page_addr < map_end;
>           page_addr += XEN_PT_LEVEL_SIZE(0) )
> 
> would of course also be okay if you preferred that.
The last option looks better to me. Thanks.
> 
> > +    {
> > +        pgtbl = mmu_desc->pgtbl_base;
> > +
> > +        switch ( mmu_desc->num_levels )
> > +        {
> > +        case 4: /* Level 3 */
> > +            HANDLE_PGTBL(3);
> > +        case 3: /* Level 2 */
> > +            HANDLE_PGTBL(2);
> > +        case 2: /* Level 1 */
> > +            HANDLE_PGTBL(1);
> > +        case 1: /* Level 0 */
> > +            index = pt_index(0, page_addr);
> > +            paddr = (page_addr - map_start) + pa_start;
> > +
> > +            tmp_permissions = permissions;
> > +
> > +            if ( is_kernel_text(LINK_TO_LOAD(page_addr)) ||
> > +                    is_kernel_inittext(LINK_TO_LOAD(page_addr)) )
> 
> Nit: Style (and I'm pretty sure I did comment on this kind of too
> deep
> indentation already).
Sorry. I'll double check that and fix.

> 
> > +                tmp_permissions =
> > +                    PTE_EXECUTABLE | PTE_READABLE | PTE_VALID;
> > +
> > +            if ( is_kernel_rodata(LINK_TO_LOAD(page_addr)) )
> > +                tmp_permissions = PTE_READABLE | PTE_VALID;
> > +
> > +            pte_to_be_written = paddr_to_pte(paddr,
> > tmp_permissions);
> > +
> > +            if ( !pte_is_valid(pgtbl[index]) )
> > +                pgtbl[index] = pte_to_be_written;
> > +            else
> > +            {
> > +                /*
> > +                * get an adresses of current pte and that one to
> > +                * be written
> > +                */
> 
> Nit: Style (one missing blank each in the last three lines, and
> comment
> text wants to start with a capital letter).
Thanks.

> 
> > +                unsigned long curr_pte =
> > +                    pgtbl[index].pte & ~(PTE_DIRTY |
> > PTE_ACCESSED);
> 
> While technically "unsigned long" is okay to use here (afaict), I'd
> still
> recommend ...
> 
> > +                pte_to_be_written.pte &= ~(PTE_DIRTY |
> > PTE_ACCESSED);
> > +
> > +                if ( curr_pte != pte_to_be_written.pte )
> 
> ... doing everything in terms of pte_t:
> 
>                 pte_t curr_pte = pgtbl[index];
> 
>                 curr_pte.pte &= ~(PTE_DIRTY | PTE_ACCESSED);
>                 pte_to_be_written.pte &= ~(PTE_DIRTY | PTE_ACCESSED);
> 
>                 if ( curr_pte.pte != pte_to_be_written.pte )
> 
> or perhaps even simply
It makes sense to me to use pte_t, so I will switch to it.

> 
>                 if ( (pgtbl[index].pte ^ pte_to_be_written.pte) &
>                       ~(PTE_DIRTY | PTE_ACCESSED) )
> 
> > +                {
> > +                    early_printk("PTE overridden has occurred\n");
> > +                    /* panic(), <asm/bug.h> aren't ready now. */
> > +                    die();
> > +                }
> > +            }
> > +        }
> > +    }
> > +    #undef HANDLE_PGTBL
> 
> Nit: Such an #undef, even if technically okay either way, would imo
> better live in the same scope (and have the same indentation) as the
> corresponding #define. Since your #define is ahead of the function,
> it
> would ...
> 
> > +}
> 
> ... then live here.
Thanks. I'll take into account.

> 
> > +static void __init calc_pgtbl_lvls_num(struct  mmu_desc *mmu_desc)
> 
> Nit: Double blank after "struct".
will remove it. thanks.

> 
> > +{
> > +    unsigned long satp_mode = RV_STAGE1_MODE;
> > +
> > +    /* Number of page table levels */
> > +    switch (satp_mode)
> > +    {
> > +    case SATP_MODE_SV32:
> > +        mmu_desc->num_levels = 2;
> > +        break;
> > +    case SATP_MODE_SV39:
> > +        mmu_desc->num_levels = 3;
> > +        break;
> > +    case SATP_MODE_SV48:
> > +        mmu_desc->num_levels = 4;
> > +        break;
> > +    default:
> > +        early_printk("(XEN) Unsupported SATP_MODE\n");
> > +        die();
> > +    }
> > +}
> 
> Do you really need this function anymore? Isn't it now simply
> 
>     mmu_desc.num_levels = CONFIG_PAGING_LEVELS?
> 
> in the caller? Which could then even be the variable's initializer
> there?
You are right; after the introduction of CONFIG_PAGING_LEVELS for RISC-V we
can remove the code you mentioned.
> 
> > +static bool __init check_pgtbl_mode_support(struct mmu_desc
> > *mmu_desc,
> > +                                            unsigned long
> > load_start,
> > +                                            unsigned long
> > satp_mode)
> > +{
> > +    bool is_mode_supported = false;
> > +    unsigned int index;
> > +    unsigned int page_table_level = (mmu_desc->num_levels - 1);
> > +    unsigned level_map_mask =
> > XEN_PT_LEVEL_MAP_MASK(page_table_level);
> > +
> > +    unsigned long aligned_load_start = load_start &
> > level_map_mask;
> > +    unsigned long aligned_page_size =
> > XEN_PT_LEVEL_SIZE(page_table_level);
> > +    unsigned long xen_size = (unsigned long)(_end - start);
> > +
> > +    if ( (load_start + xen_size) > (aligned_load_start +
> > aligned_page_size) )
> > +    {
> > +        early_printk("please place Xen to be in range of PAGE_SIZE
> > "
> > +                     "where PAGE_SIZE is XEN_PT_LEVEL_SIZE( {L3 |
> > L2 | L1} ) "
> > +                     "depending on expected SATP_MODE \n"
> > +                     "XEN_PT_LEVEL_SIZE is defined in
> > <asm/page.h>\n");
> > +        die();
> > +    }
> > +
> > +    index = pt_index(page_table_level, aligned_load_start);
> > +    stage1_pgtbl_root[index] = paddr_to_pte(aligned_load_start,
> > +                                            PTE_LEAF_DEFAULT |
> > PTE_EXECUTABLE);
> > +
> > +    asm volatile ( "sfence.vma" );
> > +    csr_write(CSR_SATP,
> > +              PFN_DOWN((unsigned long)stage1_pgtbl_root) |
> > +              satp_mode << SATP_MODE_SHIFT);
> > +
> > +    if ( (csr_read(CSR_SATP) >> SATP_MODE_SHIFT) == satp_mode )
> > +        is_mode_supported = true;
> 
> Just as a remark: While plain shift is kind of okay here, because the
> field
> is the top-most one in the register, generally you will want to get
> used to
> MASK_EXTR() (and MASK_INSR()) in cases like this one.
Thanks. I'll look at them. 
> 
> > +    csr_write(CSR_SATP, 0);
> > +
> > +    /* Clean MMU root page table */
> > +    stage1_pgtbl_root[index] = paddr_to_pte(0x0, 0x0);
> > +
> > +    asm volatile ( "sfence.vma" );
> 
> Doesn't this want to move between the SATP write and the clearing of
> the
> root table slot? Also here and elsewhere - don't these asm()s need
> memory
> clobbers? And anyway - could I talk you into introducing an inline
> wrapper
> (e.g. named sfence_vma()) so all uses end up consistent?
I think clearing of the root page table should be done before "sfence.vma"
because we have to first clear the slot of the MMU's root page table and
then make the root page table update visible to all (by use of the sfence
instruction).

We can add a memory clobber to be sure, but it looks like a
duplication of the "sfence" instruction.

I agree that it would be nice to introduce a wrapper.

> 
> > +void __init setup_initial_pagetables(void)
> > +{
> > +    struct mmu_desc mmu_desc = { 0, 0, NULL, NULL };
> > +
> > +    /*
> > +     * Access to _stard, _end is always PC-relative
> 
> Nit: Typo-ed symbol name. Also ...
> 
> > +     * thereby when access them we will get load adresses
> > +     * of start and end of Xen
> > +     * To get linker addresses LOAD_TO_LINK() is required
> > +     * to use
> > +     */
> 
> see the earlier line wrapping remark again. Finally in multi-sentence
> comments full stops are required.
Full stops mean '.' at the end of sentences?
> 
> > +    unsigned long load_start    = (unsigned long)_start;
> > +    unsigned long load_end      = (unsigned long)_end;
> > +    unsigned long linker_start  = LOAD_TO_LINK(load_start);
> > +    unsigned long linker_end    = LOAD_TO_LINK(load_end);
> > +
> > +    if ( (linker_start != load_start) &&
> > +         (linker_start <= load_end) && (load_start <= linker_end)
> > )
> > +    {
> > +        early_printk("(XEN) linker and load address ranges
> > overlap\n");
> > +        die();
> > +    }
> > +
> > +    calc_pgtbl_lvls_num(&mmu_desc);
> > +
> > +    if ( !check_pgtbl_mode_support(&mmu_desc, load_start,
> > RV_STAGE1_MODE) )
> 
> It is again questionable here whether passing a constant to a
> function
> at its sole call site is really useful.
It looks like there is not much sense in passing a constant directly to
the function, so I'll update that.
> 
> > +void __init noinline enable_mmu()
> > +{
> > +    /*
> > +     * Calculate a linker time address of the mmu_is_enabled
> > +     * label and update CSR_STVEC with it.
> > +     * MMU is configured in a way where linker addresses are
> > mapped
> > +     * on load addresses so in a case when linker addresses are
> > not equal
> > +     * to load addresses, after MMU is enabled, it will cause
> > +     * an exception and jump to linker time addresses.
> > +     * Otherwise if load addresses are equal to linker addresses
> > the code
> > +     * after mmu_is_enabled label will be executed without
> > exception.
> > +     */
> > +    csr_write(CSR_STVEC, LOAD_TO_LINK((unsigned
> > long)&&mmu_is_enabled));
> > +
> > +    /* Ensure page table writes precede loading the SATP */
> > +    asm volatile ( "sfence.vma" );
> > +
> > +    /* Enable the MMU and load the new pagetable for Xen */
> > +    csr_write(CSR_SATP,
> > +              PFN_DOWN((unsigned long)stage1_pgtbl_root) |
> > +              RV_STAGE1_MODE << SATP_MODE_SHIFT);
> > +
> > +    asm volatile ( ".align 2" );
> 
> May I suggest to avoid .align, and to prefer .balign and .p2align
> instead?
> This helps people coming from different architectures, as which of
> the two
> .align resolves to is arch-dependent.
Sure. Thanks. I thought that .p2align was mostly used in assembly code.
But yeah, I'll use it instead of 'asm volatile ( ".align 2" );'.

> 
> > --- a/xen/arch/riscv/riscv64/head.S
> > +++ b/xen/arch/riscv/riscv64/head.S
> > @@ -1,4 +1,5 @@
> >  #include <asm/asm.h>
> > +#include <asm/asm-offsets.h>
> >  #include <asm/riscv_encoding.h>
> >  
> >          .section .text.header, "ax", %progbits
> 
> How does a need for this arise all of the sudden, without other
> changes
> to this file? Is this maybe a leftover which wants dropping?
Yeah, it is really weird. I'll check that.

> 
> > --- a/xen/arch/riscv/xen.lds.S
> > +++ b/xen/arch/riscv/xen.lds.S
> > @@ -137,6 +137,7 @@ SECTIONS
> >      __init_end = .;
> >  
> >      .bss : {                     /* BSS */
> > +        . = ALIGN(POINTER_ALIGN);
> >          __bss_start = .;
> >          *(.bss.stack_aligned)
> >          . = ALIGN(PAGE_SIZE);
> 
> Doesn't this want to be a separate change?
Yes, it should be. I'll do that in a new version of the patch series.

~ Oleksii
Jan Beulich May 9, 2023, 2:38 p.m. UTC | #3
On 09.05.2023 14:59, Oleksii wrote:
> On Mon, 2023-05-08 at 10:58 +0200, Jan Beulich wrote:
>> On 03.05.2023 18:31, Oleksii Kurochko wrote:
>>> --- /dev/null
>>> +++ b/xen/arch/riscv/include/asm/page.h
>>> @@ -0,0 +1,62 @@
>>> +#ifndef _ASM_RISCV_PAGE_H
>>> +#define _ASM_RISCV_PAGE_H
>>> +
>>> +#include <xen/const.h>
>>> +#include <xen/types.h>
>>> +
>>> +#define VPN_MASK                    ((unsigned
>>> long)(PAGETABLE_ENTRIES - 1))
>>> +
>>> +#define XEN_PT_LEVEL_ORDER(lvl)     ((lvl) * PAGETABLE_ORDER)
>>> +#define XEN_PT_LEVEL_SHIFT(lvl)     (XEN_PT_LEVEL_ORDER(lvl) +
>>> PAGE_SHIFT)
>>> +#define XEN_PT_LEVEL_SIZE(lvl)      (_AT(paddr_t, 1) <<
>>> XEN_PT_LEVEL_SHIFT(lvl))
>>> +#define XEN_PT_LEVEL_MAP_MASK(lvl)  (~(XEN_PT_LEVEL_SIZE(lvl) -
>>> 1))
>>> +#define XEN_PT_LEVEL_MASK(lvl)      (VPN_MASK <<
>>> XEN_PT_LEVEL_SHIFT(lvl))
>>> +
>>> +#define PTE_VALID                   BIT(0, UL)
>>> +#define PTE_READABLE                BIT(1, UL)
>>> +#define PTE_WRITABLE                BIT(2, UL)
>>> +#define PTE_EXECUTABLE              BIT(3, UL)
>>> +#define PTE_USER                    BIT(4, UL)
>>> +#define PTE_GLOBAL                  BIT(5, UL)
>>> +#define PTE_ACCESSED                BIT(6, UL)
>>> +#define PTE_DIRTY                   BIT(7, UL)
>>> +#define PTE_RSW                     (BIT(8, UL) | BIT(9, UL))
>>> +
>>> +#define PTE_LEAF_DEFAULT            (PTE_VALID | PTE_READABLE |
>>> PTE_WRITABLE)
>>> +#define PTE_TABLE                   (PTE_VALID)
>>> +
>>> +/* Calculate the offsets into the pagetables for a given VA */
>>> +#define pt_linear_offset(lvl, va)   ((va) >>
>>> XEN_PT_LEVEL_SHIFT(lvl))
>>> +
>>> +#define pt_index(lvl, va) pt_linear_offset(lvl, (va) &
>>> XEN_PT_LEVEL_MASK(lvl))
>>
>> Maybe better
>>
>> #define pt_index(lvl, va) (pt_linear_offset(lvl, va) & VPN_MASK)
>>
>> as the involved constant will be easier to use for the compiler?
> But VPN_MASK should be shifted by level shift value.

Why? pt_linear_offset() already does the necessary shifting.

>>> +    csr_write(CSR_SATP, 0);
>>> +
>>> +    /* Clean MMU root page table */
>>> +    stage1_pgtbl_root[index] = paddr_to_pte(0x0, 0x0);
>>> +
>>> +    asm volatile ( "sfence.vma" );
>>
>> Doesn't this want to move between the SATP write and the clearing of
>> the
>> root table slot? Also here and elsewhere - don't these asm()s need
>> memory
>> clobbers? And anyway - could I talk you into introducing an inline
>> wrapper
>> (e.g. named sfence_vma()) so all uses end up consistent?
> I think clearing of root page table should be done before "sfence.vma"
> because we have to first clear slot of MMU's root page table and then
> make updating root page table visible for all. ( by usage of sfence
> instruction )

I disagree. The SATP write has removed the connection of the CPU
to the page tables. That's the action you want to fence, not the
altering of some (then) no longer referenced data structure.
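
I.e. the intended ordering would be along the lines of (a sketch, using the
sfence_vma() wrapper suggested earlier):

    csr_write(CSR_SATP, 0);

    /* Fence the disconnect from the page tables ... */
    sfence_vma();

    /* ... before touching the now unreferenced root slot. */
    stage1_pgtbl_root[index] = paddr_to_pte(0x0, 0x0);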

>>> +void __init setup_initial_pagetables(void)
>>> +{
>>> +    struct mmu_desc mmu_desc = { 0, 0, NULL, NULL };
>>> +
>>> +    /*
>>> +     * Access to _stard, _end is always PC-relative
>>
>> Nit: Typo-ed symbol name. Also ...
>>
>>> +     * thereby when access them we will get load adresses
>>> +     * of start and end of Xen
>>> +     * To get linker addresses LOAD_TO_LINK() is required
>>> +     * to use
>>> +     */
>>
>> see the earlier line wrapping remark again. Finally in multi-sentence
>> comments full stops are required.
> Full stops mean '.' at the end of sentences?

Yes. Please see ./CODING_STYLE.

Jan
Oleksii Kurochko May 11, 2023, 8:14 a.m. UTC | #4
On Tue, 2023-05-09 at 16:38 +0200, Jan Beulich wrote:
> On 09.05.2023 14:59, Oleksii wrote:
> > On Mon, 2023-05-08 at 10:58 +0200, Jan Beulich wrote:
> > > On 03.05.2023 18:31, Oleksii Kurochko wrote:
> > > > --- /dev/null
> > > > +++ b/xen/arch/riscv/include/asm/page.h
> > > > @@ -0,0 +1,62 @@
> > > > +#ifndef _ASM_RISCV_PAGE_H
> > > > +#define _ASM_RISCV_PAGE_H
> > > > +
> > > > +#include <xen/const.h>
> > > > +#include <xen/types.h>
> > > > +
> > > > +#define VPN_MASK                    ((unsigned
> > > > long)(PAGETABLE_ENTRIES - 1))
> > > > +
> > > > +#define XEN_PT_LEVEL_ORDER(lvl)     ((lvl) * PAGETABLE_ORDER)
> > > > +#define XEN_PT_LEVEL_SHIFT(lvl)     (XEN_PT_LEVEL_ORDER(lvl) +
> > > > PAGE_SHIFT)
> > > > +#define XEN_PT_LEVEL_SIZE(lvl)      (_AT(paddr_t, 1) <<
> > > > XEN_PT_LEVEL_SHIFT(lvl))
> > > > +#define XEN_PT_LEVEL_MAP_MASK(lvl)  (~(XEN_PT_LEVEL_SIZE(lvl)
> > > > -
> > > > 1))
> > > > +#define XEN_PT_LEVEL_MASK(lvl)      (VPN_MASK <<
> > > > XEN_PT_LEVEL_SHIFT(lvl))
> > > > +
> > > > +#define PTE_VALID                   BIT(0, UL)
> > > > +#define PTE_READABLE                BIT(1, UL)
> > > > +#define PTE_WRITABLE                BIT(2, UL)
> > > > +#define PTE_EXECUTABLE              BIT(3, UL)
> > > > +#define PTE_USER                    BIT(4, UL)
> > > > +#define PTE_GLOBAL                  BIT(5, UL)
> > > > +#define PTE_ACCESSED                BIT(6, UL)
> > > > +#define PTE_DIRTY                   BIT(7, UL)
> > > > +#define PTE_RSW                     (BIT(8, UL) | BIT(9, UL))
> > > > +
> > > > +#define PTE_LEAF_DEFAULT            (PTE_VALID | PTE_READABLE
> > > > |
> > > > PTE_WRITABLE)
> > > > +#define PTE_TABLE                   (PTE_VALID)
> > > > +
> > > > +/* Calculate the offsets into the pagetables for a given VA */
> > > > +#define pt_linear_offset(lvl, va)   ((va) >>
> > > > XEN_PT_LEVEL_SHIFT(lvl))
> > > > +
> > > > +#define pt_index(lvl, va) pt_linear_offset(lvl, (va) &
> > > > XEN_PT_LEVEL_MASK(lvl))
> > > 
> > > Maybe better
> > > 
> > > #define pt_index(lvl, va) (pt_linear_offset(lvl, va) & VPN_MASK)
> > > 
> > > as the involved constant will be easier to use for the compiler?
> > But VPN_MASK should be shifted by level shift value.
> 
> Why? pt_linear_offset() already does the necessary shifting.
I misread the way you suggested defining pt_index(). I thought you
suggested defining it as "pt_linear_offset(lvl, va & VPN_MASK)" instead
of
"(pt_linear_offset(lvl, va) & VPN_MASK)".

So you are right, we can re-write it as you suggested.

> 
> > > > +    csr_write(CSR_SATP, 0);
> > > > +
> > > > +    /* Clean MMU root page table */
> > > > +    stage1_pgtbl_root[index] = paddr_to_pte(0x0, 0x0);
> > > > +
> > > > +    asm volatile ( "sfence.vma" );
> > > 
> > > Doesn't this want to move between the SATP write and the clearing
> > > of
> > > the
> > > root table slot? Also here and elsewhere - don't these asm()s
> > > need
> > > memory
> > > clobbers? And anyway - could I talk you into introducing an
> > > inline
> > > wrapper
> > > (e.g. named sfence_vma()) so all uses end up consistent?
> > I think clearing of root page table should be done before
> > "sfence.vma"
> > because we have to first clear slot of MMU's root page table and
> > then
> > make updating root page table visible for all. ( by usage of sfence
> > instruction )
> 
> I disagree. The SATP write has removed the connection of the CPU
> to the page tables. That's the action you want to fence, not the
> altering of some (then) no longer referenced data structure.
From that point of view you are right. Thanks for the clarification.
> 
> > > > +void __init setup_initial_pagetables(void)
> > > > +{
> > > > +    struct mmu_desc mmu_desc = { 0, 0, NULL, NULL };
> > > > +
> > > > +    /*
> > > > +     * Access to _stard, _end is always PC-relative
> > > 
> > > Nit: Typo-ed symbol name. Also ...
> > > 
> > > > +     * thereby when access them we will get load adresses
> > > > +     * of start and end of Xen
> > > > +     * To get linker addresses LOAD_TO_LINK() is required
> > > > +     * to use
> > > > +     */
> > > 
> > > see the earlier line wrapping remark again. Finally in multi-
> > > sentence
> > > comments full stops are required.
> > Full stops mean '.' at the end of sentences?
> 
> Yes. Please see ./CODING_STYLE.
Thanks. I'll read it again.

~ Oleksii

Patch

diff --git a/xen/arch/riscv/Makefile b/xen/arch/riscv/Makefile
index 443f6bf15f..956ceb02df 100644
--- a/xen/arch/riscv/Makefile
+++ b/xen/arch/riscv/Makefile
@@ -1,5 +1,6 @@ 
 obj-$(CONFIG_EARLY_PRINTK) += early_printk.o
 obj-y += entry.o
+obj-y += mm.o
 obj-$(CONFIG_RISCV_64) += riscv64/
 obj-y += sbi.o
 obj-y += setup.o
diff --git a/xen/arch/riscv/include/asm/config.h b/xen/arch/riscv/include/asm/config.h
index 73b86ce789..3bd206766e 100644
--- a/xen/arch/riscv/include/asm/config.h
+++ b/xen/arch/riscv/include/asm/config.h
@@ -70,12 +70,23 @@ 
   name:
 #endif
 
-#define XEN_VIRT_START  _AT(UL, 0x80200000)
+#ifdef CONFIG_RISCV_64
+#define XEN_VIRT_START 0xFFFFFFFFC0000000 /* (_AC(-1, UL) + 1 - GB(1)) */
+#else
+#error "RV32 isn't supported"
+#endif
 
 #define SMP_CACHE_BYTES (1 << 6)
 
 #define STACK_SIZE PAGE_SIZE
 
+#ifdef CONFIG_RISCV_64
+#define CONFIG_PAGING_LEVELS 3
+#define RV_STAGE1_MODE SATP_MODE_SV39
+#else
+#define RV_STAGE1_MODE SATP_MODE_SV32
+#endif
+
 #endif /* __RISCV_CONFIG_H__ */
 /*
  * Local variables:
diff --git a/xen/arch/riscv/include/asm/current.h b/xen/arch/riscv/include/asm/current.h
new file mode 100644
index 0000000000..1cb5946fe9
--- /dev/null
+++ b/xen/arch/riscv/include/asm/current.h
@@ -0,0 +1,10 @@ 
+#ifndef __ASM_CURRENT_H
+#define __ASM_CURRENT_H
+
+#define switch_stack_and_jump(stack, fn)    \
+    asm volatile (                          \
+            "mv sp, %0 \n"                  \
+            "j " #fn :: "r" (stack) :       \
+    )
+
+#endif /* __ASM_CURRENT_H */
diff --git a/xen/arch/riscv/include/asm/mm.h b/xen/arch/riscv/include/asm/mm.h
new file mode 100644
index 0000000000..e16ce66fae
--- /dev/null
+++ b/xen/arch/riscv/include/asm/mm.h
@@ -0,0 +1,9 @@ 
+#ifndef _ASM_RISCV_MM_H
+#define _ASM_RISCV_MM_H
+
+void setup_initial_pagetables(void);
+
+void enable_mmu(void);
+void cont_after_mmu_is_enabled(void);
+
+#endif /* _ASM_RISCV_MM_H */
diff --git a/xen/arch/riscv/include/asm/page-bits.h b/xen/arch/riscv/include/asm/page-bits.h
index 1801820294..4a3e33589a 100644
--- a/xen/arch/riscv/include/asm/page-bits.h
+++ b/xen/arch/riscv/include/asm/page-bits.h
@@ -4,4 +4,14 @@ 
 #define PAGE_SHIFT              12 /* 4 KiB Pages */
 #define PADDR_BITS              56 /* 44-bit PPN */
 
+#ifdef CONFIG_RISCV_64
+#define PAGETABLE_ORDER         (9)
+#else /* CONFIG_RISCV_32 */
+#define PAGETABLE_ORDER         (10)
+#endif
+
+#define PAGETABLE_ENTRIES       (1 << PAGETABLE_ORDER)
+
+#define PTE_PPN_SHIFT           10
+
 #endif /* __RISCV_PAGE_BITS_H__ */
diff --git a/xen/arch/riscv/include/asm/page.h b/xen/arch/riscv/include/asm/page.h
new file mode 100644
index 0000000000..b4fc67484b
--- /dev/null
+++ b/xen/arch/riscv/include/asm/page.h
@@ -0,0 +1,62 @@ 
+#ifndef _ASM_RISCV_PAGE_H
+#define _ASM_RISCV_PAGE_H
+
+#include <xen/const.h>
+#include <xen/types.h>
+
+#define VPN_MASK                    ((unsigned long)(PAGETABLE_ENTRIES - 1))
+
+#define XEN_PT_LEVEL_ORDER(lvl)     ((lvl) * PAGETABLE_ORDER)
+#define XEN_PT_LEVEL_SHIFT(lvl)     (XEN_PT_LEVEL_ORDER(lvl) + PAGE_SHIFT)
+#define XEN_PT_LEVEL_SIZE(lvl)      (_AT(paddr_t, 1) << XEN_PT_LEVEL_SHIFT(lvl))
+#define XEN_PT_LEVEL_MAP_MASK(lvl)  (~(XEN_PT_LEVEL_SIZE(lvl) - 1))
+#define XEN_PT_LEVEL_MASK(lvl)      (VPN_MASK << XEN_PT_LEVEL_SHIFT(lvl))
+
+#define PTE_VALID                   BIT(0, UL)
+#define PTE_READABLE                BIT(1, UL)
+#define PTE_WRITABLE                BIT(2, UL)
+#define PTE_EXECUTABLE              BIT(3, UL)
+#define PTE_USER                    BIT(4, UL)
+#define PTE_GLOBAL                  BIT(5, UL)
+#define PTE_ACCESSED                BIT(6, UL)
+#define PTE_DIRTY                   BIT(7, UL)
+#define PTE_RSW                     (BIT(8, UL) | BIT(9, UL))
+
+#define PTE_LEAF_DEFAULT            (PTE_VALID | PTE_READABLE | PTE_WRITABLE)
+#define PTE_TABLE                   (PTE_VALID)
+
+/* Calculate the offsets into the pagetables for a given VA */
+#define pt_linear_offset(lvl, va)   ((va) >> XEN_PT_LEVEL_SHIFT(lvl))
+
+#define pt_index(lvl, va) pt_linear_offset(lvl, (va) & XEN_PT_LEVEL_MASK(lvl))
+
+/* Page Table entry */
+typedef struct {
+#ifdef CONFIG_RISCV_64
+    uint64_t pte;
+#else
+    uint32_t pte;
+#endif
+} pte_t;
+
+#define pte_to_addr(x) (((paddr_t)(x) >> PTE_PPN_SHIFT) << PAGE_SHIFT)
+
+#define addr_to_pte(x) (((x) >> PAGE_SHIFT) << PTE_PPN_SHIFT)
+
+static inline pte_t paddr_to_pte(paddr_t paddr,
+                                 unsigned int permissions)
+{
+    return (pte_t) { .pte = addr_to_pte(paddr) | permissions };
+}
+
+static inline paddr_t pte_to_paddr(pte_t pte)
+{
+    return pte_to_addr(pte.pte);
+}
+
+static inline bool pte_is_valid(pte_t p)
+{
+    return p.pte & PTE_VALID;
+}
+
+#endif /* _ASM_RISCV_PAGE_H */
diff --git a/xen/arch/riscv/mm.c b/xen/arch/riscv/mm.c
new file mode 100644
index 0000000000..b13f15f75f
--- /dev/null
+++ b/xen/arch/riscv/mm.c
@@ -0,0 +1,315 @@ 
+#include <xen/compiler.h>
+#include <xen/init.h>
+#include <xen/kernel.h>
+#include <xen/pfn.h>
+
+#include <asm/early_printk.h>
+#include <asm/config.h>
+#include <asm/csr.h>
+#include <asm/current.h>
+#include <asm/mm.h>
+#include <asm/page.h>
+#include <asm/processor.h>
+
+struct mmu_desc {
+    unsigned int num_levels;
+    unsigned int pgtbl_count;
+    pte_t *next_pgtbl;
+    pte_t *pgtbl_base;
+};
+
+extern unsigned char cpu0_boot_stack[STACK_SIZE];
+
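+/*
+ * PHYS_OFFSET is the difference between Xen's load address and its
+ * link address (XEN_VIRT_START). LOAD_TO_LINK() and LINK_TO_LOAD()
+ * convert an address between the two ranges.
+ */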
+#define PHYS_OFFSET ((unsigned long)_start - XEN_VIRT_START)
+#define LOAD_TO_LINK(addr) ((addr) - PHYS_OFFSET)
+#define LINK_TO_LOAD(addr) ((addr) + PHYS_OFFSET)
+
+/*
+ * Xen is expected to be no larger than 2 MB; the check in xen.lds.S
+ * guarantees that. With PAGE_SIZE = 4 KB, at least 3 page tables
+ * (in the case of Sv39) are needed to cover 2 MB: one for each
+ * page-table level.
+ *
+ * A single L0 page table covers 2 MB
+ * (512 entries per page table * PAGE_SIZE).
+ */
+#define PGTBL_INITIAL_COUNT ((CONFIG_PAGING_LEVELS - 1) + 1)
+
+#define PGTBL_ENTRY_AMOUNT  (PAGE_SIZE / sizeof(pte_t))
+
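+/* Statically allocated page tables for the initial boot-time mappings. */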
+pte_t __section(".bss.page_aligned") __aligned(PAGE_SIZE)
+stage1_pgtbl_root[PGTBL_ENTRY_AMOUNT];
+
+pte_t __section(".bss.page_aligned") __aligned(PAGE_SIZE)
+stage1_pgtbl_nonroot[PGTBL_INITIAL_COUNT * PGTBL_ENTRY_AMOUNT];
+
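+/*
+ * HANDLE_PGTBL() resolves one level of the page-table walk for
+ * page_addr: if the entry at the current level is already valid it
+ * descends into the referenced table, otherwise it links in the next
+ * free table taken from stage1_pgtbl_nonroot.
+ * It relies on the local variables (index, pgtbl, page_addr, mmu_desc)
+ * of setup_initial_mapping().
+ */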
+#define HANDLE_PGTBL(curr_lvl_num)                                          \
+    index = pt_index(curr_lvl_num, page_addr);                              \
+    if ( pte_is_valid(pgtbl[index]) )                                       \
+    {                                                                       \
+        /* Find L{ 0-3 } table */                                           \
+        pgtbl = (pte_t *)pte_to_paddr(pgtbl[index]);                        \
+    }                                                                       \
+    else                                                                    \
+    {                                                                       \
+        /* Allocate new L{0-3} page table */                                \
+        if ( mmu_desc->pgtbl_count == PGTBL_INITIAL_COUNT )                 \
+        {                                                                   \
+            early_printk("(XEN) No initial table available\n");             \
+            /* panic(), BUG() or ASSERT() aren't ready now. */              \
+            die();                                                          \
+        }                                                                   \
+        mmu_desc->pgtbl_count++;                                            \
+        pgtbl[index] = paddr_to_pte((unsigned long)mmu_desc->next_pgtbl,    \
+                                    PTE_VALID);                             \
+        pgtbl = mmu_desc->next_pgtbl;                                       \
+        mmu_desc->next_pgtbl += PGTBL_ENTRY_AMOUNT;                         \
+    }
+
+static void __init setup_initial_mapping(struct mmu_desc *mmu_desc,
+                                         unsigned long map_start,
+                                         unsigned long map_end,
+                                         unsigned long pa_start,
+                                         unsigned int permissions)
+{
+    unsigned int index;
+    pte_t *pgtbl;
+    unsigned long page_addr;
+    pte_t pte_to_be_written;
+    unsigned long paddr;
+    unsigned int tmp_permissions;
+
+    if ( (unsigned long)_start % XEN_PT_LEVEL_SIZE(0) )
+    {
+        early_printk("(XEN) Xen should be loaded at 4k boundary\n");
+        die();
+    }
+
+    if ( map_start & ~XEN_PT_LEVEL_MAP_MASK(0) ||
+         pa_start & ~XEN_PT_LEVEL_MAP_MASK(0) )
+    {
+        early_printk("(XEN) map and pa start addresses should be aligned\n");
+        /* panic(), BUG() or ASSERT() aren't ready now. */
+        die();
+    }
+
+    for ( page_addr = map_start; page_addr < map_end;
+          page_addr += XEN_PT_LEVEL_SIZE(0) )
+    {
+        pgtbl = mmu_desc->pgtbl_base;
+
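+        /*
+         * Walk from the top level down to level 0. The switch cases
+         * intentionally fall through so that every intermediate level
+         * is handled before the leaf entry is written at level 0.
+         */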
+        switch ( mmu_desc->num_levels )
+        {
+        case 4: /* Level 3 */
+            HANDLE_PGTBL(3);
+        case 3: /* Level 2 */
+            HANDLE_PGTBL(2);
+        case 2: /* Level 1 */
+            HANDLE_PGTBL(1);
+        case 1: /* Level 0 */
+            index = pt_index(0, page_addr);
+            paddr = (page_addr - map_start) + pa_start;
+
+            tmp_permissions = permissions;
+
+            if ( is_kernel_text(LINK_TO_LOAD(page_addr)) ||
+                 is_kernel_inittext(LINK_TO_LOAD(page_addr)) )
+                tmp_permissions =
+                    PTE_EXECUTABLE | PTE_READABLE | PTE_VALID;
+
+            if ( is_kernel_rodata(LINK_TO_LOAD(page_addr)) )
+                tmp_permissions = PTE_READABLE | PTE_VALID;
+
+            pte_to_be_written = paddr_to_pte(paddr, tmp_permissions);
+
+            if ( !pte_is_valid(pgtbl[index]) )
+                pgtbl[index] = pte_to_be_written;
+            else
+            {
+                /*
+                 * Compare the current PTE with the one to be written,
+                 * ignoring the accessed (A) and dirty (D) bits.
+                 */
+                unsigned long curr_pte =
+                    pgtbl[index].pte & ~(PTE_DIRTY | PTE_ACCESSED);
+
+                pte_to_be_written.pte &= ~(PTE_DIRTY | PTE_ACCESSED);
+
+                if ( curr_pte != pte_to_be_written.pte )
+                {
+                    early_printk("PTE overridden has occurred\n");
+                    /* panic(), <asm/bug.h> aren't ready now. */
+                    die();
+                }
+            }
+        }
+    }
+#undef HANDLE_PGTBL
+}
+
+static void __init calc_pgtbl_lvls_num(struct mmu_desc *mmu_desc)
+{
+    unsigned long satp_mode = RV_STAGE1_MODE;
+
+    /* Number of page table levels */
+    switch ( satp_mode )
+    {
+    case SATP_MODE_SV32:
+        mmu_desc->num_levels = 2;
+        break;
+    case SATP_MODE_SV39:
+        mmu_desc->num_levels = 3;
+        break;
+    case SATP_MODE_SV48:
+        mmu_desc->num_levels = 4;
+        break;
+    default:
+        early_printk("(XEN) Unsupported SATP_MODE\n");
+        die();
+    }
+}
+
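+/*
+ * Check whether the requested SATP mode is supported by the CPU.
+ * satp's MODE field is WARL: writing an unsupported mode leaves satp
+ * unchanged, so the mode read back will differ from the requested one.
+ * A temporary identity (superpage) mapping of Xen's load area is
+ * installed first so that instruction fetches keep working while the
+ * MMU is briefly enabled, and is removed again before returning.
+ */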
+static bool __init check_pgtbl_mode_support(struct mmu_desc *mmu_desc,
+                                            unsigned long load_start,
+                                            unsigned long satp_mode)
+{
+    bool is_mode_supported = false;
+    unsigned int index;
+    unsigned int page_table_level = (mmu_desc->num_levels - 1);
+    unsigned long level_map_mask = XEN_PT_LEVEL_MAP_MASK(page_table_level);
+
+    unsigned long aligned_load_start = load_start & level_map_mask;
+    unsigned long aligned_page_size = XEN_PT_LEVEL_SIZE(page_table_level);
+    unsigned long xen_size = (unsigned long)(_end - _start);
+
+    if ( (load_start + xen_size) > (aligned_load_start + aligned_page_size) )
+    {
+        early_printk("please place Xen to be in range of PAGE_SIZE "
+                     "where PAGE_SIZE is XEN_PT_LEVEL_SIZE( {L3 | L2 | L1} ) "
+                     "depending on expected SATP_MODE \n"
+                     "XEN_PT_LEVEL_SIZE is defined in <asm/page.h>\n");
+        die();
+    }
+
+    index = pt_index(page_table_level, aligned_load_start);
+    stage1_pgtbl_root[index] = paddr_to_pte(aligned_load_start,
+                                            PTE_LEAF_DEFAULT | PTE_EXECUTABLE);
+
+    asm volatile ( "sfence.vma" );
+    csr_write(CSR_SATP,
+              PFN_DOWN((unsigned long)stage1_pgtbl_root) |
+              satp_mode << SATP_MODE_SHIFT);
+
+    if ( (csr_read(CSR_SATP) >> SATP_MODE_SHIFT) == satp_mode )
+        is_mode_supported = true;
+
+    csr_write(CSR_SATP, 0);
+
+    /* Clean MMU root page table */
+    stage1_pgtbl_root[index] = paddr_to_pte(0x0, 0x0);
+
+    asm volatile ( "sfence.vma" );
+
+    return is_mode_supported;
+}
+
+/*
+ * setup_initial_pagetables:
+ *
+ * Build the initial page tables for Xen:
+ *  1. Calculate the number of page-table levels.
+ *  2. Initialise the mmu description structure.
+ *  3. Check that the linker address range doesn't overlap
+ *     the load address range.
+ *  4. Map the linker address range onto the load address range
+ *     (this is a 1:1 mapping only when the linker start address
+ *     equals the load start address), with RW permissions by default.
+ *  5. Set up the proper PTE permissions for each section
+ *     (such as .text, .rodata, etc.).
+ */
+void __init setup_initial_pagetables(void)
+{
+    struct mmu_desc mmu_desc = { 0, 0, NULL, NULL };
+
+    /*
+     * Accesses to _start and _end are always PC-relative, so reading
+     * them here yields the load addresses of the start and end of Xen.
+     * LOAD_TO_LINK() must be used to obtain the corresponding linker
+     * addresses.
+     */
+    unsigned long load_start    = (unsigned long)_start;
+    unsigned long load_end      = (unsigned long)_end;
+    unsigned long linker_start  = LOAD_TO_LINK(load_start);
+    unsigned long linker_end    = LOAD_TO_LINK(load_end);
+
+    if ( (linker_start != load_start) &&
+         (linker_start <= load_end) && (load_start <= linker_end) )
+    {
+        early_printk("(XEN) linker and load address ranges overlap\n");
+        die();
+    }
+
+    calc_pgtbl_lvls_num(&mmu_desc);
+
+    if ( !check_pgtbl_mode_support(&mmu_desc, load_start, RV_STAGE1_MODE) )
+    {
+        early_printk("requested MMU mode isn't supported by CPU\n"
+                     "Please choose different in <asm/config.h>\n");
+        die();
+    }
+
+    mmu_desc.pgtbl_base = stage1_pgtbl_root;
+    mmu_desc.next_pgtbl = stage1_pgtbl_nonroot;
+
+    setup_initial_mapping(&mmu_desc,
+                          linker_start,
+                          linker_end,
+                          load_start,
+                          PTE_LEAF_DEFAULT);
+}
+
+void __init noinline enable_mmu(void)
+{
+    /*
+     * Calculate the link-time address of the mmu_is_enabled label and
+     * update CSR_STVEC with it.
+     * The MMU is configured so that linker addresses are mapped onto
+     * load addresses, so when the linker addresses are not equal to the
+     * load addresses, the first instruction fetch after the MMU is
+     * enabled faults and the trap vector transfers control to the
+     * link-time address of mmu_is_enabled.
+     * Otherwise, if the load addresses are equal to the linker
+     * addresses, the code after the mmu_is_enabled label is executed
+     * without an exception.
+     */
+    csr_write(CSR_STVEC, LOAD_TO_LINK((unsigned long)&&mmu_is_enabled));
+
+    /* Ensure page table writes precede loading the SATP */
+    asm volatile ( "sfence.vma" );
+
+    /* Enable the MMU and load the new pagetable for Xen */
+    csr_write(CSR_SATP,
+              PFN_DOWN((unsigned long)stage1_pgtbl_root) |
+              RV_STAGE1_MODE << SATP_MODE_SHIFT);
+
+    asm volatile ( ".align 2" );
+ mmu_is_enabled:
+    /*
+     * The stack should be re-initialised because:
+     * 1. The current stack pointer is a load-time address, which is a
+     *    problem when the load start address isn't equal to the linker
+     *    start address.
+     * 2. The addresses stored on the stack are load-time relative,
+     *    which is likewise a problem when the load start address isn't
+     *    equal to the linker start address.
+     *
+     * We can't return to the caller because the stack is being reset
+     * and the caller may have stashed variables on it.
+     * Jump to a brand new function instead.
+     */
+
+    switch_stack_and_jump((unsigned long)cpu0_boot_stack + STACK_SIZE,
+                          cont_after_mmu_is_enabled);
+}
+
diff --git a/xen/arch/riscv/riscv64/head.S b/xen/arch/riscv/riscv64/head.S
index 8887f0cbd4..983757e498 100644
--- a/xen/arch/riscv/riscv64/head.S
+++ b/xen/arch/riscv/riscv64/head.S
@@ -1,4 +1,5 @@ 
 #include <asm/asm.h>
+#include <asm/asm-offsets.h>
 #include <asm/riscv_encoding.h>
 
         .section .text.header, "ax", %progbits
diff --git a/xen/arch/riscv/setup.c b/xen/arch/riscv/setup.c
index 3786f337e0..315804aa87 100644
--- a/xen/arch/riscv/setup.c
+++ b/xen/arch/riscv/setup.c
@@ -2,6 +2,7 @@ 
 #include <xen/init.h>
 
 #include <asm/early_printk.h>
+#include <asm/mm.h>
 
 /* Xen stack for bringing up the first CPU. */
 unsigned char __initdata cpu0_boot_stack[STACK_SIZE]
@@ -26,3 +27,13 @@  void __init noreturn start_xen(unsigned long bootcpu_id,
 
     unreachable();
 }
+
+void __init noreturn cont_after_mmu_is_enabled(void)
+{
+    early_printk("All set up\n");
+
+    for ( ;; )
+        asm volatile ("wfi");
+
+    unreachable();
+}
diff --git a/xen/arch/riscv/xen.lds.S b/xen/arch/riscv/xen.lds.S
index 31e0d3576c..f9d89b69b9 100644
--- a/xen/arch/riscv/xen.lds.S
+++ b/xen/arch/riscv/xen.lds.S
@@ -137,6 +137,7 @@  SECTIONS
     __init_end = .;
 
     .bss : {                     /* BSS */
+        . = ALIGN(POINTER_ALIGN);
         __bss_start = .;
         *(.bss.stack_aligned)
         . = ALIGN(PAGE_SIZE);
@@ -172,3 +173,6 @@  ASSERT(IS_ALIGNED(__bss_end,        POINTER_ALIGN), "__bss_end is misaligned")
 
 ASSERT(!SIZEOF(.got),      ".got non-empty")
 ASSERT(!SIZEOF(.got.plt),  ".got.plt non-empty")
+
+ASSERT(_end - _start <= MB(2), "Xen too large for early-boot assumptions")
+