diff mbox series

[v12,5/8] xen/riscv: add minimal stuff to mm.h to build full Xen

Message ID d00b86f41ef2c7d928a28dadd8c34fb845f23d0a.1717008161.git.oleksii.kurochko@gmail.com (mailing list archive)
State New, archived
Headers show
Series *Enable build of full Xen for RISC-V | expand

Commit Message

Oleksii Kurochko May 29, 2024, 7:55 p.m. UTC
Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
---
Changes in V8-V12:
 - Nothing changed only rebase.
---
Changes in V7:
 - update argument type of maddr_to_virt() function: unsigned long -> paddr_t
 - rename argument of PFN_ORDER(): pfn -> pg.
 - add Acked-by: Jan Beulich <jbeulich@suse.com>
---
Changes in V6:
 - drop __virt_to_maddr() ( transform to macro ) and __maddr_to_virt ( rename to maddr_to_virt ).
 - parenthesize va in definition of vmap_to_mfn().
 - Code style fixes.
---
Changes in V5:
 - update the comment around "struct domain *domain;" : zero -> NULL
 - fix ident. for unsigned long val;
 - put page_to_virt() and virt_to_page() close to each other.
 - drop unnessary leading underscore
 - drop a space before the comment: /* Count of uses of this frame as its current type. */
 - drop comment about a page 'not as a shadow'. it is not necessary for RISC-V
---
Changes in V4:
 - update an argument name of PFN_ORDERN macros.
 - drop pad at the end of 'struct page_info'.
 - Change message -> subject in "Changes in V3"
 - delete duplicated macros from riscv/mm.h
 - fix identation in struct page_info
 - align comment for PGC_ macros
 - update definitions of domain_set_alloc_bitsize() and domain_clamp_alloc_bitsize()
 - drop unnessary comments.
 - s/BUG/BUG_ON("...")
 - define __virt_to_maddr, __maddr_to_virt as stubs
 - add inclusion of xen/mm-frame.h for mfn_x and others
 - include "xen/mm.h" instead of "asm/mm.h" to fix compilation issues:
	 In file included from arch/riscv/setup.c:7:
	./arch/riscv/include/asm/mm.h:60:28: error: field 'list' has incomplete type
	   60 |     struct page_list_entry list;
	      |                            ^~~~
	./arch/riscv/include/asm/mm.h:81:43: error: 'MAX_ORDER' undeclared here (not in a function)
	   81 |                 unsigned long first_dirty:MAX_ORDER + 1;
	      |                                           ^~~~~~~~~
	./arch/riscv/include/asm/mm.h:81:31: error: bit-field 'first_dirty' width not an integer constant
	   81 |                 unsigned long first_dirty:MAX_ORDER + 1;
 - Define __virt_to_mfn() and __mfn_to_virt() using maddr_to_mfn() and mfn_to_maddr().
---
Changes in V3:
 - update the commit title
 - introduce DIRECTMAP_VIRT_START.
 - drop changes related pfn_to_paddr() and paddr_to_pfn as they were remvoe in
   [PATCH v2 32/39] xen/riscv: add minimal stuff to asm/page.h to build full Xen
 - code style fixes.
 - drop get_page_nr  and put_page_nr as they don't need for time being
 - drop CONFIG_STATIC_MEMORY related things
 - code style fixes
---
Changes in V2:
 - define stub for arch_get_dma_bitsize(void)
---
 xen/arch/riscv/include/asm/mm.h | 240 ++++++++++++++++++++++++++++++++
 xen/arch/riscv/mm.c             |   2 +-
 xen/arch/riscv/setup.c          |   2 +-
 3 files changed, 242 insertions(+), 2 deletions(-)

Comments

Andrew Cooper May 30, 2024, 5:23 p.m. UTC | #1
On 29/05/2024 8:55 pm, Oleksii Kurochko wrote:
> Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
> Acked-by: Jan Beulich <jbeulich@suse.com>

This patch looks like it can go in independently?  Or does it depend on
having bitops.h working in practice?

However, one very strong suggestion...


> diff --git a/xen/arch/riscv/include/asm/mm.h b/xen/arch/riscv/include/asm/mm.h
> index 07c7a0abba..cc4a07a71c 100644
> --- a/xen/arch/riscv/include/asm/mm.h
> +++ b/xen/arch/riscv/include/asm/mm.h
> @@ -3,11 +3,246 @@
> <snip>
> +/* PDX of the first page in the frame table. */
> +extern unsigned long frametable_base_pdx;
> +
> +/* Convert between machine frame numbers and page-info structures. */
> +#define mfn_to_page(mfn)                                            \
> +    (frame_table + (mfn_to_pdx(mfn) - frametable_base_pdx))
> +#define page_to_mfn(pg)                                             \
> +    pdx_to_mfn((unsigned long)((pg) - frame_table) + frametable_base_pdx)

Do yourself a favour and not introduce frametable_base_pdx to begin with.

Every RISC-V board I can find has things starting from 0 in physical
address space, with RAM starting immediately after.

Taking the microchip board as an example, RAM actually starts at
0x8000000, which means that having frametable_base_pdx and assuming it
does get set to 0x8000 (which isn't even a certainty, given that I think
you'll need struct pages covering the PLICs), then what you are trading
off is:

* Saving 32k of virtual address space only (no need to even allocate
memory for this range of the framtable), by
* Having an extra memory load and add/sub in every page <-> mfn
conversion, which is a screaming hotpath all over Xen.

It's a terribly short-sighted tradeoff.

32k of VA space might be worth saving in a 32bit build (I personally
wouldn't - especially as there's no need to share Xen's VA space with
guests, given no PV guests on ARM/RISC-V), but it's absolutely not at
all in an a 64bit build with TB of VA space available.

Even if we do find a board with the first interesting thing in the
frametable starting sufficiently away from 0 that it might be worth
considering this slide, then it should still be Kconfig-able in a
similar way to PDX_COMPRESSION.

You don't want to be penalising everyone because of a theoretical/weird
corner case.

~Andrew
Oleksii Kurochko May 30, 2024, 6:22 p.m. UTC | #2
On Thu, 2024-05-30 at 18:23 +0100, Andrew Cooper wrote:
> On 29/05/2024 8:55 pm, Oleksii Kurochko wrote:
> > Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
> > Acked-by: Jan Beulich <jbeulich@suse.com>
> 
> This patch looks like it can go in independently?  Or does it depend
> on
> having bitops.h working in practice?
> 
> However, one very strong suggestion...
> 
> 
> > diff --git a/xen/arch/riscv/include/asm/mm.h
> > b/xen/arch/riscv/include/asm/mm.h
> > index 07c7a0abba..cc4a07a71c 100644
> > --- a/xen/arch/riscv/include/asm/mm.h
> > +++ b/xen/arch/riscv/include/asm/mm.h
> > @@ -3,11 +3,246 @@
> > <snip>
> > +/* PDX of the first page in the frame table. */
> > +extern unsigned long frametable_base_pdx;
> > +
> > +/* Convert between machine frame numbers and page-info structures.
> > */
> > +#define
> > mfn_to_page(mfn)                                            \
> > +    (frame_table + (mfn_to_pdx(mfn) - frametable_base_pdx))
> > +#define
> > page_to_mfn(pg)                                             \
> > +    pdx_to_mfn((unsigned long)((pg) - frame_table) +
> > frametable_base_pdx)
> 
> Do yourself a favour and not introduce frametable_base_pdx to begin
> with.
> 
> Every RISC-V board I can find has things starting from 0 in physical
> address space, with RAM starting immediately after.
I checked Linux kernel and grep there:
   [ok@fedora linux-aia]$ grep -Rni "memory@" arch/riscv/boot/dts/ --
   exclude "*.tmp" -I
   arch/riscv/boot/dts/starfive/jh7110-starfive-visionfive-2.dtsi:33:    
   memory@40000000 {
   arch/riscv/boot/dts/starfive/jh7100-common.dtsi:28:     memory@80000000
   {
   arch/riscv/boot/dts/microchip/mpfs-sev-kit.dts:49:      ddrc_cache:
   memory@1000000000 {
   arch/riscv/boot/dts/microchip/mpfs-m100pfsevp.dts:33:   ddrc_cache_lo:
   memory@80000000 {
   arch/riscv/boot/dts/microchip/mpfs-m100pfsevp.dts:37:   ddrc_cache_hi:
   memory@1040000000 {
   arch/riscv/boot/dts/microchip/mpfs-tysom-m.dts:34:      ddrc_cache_lo:
   memory@80000000 {
   arch/riscv/boot/dts/microchip/mpfs-tysom-m.dts:40:      ddrc_cache_hi:
   memory@1000000000 {
   arch/riscv/boot/dts/microchip/mpfs-polarberry.dts:22:   ddrc_cache_lo:
   memory@80000000 {
   arch/riscv/boot/dts/microchip/mpfs-polarberry.dts:27:   ddrc_cache_hi:
   memory@1000000000 {
   arch/riscv/boot/dts/microchip/mpfs-icicle-kit.dts:57:   ddrc_cache_lo:
   memory@80000000 {
   arch/riscv/boot/dts/microchip/mpfs-icicle-kit.dts:63:   ddrc_cache_hi:
   memory@1040000000 {
   arch/riscv/boot/dts/thead/th1520-beaglev-ahead.dts:32:  memory@0 {
   arch/riscv/boot/dts/thead/th1520-lichee-module-4a.dtsi:14:     
   memory@0 {
   arch/riscv/boot/dts/sophgo/cv1800b-milkv-duo.dts:26:    memory@80000000
   {
   arch/riscv/boot/dts/sophgo/cv1812h.dtsi:12:     memory@80000000 {
   arch/riscv/boot/dts/sifive/hifive-unmatched-a00.dts:26: memory@80000000
   {
   arch/riscv/boot/dts/sifive/hifive-unleashed-a00.dts:25: memory@80000000
   {
   arch/riscv/boot/dts/canaan/k210.dtsi:82:        sram: memory@80000000 {
   
And based on  that majority of supported by Linux kernel boards has RAM
starting not from 0 in physical address space. Am I confusing
something?

> 
> Taking the microchip board as an example, RAM actually starts at
> 0x8000000,
Today we had conversation with the guy from SiFive in xen-devel channel
and he mentioned that they are using "starfive visionfive2 and sifive
unleashed platforms." which based on the grep above has RAM not at 0
address.

Also, QEMU uses 0x8000000.

>  which means that having frametable_base_pdx and assuming it
> does get set to 0x8000 (which isn't even a certainty, given that I
> think
> you'll need struct pages covering the PLICs), then what you are
> trading
> off is:
> 
> * Saving 32k of virtual address space only (no need to even allocate
> memory for this range of the framtable), by
> * Having an extra memory load and add/sub in every page <-> mfn
> conversion, which is a screaming hotpath all over Xen.
> 
> It's a terribly short-sighted tradeoff.
> 
> 32k of VA space might be worth saving in a 32bit build (I personally
> wouldn't - especially as there's no need to share Xen's VA space with
> guests, given no PV guests on ARM/RISC-V), but it's absolutely not at
> all in an a 64bit build with TB of VA space available.
> 
> Even if we do find a board with the first interesting thing in the
> frametable starting sufficiently away from 0 that it might be worth
> considering this slide, then it should still be Kconfig-able in a
> similar way to PDX_COMPRESSION.
I found your tradeoffs reasonable, but I don't understand how it will
work if RAM does not start from 0, as the frametable address and RAM
address are linked.
I tried to look at the PDX_COMPRESSION config and couldn't find any
"slide" there. Could you please clarify this for me?
If to use this "slide" would it help to avoid the mentioned above
tradeoffs?

One more question: if we decide to go without frametable_base_pdx,
would it be sufficient to simply remove mentions of it from the code (
at least, for now )?

~ Oleksii
> 
> You don't want to be penalising everyone because of a
> theoretical/weird
> corner case.


> 
> ~Andrew
Jan Beulich May 31, 2024, 6:26 a.m. UTC | #3
On 30.05.2024 19:23, Andrew Cooper wrote:
> On 29/05/2024 8:55 pm, Oleksii Kurochko wrote:
>> Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
>> Acked-by: Jan Beulich <jbeulich@suse.com>
> 
> This patch looks like it can go in independently?  Or does it depend on
> having bitops.h working in practice?

The latter, iirc. In fact I had already tried at least twice to pull this ahead.

Jan
Oleksii Kurochko June 11, 2024, 3:25 p.m. UTC | #4
On Thu, 2024-05-30 at 18:23 +0100, Andrew Cooper wrote:
> On 29/05/2024 8:55 pm, Oleksii Kurochko wrote:
> > Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
> > Acked-by: Jan Beulich <jbeulich@suse.com>
> 
> This patch looks like it can go in independently?  Or does it depend
> on
> having bitops.h working in practice?
> 
> However, one very strong suggestion...
> 
> 
> > diff --git a/xen/arch/riscv/include/asm/mm.h
> > b/xen/arch/riscv/include/asm/mm.h
> > index 07c7a0abba..cc4a07a71c 100644
> > --- a/xen/arch/riscv/include/asm/mm.h
> > +++ b/xen/arch/riscv/include/asm/mm.h
> > @@ -3,11 +3,246 @@
> > <snip>
> > +/* PDX of the first page in the frame table. */
> > +extern unsigned long frametable_base_pdx;
> > +
> > +/* Convert between machine frame numbers and page-info structures.
> > */
> > +#define
> > mfn_to_page(mfn)                                            \
> > +    (frame_table + (mfn_to_pdx(mfn) - frametable_base_pdx))
> > +#define
> > page_to_mfn(pg)                                             \
> > +    pdx_to_mfn((unsigned long)((pg) - frame_table) +
> > frametable_base_pdx)
> 
> Do yourself a favour and not introduce frametable_base_pdx to begin
> with.

To drop frametable_base_pdx if the following changes will be enough:
   diff --git a/xen/arch/riscv/include/asm/mm.h
   b/xen/arch/riscv/include/asm/mm.h
   index cc4a07a71c..fdac7e0646 100644
   --- a/xen/arch/riscv/include/asm/mm.h
   +++ b/xen/arch/riscv/include/asm/mm.h
   @@ -107,14 +107,11 @@ struct page_info
    
    #define frame_table ((struct page_info *)FRAMETABLE_VIRT_START)
    
   -/* PDX of the first page in the frame table. */
   -extern unsigned long frametable_base_pdx;
   -
    /* Convert between machine frame numbers and page-info structures. */
    #define mfn_to_page(mfn)                                            \
   -    (frame_table + (mfn_to_pdx(mfn) - frametable_base_pdx))
   +    (frame_table + mfn - FRAMETABLE_BASE_OFFSET))
    #define page_to_mfn(pg)                                             \
   -    pdx_to_mfn((unsigned long)((pg) - frame_table) +
   frametable_base_pdx)
   +    ((unsigned long)((pg) - frame_table) + FRAMETABLE_BASE_OFFSET)
    
    static inline void *page_to_virt(const struct page_info *pg)
    {
   diff --git a/xen/arch/riscv/mm.c b/xen/arch/riscv/mm.c
   index 9c0fd80588..8f6dbdc699 100644
   --- a/xen/arch/riscv/mm.c
   +++ b/xen/arch/riscv/mm.c
   @@ -15,7 +15,7 @@
    #include <asm/page.h>
    #include <asm/processor.h>
    
   -unsigned long __ro_after_init frametable_base_pdx;
    unsigned long __ro_after_init frametable_virt_end;
    
    struct mmu_desc {


> 
> Every RISC-V board I can find has things starting from 0 in physical
> address space, with RAM starting immediately after.
> 
> Taking the microchip board as an example, RAM actually starts at
> 0x8000000, which means that having frametable_base_pdx and assuming
> it
> does get set to 0x8000 (which isn't even a certainty, given that I
> think
> you'll need struct pages covering the PLICs), then what you are
> trading
> off is:
> 
> * Saving 32k of virtual address space only (no need to even allocate
> memory for this range of the framtable), by
> * Having an extra memory load and add/sub in every page <-> mfn
> conversion, which is a screaming hotpath all over Xen.
Are you referring here to `mfn_to_pdx()` used in `mfn_to_page()` and
`pdx_to_mfn()` in `page_to_mfn()`?

My expectation was that when CONFIG_PDX_COMPRESSION is disabled then
this macros doesn't do anything:
/* pdx<->pfn == identity */
#define pdx_to_pfn(x) (x)
#define pfn_to_pdx(x) (x)


> 
> It's a terribly short-sighted tradeoff.
> 
> 32k of VA space might be worth saving in a 32bit build (I personally
> wouldn't - especially as there's no need to share Xen's VA space with
> guests, given no PV guests on ARM/RISC-V), but it's absolutely not at
> all in an a 64bit build with TB of VA space available.
Why 32k? If RAM is started at 0x8000_0000 then we have to cover 0x80000
entries, and the size of it is 0x8_0000 = 524288 * 64 = 33554432, so it
is 32 Mb.
Am I confusing something?

P.S.: Should I map this start 32 mb? Or it is enough to have a slide (
FRAMETABLE_BASE_OFFSET ).

~ Oleksii

> 
> Even if we do find a board with the first interesting thing in the
> frametable starting sufficiently away from 0 that it might be worth
> considering this slide, then it should still be Kconfig-able in a
> similar way to PDX_COMPRESSION.
> 
> You don't want to be penalising everyone because of a
> theoretical/weird
> corner case.
> 
> ~Andrew
Andrew Cooper June 11, 2024, 3:53 p.m. UTC | #5
On 30/05/2024 7:22 pm, Oleksii K. wrote:
> On Thu, 2024-05-30 at 18:23 +0100, Andrew Cooper wrote:
>> On 29/05/2024 8:55 pm, Oleksii Kurochko wrote:
>>> Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
>>> Acked-by: Jan Beulich <jbeulich@suse.com>
>> This patch looks like it can go in independently?  Or does it depend
>> on
>> having bitops.h working in practice?
>>
>> However, one very strong suggestion...
>>
>>
>>> diff --git a/xen/arch/riscv/include/asm/mm.h
>>> b/xen/arch/riscv/include/asm/mm.h
>>> index 07c7a0abba..cc4a07a71c 100644
>>> --- a/xen/arch/riscv/include/asm/mm.h
>>> +++ b/xen/arch/riscv/include/asm/mm.h
>>> @@ -3,11 +3,246 @@
>>> <snip>
>>> +/* PDX of the first page in the frame table. */
>>> +extern unsigned long frametable_base_pdx;
>>> +
>>> +/* Convert between machine frame numbers and page-info structures.
>>> */
>>> +#define
>>> mfn_to_page(mfn)                                            \
>>> +    (frame_table + (mfn_to_pdx(mfn) - frametable_base_pdx))
>>> +#define
>>> page_to_mfn(pg)                                             \
>>> +    pdx_to_mfn((unsigned long)((pg) - frame_table) +
>>> frametable_base_pdx)
>> Do yourself a favour and not introduce frametable_base_pdx to begin
>> with.
>>
>> Every RISC-V board I can find has things starting from 0 in physical
>> address space, with RAM starting immediately after.
> I checked Linux kernel and grep there:
>    [ok@fedora linux-aia]$ grep -Rni "memory@" arch/riscv/boot/dts/ --
>    exclude "*.tmp" -I
>    arch/riscv/boot/dts/starfive/jh7110-starfive-visionfive-2.dtsi:33:    
>    memory@40000000 {
>    arch/riscv/boot/dts/starfive/jh7100-common.dtsi:28:     memory@80000000
>    {
>    arch/riscv/boot/dts/microchip/mpfs-sev-kit.dts:49:      ddrc_cache:
>    memory@1000000000 {
>    arch/riscv/boot/dts/microchip/mpfs-m100pfsevp.dts:33:   ddrc_cache_lo:
>    memory@80000000 {
>    arch/riscv/boot/dts/microchip/mpfs-m100pfsevp.dts:37:   ddrc_cache_hi:
>    memory@1040000000 {
>    arch/riscv/boot/dts/microchip/mpfs-tysom-m.dts:34:      ddrc_cache_lo:
>    memory@80000000 {
>    arch/riscv/boot/dts/microchip/mpfs-tysom-m.dts:40:      ddrc_cache_hi:
>    memory@1000000000 {
>    arch/riscv/boot/dts/microchip/mpfs-polarberry.dts:22:   ddrc_cache_lo:
>    memory@80000000 {
>    arch/riscv/boot/dts/microchip/mpfs-polarberry.dts:27:   ddrc_cache_hi:
>    memory@1000000000 {
>    arch/riscv/boot/dts/microchip/mpfs-icicle-kit.dts:57:   ddrc_cache_lo:
>    memory@80000000 {
>    arch/riscv/boot/dts/microchip/mpfs-icicle-kit.dts:63:   ddrc_cache_hi:
>    memory@1040000000 {
>    arch/riscv/boot/dts/thead/th1520-beaglev-ahead.dts:32:  memory@0 {
>    arch/riscv/boot/dts/thead/th1520-lichee-module-4a.dtsi:14:     
>    memory@0 {
>    arch/riscv/boot/dts/sophgo/cv1800b-milkv-duo.dts:26:    memory@80000000
>    {
>    arch/riscv/boot/dts/sophgo/cv1812h.dtsi:12:     memory@80000000 {
>    arch/riscv/boot/dts/sifive/hifive-unmatched-a00.dts:26: memory@80000000
>    {
>    arch/riscv/boot/dts/sifive/hifive-unleashed-a00.dts:25: memory@80000000
>    {
>    arch/riscv/boot/dts/canaan/k210.dtsi:82:        sram: memory@80000000 {
>    
> And based on  that majority of supported by Linux kernel boards has RAM
> starting not from 0 in physical address space. Am I confusing
> something?
>
>> Taking the microchip board as an example, RAM actually starts at
>> 0x8000000,
> Today we had conversation with the guy from SiFive in xen-devel channel
> and he mentioned that they are using "starfive visionfive2 and sifive
> unleashed platforms." which based on the grep above has RAM not at 0
> address.
>
> Also, QEMU uses 0x8000000.
>
>>  which means that having frametable_base_pdx and assuming it
>> does get set to 0x8000 (which isn't even a certainty, given that I
>> think
>> you'll need struct pages covering the PLICs), then what you are
>> trading
>> off is:
>>
>> * Saving 32k of virtual address space only (no need to even allocate
>> memory for this range of the framtable), by
>> * Having an extra memory load and add/sub in every page <-> mfn
>> conversion, which is a screaming hotpath all over Xen.
>>
>> It's a terribly short-sighted tradeoff.
>>
>> 32k of VA space might be worth saving in a 32bit build (I personally
>> wouldn't - especially as there's no need to share Xen's VA space with
>> guests, given no PV guests on ARM/RISC-V), but it's absolutely not at
>> all in an a 64bit build with TB of VA space available.
>>
>> Even if we do find a board with the first interesting thing in the
>> frametable starting sufficiently away from 0 that it might be worth
>> considering this slide, then it should still be Kconfig-able in a
>> similar way to PDX_COMPRESSION.
> I found your tradeoffs reasonable, but I don't understand how it will
> work if RAM does not start from 0, as the frametable address and RAM
> address are linked.
> I tried to look at the PDX_COMPRESSION config and couldn't find any
> "slide" there. Could you please clarify this for me?
> If to use this "slide" would it help to avoid the mentioned above
> tradeoffs?
>
> One more question: if we decide to go without frametable_base_pdx,
> would it be sufficient to simply remove mentions of it from the code (
> at least, for now )?

There is a relationship between system/host physical addresses (what Xen
called maddr/mfn), and the frametable.  The frametable has one entry per
mfn.

In the most simple case, there's a 1:1 relationship.  i.e. frametable[0]
= maddr(0), frametable[1] = maddr(4k) etc.  This is very simple, and
very easy to calculate (page_to_mfn()/mfn_to_page()).

The frametable is one big array.  It starts at a compile-time fixed
address, and needs to be long enough to cover everything interesting in
memory.  Therefore it potentially takes a large amount of virtual
address space.

However, only interesting maddrs need to have data in the frametable, so
it's fine for the backing RAM to be sparsely allocated/mapped in the
frametable virtual addresses.

For 64bit, that's really all you need, because there's always far more
virtual address space than physical RAM in the system, even when you're
looking at TB-scale giant servers.


For 32bit, virtual address space is a limited resource.  (Also to an
extent, 64bit x86 with PV guests because we give 98% of the virtual
address space to the guest kernel to use.)

There are two tricks to reduce the virtual address space used, but they
both cost performance in fastpaths.

1) PDX Compression.

PDX compression makes a non-linear mfn <-> maddr mapping.  This is for a
usecase where you've got multiple RAM banks which are separated by a
large distance (and evenly spaced), then you can "compress" a single
range of 0's out of the middle of the system/host physical address.

The cost is that all page <-> mfn conversions need to read two masks and
a shift-count from variables in memory, to split/shift/recombine the
address bits.

2) A slide, which is frametable_base_pdx in this context.

When there's a big gap between 0 and the start of something interesting,
you could chop out that range by just subtracting base_pdx.  What
qualifies as "big" is subjective, but Qemu starting at 128M certainly
does not qualify as big enough to warrant frametable_base_pdx.

This is less expensive that PDX compression.  It only adds one memory
read to the fastpath, but it also doesn't save as much virtual address
space as PDX compression.


When virtual address space is a major constraint (32 bit builds), both
of these techniques are worth doing.  But when there's no constraint on
virtual address space (64 bit builds), there's no reason to use either;
and the performance will definitely improve as a result.

~Andrew
Oleksii Kurochko June 11, 2024, 6:23 p.m. UTC | #6
On Tue, 2024-06-11 at 16:53 +0100, Andrew Cooper wrote:
> On 30/05/2024 7:22 pm, Oleksii K. wrote:
> > On Thu, 2024-05-30 at 18:23 +0100, Andrew Cooper wrote:
> > > On 29/05/2024 8:55 pm, Oleksii Kurochko wrote:
> > > > Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
> > > > Acked-by: Jan Beulich <jbeulich@suse.com>
> > > This patch looks like it can go in independently?  Or does it
> > > depend
> > > on
> > > having bitops.h working in practice?
> > > 
> > > However, one very strong suggestion...
> > > 
> > > 
> > > > diff --git a/xen/arch/riscv/include/asm/mm.h
> > > > b/xen/arch/riscv/include/asm/mm.h
> > > > index 07c7a0abba..cc4a07a71c 100644
> > > > --- a/xen/arch/riscv/include/asm/mm.h
> > > > +++ b/xen/arch/riscv/include/asm/mm.h
> > > > @@ -3,11 +3,246 @@
> > > > <snip>
> > > > +/* PDX of the first page in the frame table. */
> > > > +extern unsigned long frametable_base_pdx;
> > > > +
> > > > +/* Convert between machine frame numbers and page-info
> > > > structures.
> > > > */
> > > > +#define
> > > > mfn_to_page(mfn)                                            \
> > > > +    (frame_table + (mfn_to_pdx(mfn) - frametable_base_pdx))
> > > > +#define
> > > > page_to_mfn(pg)                                             \
> > > > +    pdx_to_mfn((unsigned long)((pg) - frame_table) +
> > > > frametable_base_pdx)
> > > Do yourself a favour and not introduce frametable_base_pdx to
> > > begin
> > > with.
> > > 
> > > Every RISC-V board I can find has things starting from 0 in
> > > physical
> > > address space, with RAM starting immediately after.
> > I checked Linux kernel and grep there:
> >    [ok@fedora linux-aia]$ grep -Rni "memory@" arch/riscv/boot/dts/
> > --
> >    exclude "*.tmp" -I
> >    arch/riscv/boot/dts/starfive/jh7110-starfive-visionfive-
> > 2.dtsi:33:    
> >    memory@40000000 {
> >    arch/riscv/boot/dts/starfive/jh7100-common.dtsi:28:    
> > memory@80000000
> >    {
> >    arch/riscv/boot/dts/microchip/mpfs-sev-kit.dts:49:     
> > ddrc_cache:
> >    memory@1000000000 {
> >    arch/riscv/boot/dts/microchip/mpfs-m100pfsevp.dts:33:  
> > ddrc_cache_lo:
> >    memory@80000000 {
> >    arch/riscv/boot/dts/microchip/mpfs-m100pfsevp.dts:37:  
> > ddrc_cache_hi:
> >    memory@1040000000 {
> >    arch/riscv/boot/dts/microchip/mpfs-tysom-m.dts:34:     
> > ddrc_cache_lo:
> >    memory@80000000 {
> >    arch/riscv/boot/dts/microchip/mpfs-tysom-m.dts:40:     
> > ddrc_cache_hi:
> >    memory@1000000000 {
> >    arch/riscv/boot/dts/microchip/mpfs-polarberry.dts:22:  
> > ddrc_cache_lo:
> >    memory@80000000 {
> >    arch/riscv/boot/dts/microchip/mpfs-polarberry.dts:27:  
> > ddrc_cache_hi:
> >    memory@1000000000 {
> >    arch/riscv/boot/dts/microchip/mpfs-icicle-kit.dts:57:  
> > ddrc_cache_lo:
> >    memory@80000000 {
> >    arch/riscv/boot/dts/microchip/mpfs-icicle-kit.dts:63:  
> > ddrc_cache_hi:
> >    memory@1040000000 {
> >    arch/riscv/boot/dts/thead/th1520-beaglev-ahead.dts:32:  memory@0
> > {
> >    arch/riscv/boot/dts/thead/th1520-lichee-module-4a.dtsi:14:     
> >    memory@0 {
> >    arch/riscv/boot/dts/sophgo/cv1800b-milkv-duo.dts:26:   
> > memory@80000000
> >    {
> >    arch/riscv/boot/dts/sophgo/cv1812h.dtsi:12:     memory@80000000
> > {
> >    arch/riscv/boot/dts/sifive/hifive-unmatched-a00.dts:26:
> > memory@80000000
> >    {
> >    arch/riscv/boot/dts/sifive/hifive-unleashed-a00.dts:25:
> > memory@80000000
> >    {
> >    arch/riscv/boot/dts/canaan/k210.dtsi:82:        sram:
> > memory@80000000 {
> >    
> > And based on  that majority of supported by Linux kernel boards has
> > RAM
> > starting not from 0 in physical address space. Am I confusing
> > something?
> > 
> > > Taking the microchip board as an example, RAM actually starts at
> > > 0x8000000,
> > Today we had conversation with the guy from SiFive in xen-devel
> > channel
> > and he mentioned that they are using "starfive visionfive2 and
> > sifive
> > unleashed platforms." which based on the grep above has RAM not at
> > 0
> > address.
> > 
> > Also, QEMU uses 0x8000000.
> > 
> > >  which means that having frametable_base_pdx and assuming it
> > > does get set to 0x8000 (which isn't even a certainty, given that
> > > I
> > > think
> > > you'll need struct pages covering the PLICs), then what you are
> > > trading
> > > off is:
> > > 
> > > * Saving 32k of virtual address space only (no need to even
> > > allocate
> > > memory for this range of the framtable), by
> > > * Having an extra memory load and add/sub in every page <-> mfn
> > > conversion, which is a screaming hotpath all over Xen.
> > > 
> > > It's a terribly short-sighted tradeoff.
> > > 
> > > 32k of VA space might be worth saving in a 32bit build (I
> > > personally
> > > wouldn't - especially as there's no need to share Xen's VA space
> > > with
> > > guests, given no PV guests on ARM/RISC-V), but it's absolutely
> > > not at
> > > all in an a 64bit build with TB of VA space available.
> > > 
> > > Even if we do find a board with the first interesting thing in
> > > the
> > > frametable starting sufficiently away from 0 that it might be
> > > worth
> > > considering this slide, then it should still be Kconfig-able in a
> > > similar way to PDX_COMPRESSION.
> > I found your tradeoffs reasonable, but I don't understand how it
> > will
> > work if RAM does not start from 0, as the frametable address and
> > RAM
> > address are linked.
> > I tried to look at the PDX_COMPRESSION config and couldn't find any
> > "slide" there. Could you please clarify this for me?
> > If to use this "slide" would it help to avoid the mentioned above
> > tradeoffs?
> > 
> > One more question: if we decide to go without frametable_base_pdx,
> > would it be sufficient to simply remove mentions of it from the
> > code (
> > at least, for now )?
> 
> There is a relationship between system/host physical addresses (what
> Xen
> called maddr/mfn), and the frametable.  The frametable has one entry
> per
> mfn.
> 
> In the most simple case, there's a 1:1 relationship.  i.e.
> frametable[0]
> = maddr(0), frametable[1] = maddr(4k) etc.  This is very simple, and
> very easy to calculate (page_to_mfn()/mfn_to_page()).
> 
> The frametable is one big array.  It starts at a compile-time fixed
> address, and needs to be long enough to cover everything interesting
> in
> memory.  Therefore it potentially takes a large amount of virtual
> address space.
> 
> However, only interesting maddrs need to have data in the frametable,
> so
> it's fine for the backing RAM to be sparsely allocated/mapped in the
> frametable virtual addresses.
> 
> For 64bit, that's really all you need, because there's always far
> more
> virtual address space than physical RAM in the system, even when
> you're
> looking at TB-scale giant servers.
> 
> 
> For 32bit, virtual address space is a limited resource.  (Also to an
> extent, 64bit x86 with PV guests because we give 98% of the virtual
> address space to the guest kernel to use.)
> 
> There are two tricks to reduce the virtual address space used, but
> they
> both cost performance in fastpaths.
> 
> 1) PDX Compression.
> 
> PDX compression makes a non-linear mfn <-> maddr mapping.  This is
> for a
> usecase where you've got multiple RAM banks which are separated by a
> large distance (and evenly spaced), then you can "compress" a single
> range of 0's out of the middle of the system/host physical address.
> 
> The cost is that all page <-> mfn conversions need to read two masks
> and
> a shift-count from variables in memory, to split/shift/recombine the
> address bits.
> 
> 2) A slide, which is frametable_base_pdx in this context.
> 
> When there's a big gap between 0 and the start of something
> interesting,
> you could chop out that range by just subtracting base_pdx.  What
> qualifies as "big" is subjective, but Qemu starting at 128M certainly
> does not qualify as big enough to warrant frametable_base_pdx.
> 
> This is less expensive that PDX compression.  It only adds one memory
> read to the fastpath, but it also doesn't save as much virtual
> address
> space as PDX compression.
> 
> 
> When virtual address space is a major constraint (32 bit builds),
> both
> of these techniques are worth doing.  But when there's no constraint
> on
> virtual address space (64 bit builds), there's no reason to use
> either;
> and the performance will definitely improve as a result.

Thanks for such good explanation.

For RISC-V we have PDX config disabled as I haven't seen multiple RAM
banks at boards which has hypervisor extension. Thereby mfn_to_pdx()
and pdx_to_mfn() do nothing. The same for frametable_base_pdx, in case
of PDX disabled, it just an offset ( or a slide ).

IIUUC, you meant that it make sense to map frametable not to the
address of where RAM is started, but to 0, so then we don't need this
+-frametable_base_pdx. The price for that is loosing of VA space for
the range from 0 to RAM start address.

Right now, we are trying to support 3 boards with the following RAM
address:
1. 0x8000_0000 - QEMU and SiFive board
2. 0x40_0000_0000 - Microchip board

So if we mapping frametable to 0 ( not RAM start ) we will loose:
1. 0x8000_0 ( amount of page entries to cover range [0, 0x8000_0000] )
* 64 ( size of struct page_info ) = 32 MB
2. 0x40_0000_0 ( amount of page entries to cover range [0,
0x40_0000_0000] ) * 64 ( size of struct page_info ) = 4 Gb

In terms of available virtual address space for RV-64 we can consider
both options as acceptable.

[OPTION 1] If we accepting of loosing 4 Gb of VA then we could
implement mfn_to_page() and page_to_mfn() in the following way:
```
   diff --git a/xen/arch/riscv/include/asm/mm.h
   b/xen/arch/riscv/include/asm/mm.h
   index cc4a07a71c..fdac7e0646 100644
   --- a/xen/arch/riscv/include/asm/mm.h
   +++ b/xen/arch/riscv/include/asm/mm.h
   @@ -107,14 +107,11 @@ struct page_info
   
    #define frame_table ((struct page_info *)FRAMETABLE_VIRT_START)
   
   -/* PDX of the first page in the frame table. */
   -extern unsigned long frametable_base_pdx;
   -
    /* Convert between machine frame numbers and page-info structures.
*/
    #define mfn_to_page(mfn)                                          
\
   -    (frame_table + (mfn_to_pdx(mfn) - frametable_base_pdx))
   +    (frame_table + mfn))
    #define page_to_mfn(pg)                                           
\
   -    pdx_to_mfn((unsigned long)((pg) - frame_table) +
   frametable_base_pdx)
   +    ((unsigned long)((pg) - frame_table))
   
    static inline void *page_to_virt(const struct page_info *pg)
    {
   diff --git a/xen/arch/riscv/mm.c b/xen/arch/riscv/mm.c
   index 9c0fd80588..8f6dbdc699 100644
   --- a/xen/arch/riscv/mm.c
   +++ b/xen/arch/riscv/mm.c
   @@ -15,7 +15,7 @@
    #include <asm/page.h>
    #include <asm/processor.h>
   
   -unsigned long __ro_after_init frametable_base_pdx;
    unsigned long __ro_after_init frametable_virt_end;
   
    struct mmu_desc {
```

[OPTION 2] If we are not accepting of loosing 4 Gb of VA I think that
there is no sense to rework that in the following way:
```
   diff --git a/xen/arch/riscv/include/asm/mm.h
   b/xen/arch/riscv/include/asm/mm.h
   index cc4a07a71c..fdac7e0646 100644
   --- a/xen/arch/riscv/include/asm/mm.h
   +++ b/xen/arch/riscv/include/asm/mm.h
   @@ -107,14 +107,11 @@ struct page_info
   
    #define frame_table ((struct page_info *)FRAMETABLE_VIRT_START)
   
   -/* PDX of the first page in the frame table. */
   -extern unsigned long frametable_base_pdx;
   -
    /* Convert between machine frame numbers and page-info structures.
*/
    #define mfn_to_page(mfn)                                          
\
   -    (frame_table + (mfn_to_pdx(mfn) - frametable_base_pdx))
   +    (frame_table + mfn - FRAMETABLE_BASE_OFFSET))
    #define page_to_mfn(pg)                                           
\
   -    pdx_to_mfn((unsigned long)((pg) - frame_table) +
   frametable_base_pdx)
   +    ((unsigned long)((pg) - frame_table) + FRAMETABLE_BASE_OFFSET)
   
    static inline void *page_to_virt(const struct page_info *pg)
    {
   diff --git a/xen/arch/riscv/mm.c b/xen/arch/riscv/mm.c
   index 9c0fd80588..8f6dbdc699 100644
   --- a/xen/arch/riscv/mm.c
   +++ b/xen/arch/riscv/mm.c
   @@ -15,7 +15,7 @@
    #include <asm/page.h>
    #include <asm/processor.h>
   
   -unsigned long __ro_after_init frametable_base_pdx;
    unsigned long __ro_after_init frametable_virt_end;
   
    struct mmu_desc {
```

And I am not sure that there is any sense in option 2 as basically it
the same to have the following macros definition with PDX disabled:
```
   /* Convert between machine frame numbers and page-info structures.
   */
   #define mfn_to_page(mfn)                                           
   \
       (frame_table + (mfn_to_pdx(mfn) - frametable_base_pdx))
   #define page_to_mfn(pg)                                            
   \
       pdx_to_mfn((unsigned long)((pg) - frame_table) +
   frametable_base_pdx)
```
The only sense of option 2 is the case when FRAMETABLE_BASE_OFFSET is
equal to 0, so the compiler will generate the code without additional
sub/add of FRAMETABLE_BASE_OFFSET.

Could you please clarify if my understanding is correct?

Should we still change the definition of mfn_to_page() and
page_to_mfn() with the fact that PDX is disabled? If yes, OPTION_1 is
okay or I am missing something?

Thanks in advance.

~ Oleksii
Andrew Cooper June 14, 2024, 9:47 a.m. UTC | #7
On 11/06/2024 7:23 pm, Oleksii K. wrote:
> On Tue, 2024-06-11 at 16:53 +0100, Andrew Cooper wrote:
>> On 30/05/2024 7:22 pm, Oleksii K. wrote:
>>> On Thu, 2024-05-30 at 18:23 +0100, Andrew Cooper wrote:
>>>> On 29/05/2024 8:55 pm, Oleksii Kurochko wrote:
>>>>> Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
>>>>> Acked-by: Jan Beulich <jbeulich@suse.com>
>>>> This patch looks like it can go in independently?  Or does it
>>>> depend
>>>> on
>>>> having bitops.h working in practice?
>>>>
>>>> However, one very strong suggestion...
>>>>
>>>>
>>>>> diff --git a/xen/arch/riscv/include/asm/mm.h
>>>>> b/xen/arch/riscv/include/asm/mm.h
>>>>> index 07c7a0abba..cc4a07a71c 100644
>>>>> --- a/xen/arch/riscv/include/asm/mm.h
>>>>> +++ b/xen/arch/riscv/include/asm/mm.h
>>>>> @@ -3,11 +3,246 @@
>>>>> <snip>
>>>>> +/* PDX of the first page in the frame table. */
>>>>> +extern unsigned long frametable_base_pdx;
>>>>> +
>>>>> +/* Convert between machine frame numbers and page-info
>>>>> structures.
>>>>> */
>>>>> +#define
>>>>> mfn_to_page(mfn)                                            \
>>>>> +    (frame_table + (mfn_to_pdx(mfn) - frametable_base_pdx))
>>>>> +#define
>>>>> page_to_mfn(pg)                                             \
>>>>> +    pdx_to_mfn((unsigned long)((pg) - frame_table) +
>>>>> frametable_base_pdx)
>>>> Do yourself a favour and not introduce frametable_base_pdx to
>>>> begin
>>>> with.
>>>>
>>>> Every RISC-V board I can find has things starting from 0 in
>>>> physical
>>>> address space, with RAM starting immediately after.
>>> I checked Linux kernel and grep there:
>>>    [ok@fedora linux-aia]$ grep -Rni "memory@" arch/riscv/boot/dts/
>>> --
>>>    exclude "*.tmp" -I
>>>    arch/riscv/boot/dts/starfive/jh7110-starfive-visionfive-
>>> 2.dtsi:33:    
>>>    memory@40000000 {
>>>    arch/riscv/boot/dts/starfive/jh7100-common.dtsi:28:    
>>> memory@80000000
>>>    {
>>>    arch/riscv/boot/dts/microchip/mpfs-sev-kit.dts:49:     
>>> ddrc_cache:
>>>    memory@1000000000 {
>>>    arch/riscv/boot/dts/microchip/mpfs-m100pfsevp.dts:33:  
>>> ddrc_cache_lo:
>>>    memory@80000000 {
>>>    arch/riscv/boot/dts/microchip/mpfs-m100pfsevp.dts:37:  
>>> ddrc_cache_hi:
>>>    memory@1040000000 {
>>>    arch/riscv/boot/dts/microchip/mpfs-tysom-m.dts:34:     
>>> ddrc_cache_lo:
>>>    memory@80000000 {
>>>    arch/riscv/boot/dts/microchip/mpfs-tysom-m.dts:40:     
>>> ddrc_cache_hi:
>>>    memory@1000000000 {
>>>    arch/riscv/boot/dts/microchip/mpfs-polarberry.dts:22:  
>>> ddrc_cache_lo:
>>>    memory@80000000 {
>>>    arch/riscv/boot/dts/microchip/mpfs-polarberry.dts:27:  
>>> ddrc_cache_hi:
>>>    memory@1000000000 {
>>>    arch/riscv/boot/dts/microchip/mpfs-icicle-kit.dts:57:  
>>> ddrc_cache_lo:
>>>    memory@80000000 {
>>>    arch/riscv/boot/dts/microchip/mpfs-icicle-kit.dts:63:  
>>> ddrc_cache_hi:
>>>    memory@1040000000 {
>>>    arch/riscv/boot/dts/thead/th1520-beaglev-ahead.dts:32:  memory@0
>>> {
>>>    arch/riscv/boot/dts/thead/th1520-lichee-module-4a.dtsi:14:     
>>>    memory@0 {
>>>    arch/riscv/boot/dts/sophgo/cv1800b-milkv-duo.dts:26:   
>>> memory@80000000
>>>    {
>>>    arch/riscv/boot/dts/sophgo/cv1812h.dtsi:12:     memory@80000000
>>> {
>>>    arch/riscv/boot/dts/sifive/hifive-unmatched-a00.dts:26:
>>> memory@80000000
>>>    {
>>>    arch/riscv/boot/dts/sifive/hifive-unleashed-a00.dts:25:
>>> memory@80000000
>>>    {
>>>    arch/riscv/boot/dts/canaan/k210.dtsi:82:        sram:
>>> memory@80000000 {
>>>    
>>> And based on  that majority of supported by Linux kernel boards has
>>> RAM
>>> starting not from 0 in physical address space. Am I confusing
>>> something?
>>>
>>>> Taking the microchip board as an example, RAM actually starts at
>>>> 0x8000000,
>>> Today we had conversation with the guy from SiFive in xen-devel
>>> channel
>>> and he mentioned that they are using "starfive visionfive2 and
>>> sifive
>>> unleashed platforms." which based on the grep above has RAM not at
>>> 0
>>> address.
>>>
>>> Also, QEMU uses 0x8000000.
>>>
>>>>  which means that having frametable_base_pdx and assuming it
>>>> does get set to 0x8000 (which isn't even a certainty, given that
>>>> I
>>>> think
>>>> you'll need struct pages covering the PLICs), then what you are
>>>> trading
>>>> off is:
>>>>
>>>> * Saving 32k of virtual address space only (no need to even
>>>> allocate
>>>> memory for this range of the framtable), by
>>>> * Having an extra memory load and add/sub in every page <-> mfn
>>>> conversion, which is a screaming hotpath all over Xen.
>>>>
>>>> It's a terribly short-sighted tradeoff.
>>>>
>>>> 32k of VA space might be worth saving in a 32bit build (I
>>>> personally
>>>> wouldn't - especially as there's no need to share Xen's VA space
>>>> with
>>>> guests, given no PV guests on ARM/RISC-V), but it's absolutely
>>>> not at
>>>> all in an a 64bit build with TB of VA space available.
>>>>
>>>> Even if we do find a board with the first interesting thing in
>>>> the
>>>> frametable starting sufficiently away from 0 that it might be
>>>> worth
>>>> considering this slide, then it should still be Kconfig-able in a
>>>> similar way to PDX_COMPRESSION.
>>> I found your tradeoffs reasonable, but I don't understand how it
>>> will
>>> work if RAM does not start from 0, as the frametable address and
>>> RAM
>>> address are linked.
>>> I tried to look at the PDX_COMPRESSION config and couldn't find any
>>> "slide" there. Could you please clarify this for me?
>>> If to use this "slide" would it help to avoid the mentioned above
>>> tradeoffs?
>>>
>>> One more question: if we decide to go without frametable_base_pdx,
>>> would it be sufficient to simply remove mentions of it from the
>>> code (
>>> at least, for now )?
>> There is a relationship between system/host physical addresses (what
>> Xen
>> called maddr/mfn), and the frametable.  The frametable has one entry
>> per
>> mfn.
>>
>> In the most simple case, there's a 1:1 relationship.  i.e.
>> frametable[0]
>> = maddr(0), frametable[1] = maddr(4k) etc.  This is very simple, and
>> very easy to calculate (page_to_mfn()/mfn_to_page()).
>>
>> The frametable is one big array.  It starts at a compile-time fixed
>> address, and needs to be long enough to cover everything interesting
>> in
>> memory.  Therefore it potentially takes a large amount of virtual
>> address space.
>>
>> However, only interesting maddrs need to have data in the frametable,
>> so
>> it's fine for the backing RAM to be sparsely allocated/mapped in the
>> frametable virtual addresses.
>>
>> For 64bit, that's really all you need, because there's always far
>> more
>> virtual address space than physical RAM in the system, even when
>> you're
>> looking at TB-scale giant servers.
>>
>>
>> For 32bit, virtual address space is a limited resource.  (Also to an
>> extent, 64bit x86 with PV guests because we give 98% of the virtual
>> address space to the guest kernel to use.)
>>
>> There are two tricks to reduce the virtual address space used, but
>> they
>> both cost performance in fastpaths.
>>
>> 1) PDX Compression.
>>
>> PDX compression makes a non-linear mfn <-> maddr mapping.  This is
>> for a
>> usecase where you've got multiple RAM banks which are separated by a
>> large distance (and evenly spaced), then you can "compress" a single
>> range of 0's out of the middle of the system/host physical address.
>>
>> The cost is that all page <-> mfn conversions need to read two masks
>> and
>> a shift-count from variables in memory, to split/shift/recombine the
>> address bits.
>>
>> 2) A slide, which is frametable_base_pdx in this context.
>>
>> When there's a big gap between 0 and the start of something
>> interesting,
>> you could chop out that range by just subtracting base_pdx.  What
>> qualifies as "big" is subjective, but Qemu starting at 128M certainly
>> does not qualify as big enough to warrant frametable_base_pdx.
>>
>> This is less expensive that PDX compression.  It only adds one memory
>> read to the fastpath, but it also doesn't save as much virtual
>> address
>> space as PDX compression.
>>
>>
>> When virtual address space is a major constraint (32 bit builds),
>> both
>> of these techniques are worth doing.  But when there's no constraint
>> on
>> virtual address space (64 bit builds), there's no reason to use
>> either;
>> and the performance will definitely improve as a result.
> Thanks for such good explanation.
>
> For RISC-V we have PDX config disabled as I haven't seen multiple RAM
> banks at boards which has hypervisor extension. Thereby mfn_to_pdx()
> and pdx_to_mfn() do nothing. The same for frametable_base_pdx, in case
> of PDX disabled, it just an offset ( or a slide ).
>
> IIUUC, you meant that it make sense to map frametable not to the
> address of where RAM is started, but to 0, so then we don't need this
> +-frametable_base_pdx. The price for that is loosing of VA space for
> the range from 0 to RAM start address.
>
> Right now, we are trying to support 3 boards with the following RAM
> address:
> 1. 0x8000_0000 - QEMU and SiFive board
> 2. 0x40_0000_0000 - Microchip board
>
> So if we mapping frametable to 0 ( not RAM start ) we will loose:
> 1. 0x8000_0 ( amount of page entries to cover range [0, 0x8000_0000] )
> * 64 ( size of struct page_info ) = 32 MB
> 2. 0x40_0000_0 ( amount of page entries to cover range [0,
> 0x40_0000_0000] ) * 64 ( size of struct page_info ) = 4 Gb
>
> In terms of available virtual address space for RV-64 we can consider
> both options as acceptable.

For Qemu and SiFive, 32M is definitely not worth worrying about.

I personally wouldn't worry about Microchip either.  That's 4G of 1T VA
space (assuming Sv39).

> [OPTION 1] If we accepting of loosing 4 Gb of VA then we could
> implement mfn_to_page() and page_to_mfn() in the following way:
> ```
>    diff --git a/xen/arch/riscv/include/asm/mm.h
>    b/xen/arch/riscv/include/asm/mm.h
>    index cc4a07a71c..fdac7e0646 100644
>    --- a/xen/arch/riscv/include/asm/mm.h
>    +++ b/xen/arch/riscv/include/asm/mm.h
>    @@ -107,14 +107,11 @@ struct page_info
>    
>     #define frame_table ((struct page_info *)FRAMETABLE_VIRT_START)
>    
>    -/* PDX of the first page in the frame table. */
>    -extern unsigned long frametable_base_pdx;
>    -
>     /* Convert between machine frame numbers and page-info structures.
> */
>     #define mfn_to_page(mfn)                                          
> \
>    -    (frame_table + (mfn_to_pdx(mfn) - frametable_base_pdx))
>    +    (frame_table + mfn))
>     #define page_to_mfn(pg)                                           
> \
>    -    pdx_to_mfn((unsigned long)((pg) - frame_table) +
>    frametable_base_pdx)
>    +    ((unsigned long)((pg) - frame_table))
>    
>     static inline void *page_to_virt(const struct page_info *pg)
>     {
>    diff --git a/xen/arch/riscv/mm.c b/xen/arch/riscv/mm.c
>    index 9c0fd80588..8f6dbdc699 100644
>    --- a/xen/arch/riscv/mm.c
>    +++ b/xen/arch/riscv/mm.c
>    @@ -15,7 +15,7 @@
>     #include <asm/page.h>
>     #include <asm/processor.h>
>    
>    -unsigned long __ro_after_init frametable_base_pdx;
>     unsigned long __ro_after_init frametable_virt_end;
>    
>     struct mmu_desc {
> ```

I firmly recommend option 1, especially at this point.

If specific boards really have a problem with losing 4G of VA space,
then option 2 can be added easily at a later point.

That said, I'd think carefully about doing option 2.  Even subtracting a
constant - which is far better than subtracting a variable - is still
extra overhead in fastpaths.  Option 2 needs careful consideration on a
board-by-board case.

~Andrew
Oleksii Kurochko June 18, 2024, 8:21 a.m. UTC | #8
On Fri, 2024-06-14 at 10:47 +0100, Andrew Cooper wrote:
> On 11/06/2024 7:23 pm, Oleksii K. wrote:
> > On Tue, 2024-06-11 at 16:53 +0100, Andrew Cooper wrote:
> > > On 30/05/2024 7:22 pm, Oleksii K. wrote:
> > > > On Thu, 2024-05-30 at 18:23 +0100, Andrew Cooper wrote:
> > > > > On 29/05/2024 8:55 pm, Oleksii Kurochko wrote:
> > > > > > Signed-off-by: Oleksii Kurochko
> > > > > > <oleksii.kurochko@gmail.com>
> > > > > > Acked-by: Jan Beulich <jbeulich@suse.com>
> > > > > This patch looks like it can go in independently?  Or does it
> > > > > depend
> > > > > on
> > > > > having bitops.h working in practice?
> > > > > 
> > > > > However, one very strong suggestion...
> > > > > 
> > > > > 
> > > > > > diff --git a/xen/arch/riscv/include/asm/mm.h
> > > > > > b/xen/arch/riscv/include/asm/mm.h
> > > > > > index 07c7a0abba..cc4a07a71c 100644
> > > > > > --- a/xen/arch/riscv/include/asm/mm.h
> > > > > > +++ b/xen/arch/riscv/include/asm/mm.h
> > > > > > @@ -3,11 +3,246 @@
> > > > > > <snip>
> > > > > > +/* PDX of the first page in the frame table. */
> > > > > > +extern unsigned long frametable_base_pdx;
> > > > > > +
> > > > > > +/* Convert between machine frame numbers and page-info
> > > > > > structures.
> > > > > > */
> > > > > > +#define
> > > > > > mfn_to_page(mfn)                                           
> > > > > > \
> > > > > > +    (frame_table + (mfn_to_pdx(mfn) -
> > > > > > frametable_base_pdx))
> > > > > > +#define
> > > > > > page_to_mfn(pg)                                            
> > > > > > \
> > > > > > +    pdx_to_mfn((unsigned long)((pg) - frame_table) +
> > > > > > frametable_base_pdx)
> > > > > Do yourself a favour and not introduce frametable_base_pdx to
> > > > > begin
> > > > > with.
> > > > > 
> > > > > Every RISC-V board I can find has things starting from 0 in
> > > > > physical
> > > > > address space, with RAM starting immediately after.
> > > > I checked Linux kernel and grep there:
> > > >    [ok@fedora linux-aia]$ grep -Rni "memory@"
> > > > arch/riscv/boot/dts/
> > > > --
> > > >    exclude "*.tmp" -I
> > > >    arch/riscv/boot/dts/starfive/jh7110-starfive-visionfive-
> > > > 2.dtsi:33:    
> > > >    memory@40000000 {
> > > >    arch/riscv/boot/dts/starfive/jh7100-common.dtsi:28:    
> > > > memory@80000000
> > > >    {
> > > >    arch/riscv/boot/dts/microchip/mpfs-sev-kit.dts:49:     
> > > > ddrc_cache:
> > > >    memory@1000000000 {
> > > >    arch/riscv/boot/dts/microchip/mpfs-m100pfsevp.dts:33:  
> > > > ddrc_cache_lo:
> > > >    memory@80000000 {
> > > >    arch/riscv/boot/dts/microchip/mpfs-m100pfsevp.dts:37:  
> > > > ddrc_cache_hi:
> > > >    memory@1040000000 {
> > > >    arch/riscv/boot/dts/microchip/mpfs-tysom-m.dts:34:     
> > > > ddrc_cache_lo:
> > > >    memory@80000000 {
> > > >    arch/riscv/boot/dts/microchip/mpfs-tysom-m.dts:40:     
> > > > ddrc_cache_hi:
> > > >    memory@1000000000 {
> > > >    arch/riscv/boot/dts/microchip/mpfs-polarberry.dts:22:  
> > > > ddrc_cache_lo:
> > > >    memory@80000000 {
> > > >    arch/riscv/boot/dts/microchip/mpfs-polarberry.dts:27:  
> > > > ddrc_cache_hi:
> > > >    memory@1000000000 {
> > > >    arch/riscv/boot/dts/microchip/mpfs-icicle-kit.dts:57:  
> > > > ddrc_cache_lo:
> > > >    memory@80000000 {
> > > >    arch/riscv/boot/dts/microchip/mpfs-icicle-kit.dts:63:  
> > > > ddrc_cache_hi:
> > > >    memory@1040000000 {
> > > >    arch/riscv/boot/dts/thead/th1520-beaglev-ahead.dts:32: 
> > > > memory@0
> > > > {
> > > >    arch/riscv/boot/dts/thead/th1520-lichee-module-
> > > > 4a.dtsi:14:     
> > > >    memory@0 {
> > > >    arch/riscv/boot/dts/sophgo/cv1800b-milkv-duo.dts:26:   
> > > > memory@80000000
> > > >    {
> > > >    arch/riscv/boot/dts/sophgo/cv1812h.dtsi:12:    
> > > > memory@80000000
> > > > {
> > > >    arch/riscv/boot/dts/sifive/hifive-unmatched-a00.dts:26:
> > > > memory@80000000
> > > >    {
> > > >    arch/riscv/boot/dts/sifive/hifive-unleashed-a00.dts:25:
> > > > memory@80000000
> > > >    {
> > > >    arch/riscv/boot/dts/canaan/k210.dtsi:82:        sram:
> > > > memory@80000000 {
> > > >    
> > > > And based on  that majority of supported by Linux kernel boards
> > > > has
> > > > RAM
> > > > starting not from 0 in physical address space. Am I confusing
> > > > something?
> > > > 
> > > > > Taking the microchip board as an example, RAM actually starts
> > > > > at
> > > > > 0x8000000,
> > > > Today we had conversation with the guy from SiFive in xen-devel
> > > > channel
> > > > and he mentioned that they are using "starfive visionfive2 and
> > > > sifive
> > > > unleashed platforms." which based on the grep above has RAM not
> > > > at
> > > > 0
> > > > address.
> > > > 
> > > > Also, QEMU uses 0x8000000.
> > > > 
> > > > >  which means that having frametable_base_pdx and assuming it
> > > > > does get set to 0x8000 (which isn't even a certainty, given
> > > > > that
> > > > > I
> > > > > think
> > > > > you'll need struct pages covering the PLICs), then what you
> > > > > are
> > > > > trading
> > > > > off is:
> > > > > 
> > > > > * Saving 32k of virtual address space only (no need to even
> > > > > allocate
> > > > > memory for this range of the framtable), by
> > > > > * Having an extra memory load and add/sub in every page <->
> > > > > mfn
> > > > > conversion, which is a screaming hotpath all over Xen.
> > > > > 
> > > > > It's a terribly short-sighted tradeoff.
> > > > > 
> > > > > 32k of VA space might be worth saving in a 32bit build (I
> > > > > personally
> > > > > wouldn't - especially as there's no need to share Xen's VA
> > > > > space
> > > > > with
> > > > > guests, given no PV guests on ARM/RISC-V), but it's
> > > > > absolutely
> > > > > not at
> > > > > all in an a 64bit build with TB of VA space available.
> > > > > 
> > > > > Even if we do find a board with the first interesting thing
> > > > > in
> > > > > the
> > > > > frametable starting sufficiently away from 0 that it might be
> > > > > worth
> > > > > considering this slide, then it should still be Kconfig-able
> > > > > in a
> > > > > similar way to PDX_COMPRESSION.
> > > > I found your tradeoffs reasonable, but I don't understand how
> > > > it
> > > > will
> > > > work if RAM does not start from 0, as the frametable address
> > > > and
> > > > RAM
> > > > address are linked.
> > > > I tried to look at the PDX_COMPRESSION config and couldn't find
> > > > any
> > > > "slide" there. Could you please clarify this for me?
> > > > If to use this "slide" would it help to avoid the mentioned
> > > > above
> > > > tradeoffs?
> > > > 
> > > > One more question: if we decide to go without
> > > > frametable_base_pdx,
> > > > would it be sufficient to simply remove mentions of it from the
> > > > code (
> > > > at least, for now )?
> > > There is a relationship between system/host physical addresses
> > > (what
> > > Xen
> > > called maddr/mfn), and the frametable.  The frametable has one
> > > entry
> > > per
> > > mfn.
> > > 
> > > In the most simple case, there's a 1:1 relationship.  i.e.
> > > frametable[0]
> > > = maddr(0), frametable[1] = maddr(4k) etc.  This is very simple,
> > > and
> > > very easy to calculate (page_to_mfn()/mfn_to_page()).
> > > 
> > > The frametable is one big array.  It starts at a compile-time
> > > fixed
> > > address, and needs to be long enough to cover everything
> > > interesting
> > > in
> > > memory.  Therefore it potentially takes a large amount of virtual
> > > address space.
> > > 
> > > However, only interesting maddrs need to have data in the
> > > frametable,
> > > so
> > > it's fine for the backing RAM to be sparsely allocated/mapped in
> > > the
> > > frametable virtual addresses.
> > > 
> > > For 64bit, that's really all you need, because there's always far
> > > more
> > > virtual address space than physical RAM in the system, even when
> > > you're
> > > looking at TB-scale giant servers.
> > > 
> > > 
> > > For 32bit, virtual address space is a limited resource.  (Also to
> > > an
> > > extent, 64bit x86 with PV guests because we give 98% of the
> > > virtual
> > > address space to the guest kernel to use.)
> > > 
> > > There are two tricks to reduce the virtual address space used,
> > > but
> > > they
> > > both cost performance in fastpaths.
> > > 
> > > 1) PDX Compression.
> > > 
> > > PDX compression makes a non-linear mfn <-> maddr mapping.  This
> > > is
> > > for a
> > > usecase where you've got multiple RAM banks which are separated
> > > by a
> > > large distance (and evenly spaced), then you can "compress" a
> > > single
> > > range of 0's out of the middle of the system/host physical
> > > address.
> > > 
> > > The cost is that all page <-> mfn conversions need to read two
> > > masks
> > > and
> > > a shift-count from variables in memory, to split/shift/recombine
> > > the
> > > address bits.
> > > 
> > > 2) A slide, which is frametable_base_pdx in this context.
> > > 
> > > When there's a big gap between 0 and the start of something
> > > interesting,
> > > you could chop out that range by just subtracting base_pdx.  What
> > > qualifies as "big" is subjective, but Qemu starting at 128M
> > > certainly
> > > does not qualify as big enough to warrant frametable_base_pdx.
> > > 
> > > This is less expensive that PDX compression.  It only adds one
> > > memory
> > > read to the fastpath, but it also doesn't save as much virtual
> > > address
> > > space as PDX compression.
> > > 
> > > 
> > > When virtual address space is a major constraint (32 bit builds),
> > > both
> > > of these techniques are worth doing.  But when there's no
> > > constraint
> > > on
> > > virtual address space (64 bit builds), there's no reason to use
> > > either;
> > > and the performance will definitely improve as a result.
> > Thanks for such good explanation.
> > 
> > For RISC-V we have PDX config disabled as I haven't seen multiple
> > RAM
> > banks at boards which has hypervisor extension. Thereby
> > mfn_to_pdx()
> > and pdx_to_mfn() do nothing. The same for frametable_base_pdx, in
> > case
> > of PDX disabled, it just an offset ( or a slide ).
> > 
> > IIUUC, you meant that it make sense to map frametable not to the
> > address of where RAM is started, but to 0, so then we don't need
> > this
> > +-frametable_base_pdx. The price for that is loosing of VA space
> > for
> > the range from 0 to RAM start address.
> > 
> > Right now, we are trying to support 3 boards with the following RAM
> > address:
> > 1. 0x8000_0000 - QEMU and SiFive board
> > 2. 0x40_0000_0000 - Microchip board
> > 
> > So if we mapping frametable to 0 ( not RAM start ) we will loose:
> > 1. 0x8000_0 ( amount of page entries to cover range [0,
> > 0x8000_0000] )
> > * 64 ( size of struct page_info ) = 32 MB
> > 2. 0x40_0000_0 ( amount of page entries to cover range [0,
> > 0x40_0000_0000] ) * 64 ( size of struct page_info ) = 4 Gb
> > 
> > In terms of available virtual address space for RV-64 we can
> > consider
> > both options as acceptable.
> 
> For Qemu and SiFive, 32M is definitely not worth worrying about.
> 
> I personally wouldn't worry about Microchip either.  That's 4G of 1T
> VA
> space (assuming Sv39).
> 
> > [OPTION 1] If we accepting of loosing 4 Gb of VA then we could
> > implement mfn_to_page() and page_to_mfn() in the following way:
> > ```
> >    diff --git a/xen/arch/riscv/include/asm/mm.h
> >    b/xen/arch/riscv/include/asm/mm.h
> >    index cc4a07a71c..fdac7e0646 100644
> >    --- a/xen/arch/riscv/include/asm/mm.h
> >    +++ b/xen/arch/riscv/include/asm/mm.h
> >    @@ -107,14 +107,11 @@ struct page_info
> >    
> >     #define frame_table ((struct page_info *)FRAMETABLE_VIRT_START)
> >    
> >    -/* PDX of the first page in the frame table. */
> >    -extern unsigned long frametable_base_pdx;
> >    -
> >     /* Convert between machine frame numbers and page-info
> > structures.
> > */
> >     #define
> > mfn_to_page(mfn)                                          
> > \
> >    -    (frame_table + (mfn_to_pdx(mfn) - frametable_base_pdx))
> >    +    (frame_table + mfn))
> >     #define
> > page_to_mfn(pg)                                           
> > \
> >    -    pdx_to_mfn((unsigned long)((pg) - frame_table) +
> >    frametable_base_pdx)
> >    +    ((unsigned long)((pg) - frame_table))
> >    
> >     static inline void *page_to_virt(const struct page_info *pg)
> >     {
> >    diff --git a/xen/arch/riscv/mm.c b/xen/arch/riscv/mm.c
> >    index 9c0fd80588..8f6dbdc699 100644
> >    --- a/xen/arch/riscv/mm.c
> >    +++ b/xen/arch/riscv/mm.c
> >    @@ -15,7 +15,7 @@
> >     #include <asm/page.h>
> >     #include <asm/processor.h>
> >    
> >    -unsigned long __ro_after_init frametable_base_pdx;
> >     unsigned long __ro_after_init frametable_virt_end;
> >    
> >     struct mmu_desc {
> > ```
> 
> I firmly recommend option 1, especially at this point.
Jan, as you gave your Acked before, don't you mind to define
mfn_to_page() and page_to_mfn as mentioned above ( Option 1 )?

~ Oleksi

> 
> If specific boards really have a problem with losing 4G of VA space,
> then option 2 can be added easily at a later point.
> 
> That said, I'd think carefully about doing option 2.  Even
> subtracting a
> constant - which is far better than subtracting a variable - is still
> extra overhead in fastpaths.  Option 2 needs careful consideration on
> a
> board-by-board case.
> 
> ~Andrew
Jan Beulich June 18, 2024, 8:26 a.m. UTC | #9
On 18.06.2024 10:21, Oleksii K. wrote:
> On Fri, 2024-06-14 at 10:47 +0100, Andrew Cooper wrote:
>> On 11/06/2024 7:23 pm, Oleksii K. wrote:
>>> [OPTION 1] If we accepting of loosing 4 Gb of VA then we could
>>> implement mfn_to_page() and page_to_mfn() in the following way:
>>> ```
>>>    diff --git a/xen/arch/riscv/include/asm/mm.h
>>>    b/xen/arch/riscv/include/asm/mm.h
>>>    index cc4a07a71c..fdac7e0646 100644
>>>    --- a/xen/arch/riscv/include/asm/mm.h
>>>    +++ b/xen/arch/riscv/include/asm/mm.h
>>>    @@ -107,14 +107,11 @@ struct page_info
>>>    
>>>     #define frame_table ((struct page_info *)FRAMETABLE_VIRT_START)
>>>    
>>>    -/* PDX of the first page in the frame table. */
>>>    -extern unsigned long frametable_base_pdx;
>>>    -
>>>     /* Convert between machine frame numbers and page-info
>>> structures.
>>> */
>>>     #define
>>> mfn_to_page(mfn)                                          
>>> \
>>>    -    (frame_table + (mfn_to_pdx(mfn) - frametable_base_pdx))
>>>    +    (frame_table + mfn))
>>>     #define
>>> page_to_mfn(pg)                                           
>>> \
>>>    -    pdx_to_mfn((unsigned long)((pg) - frame_table) +
>>>    frametable_base_pdx)
>>>    +    ((unsigned long)((pg) - frame_table))
>>>    
>>>     static inline void *page_to_virt(const struct page_info *pg)
>>>     {
>>>    diff --git a/xen/arch/riscv/mm.c b/xen/arch/riscv/mm.c
>>>    index 9c0fd80588..8f6dbdc699 100644
>>>    --- a/xen/arch/riscv/mm.c
>>>    +++ b/xen/arch/riscv/mm.c
>>>    @@ -15,7 +15,7 @@
>>>     #include <asm/page.h>
>>>     #include <asm/processor.h>
>>>    
>>>    -unsigned long __ro_after_init frametable_base_pdx;
>>>     unsigned long __ro_after_init frametable_virt_end;
>>>    
>>>     struct mmu_desc {
>>> ```
>>
>> I firmly recommend option 1, especially at this point.
> Jan, as you gave your Acked before, don't you mind to define
> mfn_to_page() and page_to_mfn as mentioned above ( Option 1 )?

No, I don't mind. And please feel free to keep my ack if no other significant
changes are made.

Jan
diff mbox series

Patch

diff --git a/xen/arch/riscv/include/asm/mm.h b/xen/arch/riscv/include/asm/mm.h
index 07c7a0abba..cc4a07a71c 100644
--- a/xen/arch/riscv/include/asm/mm.h
+++ b/xen/arch/riscv/include/asm/mm.h
@@ -3,11 +3,246 @@ 
 #ifndef _ASM_RISCV_MM_H
 #define _ASM_RISCV_MM_H
 
+#include <public/xen.h>
+#include <xen/bug.h>
+#include <xen/mm-frame.h>
+#include <xen/pdx.h>
+#include <xen/types.h>
+
 #include <asm/page-bits.h>
 
 #define pfn_to_paddr(pfn) ((paddr_t)(pfn) << PAGE_SHIFT)
 #define paddr_to_pfn(pa)  ((unsigned long)((pa) >> PAGE_SHIFT))
 
+#define paddr_to_pdx(pa)    mfn_to_pdx(maddr_to_mfn(pa))
+#define gfn_to_gaddr(gfn)   pfn_to_paddr(gfn_x(gfn))
+#define gaddr_to_gfn(ga)    _gfn(paddr_to_pfn(ga))
+#define mfn_to_maddr(mfn)   pfn_to_paddr(mfn_x(mfn))
+#define maddr_to_mfn(ma)    _mfn(paddr_to_pfn(ma))
+#define vmap_to_mfn(va)     maddr_to_mfn(virt_to_maddr((vaddr_t)(va)))
+#define vmap_to_page(va)    mfn_to_page(vmap_to_mfn(va))
+
+static inline void *maddr_to_virt(paddr_t ma)
+{
+    BUG_ON("unimplemented");
+    return NULL;
+}
+
+#define virt_to_maddr(va) ({ BUG_ON("unimplemented"); 0; })
+
+/* Convert between Xen-heap virtual addresses and machine frame numbers. */
+#define __virt_to_mfn(va)  mfn_x(maddr_to_mfn(virt_to_maddr(va)))
+#define __mfn_to_virt(mfn) maddr_to_virt(mfn_to_maddr(_mfn(mfn)))
+
+/*
+ * We define non-underscored wrappers for above conversion functions.
+ * These are overriden in various source files while underscored version
+ * remain intact.
+ */
+#define virt_to_mfn(va)     __virt_to_mfn(va)
+#define mfn_to_virt(mfn)    __mfn_to_virt(mfn)
+
+struct page_info
+{
+    /* Each frame can be threaded onto a doubly-linked list. */
+    struct page_list_entry list;
+
+    /* Reference count and various PGC_xxx flags and fields. */
+    unsigned long count_info;
+
+    /* Context-dependent fields follow... */
+    union {
+        /* Page is in use: ((count_info & PGC_count_mask) != 0). */
+        struct {
+            /* Type reference count and various PGT_xxx flags and fields. */
+            unsigned long type_info;
+        } inuse;
+
+        /* Page is on a free list: ((count_info & PGC_count_mask) == 0). */
+        union {
+            struct {
+                /*
+                 * Index of the first *possibly* unscrubbed page in the buddy.
+                 * One more bit than maximum possible order to accommodate
+                 * INVALID_DIRTY_IDX.
+                 */
+#define INVALID_DIRTY_IDX ((1UL << (MAX_ORDER + 1)) - 1)
+                unsigned long first_dirty:MAX_ORDER + 1;
+
+                /* Do TLBs need flushing for safety before next page use? */
+                bool need_tlbflush:1;
+
+#define BUDDY_NOT_SCRUBBING    0
+#define BUDDY_SCRUBBING        1
+#define BUDDY_SCRUB_ABORT      2
+                unsigned long scrub_state:2;
+            };
+
+            unsigned long val;
+        } free;
+    } u;
+
+    union {
+        /* Page is in use */
+        struct {
+            /* Owner of this page (NULL if page is anonymous). */
+            struct domain *domain;
+        } inuse;
+
+        /* Page is on a free list. */
+        struct {
+            /* Order-size of the free chunk this page is the head of. */
+            unsigned int order;
+        } free;
+    } v;
+
+    union {
+        /*
+         * Timestamp from 'TLB clock', used to avoid extra safety flushes.
+         * Only valid for: a) free pages, and b) pages with zero type count
+         */
+        uint32_t tlbflush_timestamp;
+    };
+};
+
+#define frame_table ((struct page_info *)FRAMETABLE_VIRT_START)
+
+/* PDX of the first page in the frame table. */
+extern unsigned long frametable_base_pdx;
+
+/* Convert between machine frame numbers and page-info structures. */
+#define mfn_to_page(mfn)                                            \
+    (frame_table + (mfn_to_pdx(mfn) - frametable_base_pdx))
+#define page_to_mfn(pg)                                             \
+    pdx_to_mfn((unsigned long)((pg) - frame_table) + frametable_base_pdx)
+
+static inline void *page_to_virt(const struct page_info *pg)
+{
+    return mfn_to_virt(mfn_x(page_to_mfn(pg)));
+}
+
+/* Convert between Xen-heap virtual addresses and page-info structures. */
+static inline struct page_info *virt_to_page(const void *v)
+{
+    BUG_ON("unimplemented");
+    return NULL;
+}
+
+/*
+ * Common code requires get_page_type and put_page_type.
+ * We don't care about typecounts so we just do the minimum to make it
+ * happy.
+ */
+static inline int get_page_type(struct page_info *page, unsigned long type)
+{
+    return 1;
+}
+
+static inline void put_page_type(struct page_info *page)
+{
+}
+
+static inline void put_page_and_type(struct page_info *page)
+{
+    put_page_type(page);
+    put_page(page);
+}
+
+/*
+ * RISC-V does not have an M2P, but common code expects a handful of
+ * M2P-related defines and functions. Provide dummy versions of these.
+ */
+#define INVALID_M2P_ENTRY        (~0UL)
+#define SHARED_M2P_ENTRY         (~0UL - 1UL)
+#define SHARED_M2P(_e)           ((_e) == SHARED_M2P_ENTRY)
+
+#define set_gpfn_from_mfn(mfn, pfn) do { (void)(mfn), (void)(pfn); } while (0)
+#define mfn_to_gfn(d, mfn) ((void)(d), _gfn(mfn_x(mfn)))
+
+#define PDX_GROUP_SHIFT (PAGE_SHIFT + VPN_BITS)
+
+static inline unsigned long domain_get_maximum_gpfn(struct domain *d)
+{
+    BUG_ON("unimplemented");
+    return 0;
+}
+
+static inline long arch_memory_op(int op, XEN_GUEST_HANDLE_PARAM(void) arg)
+{
+    BUG_ON("unimplemented");
+    return 0;
+}
+
+/*
+ * On RISCV, all the RAM is currently direct mapped in Xen.
+ * Hence return always true.
+ */
+static inline bool arch_mfns_in_directmap(unsigned long mfn, unsigned long nr)
+{
+    return true;
+}
+
+#define PG_shift(idx)   (BITS_PER_LONG - (idx))
+#define PG_mask(x, idx) (x ## UL << PG_shift(idx))
+
+#define PGT_none          PG_mask(0, 1)  /* no special uses of this page   */
+#define PGT_writable_page PG_mask(1, 1)  /* has writable mappings?         */
+#define PGT_type_mask     PG_mask(1, 1)  /* Bits 31 or 63.                 */
+
+/* Count of uses of this frame as its current type. */
+#define PGT_count_width   PG_shift(2)
+#define PGT_count_mask    ((1UL << PGT_count_width) - 1)
+
+/*
+ * Page needs to be scrubbed. Since this bit can only be set on a page that is
+ * free (i.e. in PGC_state_free) we can reuse PGC_allocated bit.
+ */
+#define _PGC_need_scrub   _PGC_allocated
+#define PGC_need_scrub    PGC_allocated
+
+/* Cleared when the owning guest 'frees' this page. */
+#define _PGC_allocated    PG_shift(1)
+#define PGC_allocated     PG_mask(1, 1)
+/* Page is Xen heap? */
+#define _PGC_xen_heap     PG_shift(2)
+#define PGC_xen_heap      PG_mask(1, 2)
+/* Page is broken? */
+#define _PGC_broken       PG_shift(7)
+#define PGC_broken        PG_mask(1, 7)
+/* Mutually-exclusive page states: { inuse, offlining, offlined, free }. */
+#define PGC_state         PG_mask(3, 9)
+#define PGC_state_inuse   PG_mask(0, 9)
+#define PGC_state_offlining PG_mask(1, 9)
+#define PGC_state_offlined PG_mask(2, 9)
+#define PGC_state_free    PG_mask(3, 9)
+#define page_state_is(pg, st) (((pg)->count_info&PGC_state) == PGC_state_##st)
+
+/* Count of references to this frame. */
+#define PGC_count_width   PG_shift(9)
+#define PGC_count_mask    ((1UL << PGC_count_width) - 1)
+
+#define _PGC_extra        PG_shift(10)
+#define PGC_extra         PG_mask(1, 10)
+
+#define is_xen_heap_page(page) ((page)->count_info & PGC_xen_heap)
+#define is_xen_heap_mfn(mfn) \
+    (mfn_valid(mfn) && is_xen_heap_page(mfn_to_page(mfn)))
+
+#define is_xen_fixed_mfn(mfn)                                   \
+    ((mfn_to_maddr(mfn) >= virt_to_maddr((vaddr_t)_start)) &&   \
+     (mfn_to_maddr(mfn) <= virt_to_maddr((vaddr_t)_end - 1)))
+
+#define page_get_owner(p)    (p)->v.inuse.domain
+#define page_set_owner(p, d) ((p)->v.inuse.domain = (d))
+
+/* TODO: implement */
+#define mfn_valid(mfn) ({ (void)(mfn); 0; })
+
+#define domain_set_alloc_bitsize(d) ((void)(d))
+#define domain_clamp_alloc_bitsize(d, b) ((void)(d), (b))
+
+#define PFN_ORDER(pg) ((pg)->v.free.order)
+
 extern unsigned char cpu0_boot_stack[];
 
 void setup_initial_pagetables(void);
@@ -20,4 +255,9 @@  unsigned long calc_phys_offset(void);
 
 void turn_on_mmu(unsigned long ra);
 
+static inline unsigned int arch_get_dma_bitsize(void)
+{
+    return 32; /* TODO */
+}
+
 #endif /* _ASM_RISCV_MM_H */
diff --git a/xen/arch/riscv/mm.c b/xen/arch/riscv/mm.c
index 053f043a3d..fe3a43be20 100644
--- a/xen/arch/riscv/mm.c
+++ b/xen/arch/riscv/mm.c
@@ -5,12 +5,12 @@ 
 #include <xen/init.h>
 #include <xen/kernel.h>
 #include <xen/macros.h>
+#include <xen/mm.h>
 #include <xen/pfn.h>
 
 #include <asm/early_printk.h>
 #include <asm/csr.h>
 #include <asm/current.h>
-#include <asm/mm.h>
 #include <asm/page.h>
 #include <asm/processor.h>
 
diff --git a/xen/arch/riscv/setup.c b/xen/arch/riscv/setup.c
index 6593f601c1..98a94c4c48 100644
--- a/xen/arch/riscv/setup.c
+++ b/xen/arch/riscv/setup.c
@@ -2,9 +2,9 @@ 
 
 #include <xen/compile.h>
 #include <xen/init.h>
+#include <xen/mm.h>
 
 #include <asm/early_printk.h>
-#include <asm/mm.h>
 
 /* Xen stack for bringing up the first CPU. */
 unsigned char __initdata cpu0_boot_stack[STACK_SIZE]