Message ID | d00b86f41ef2c7d928a28dadd8c34fb845f23d0a.1717008161.git.oleksii.kurochko@gmail.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | *Enable build of full Xen for RISC-V | expand |
On 29/05/2024 8:55 pm, Oleksii Kurochko wrote: > Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com> > Acked-by: Jan Beulich <jbeulich@suse.com> This patch looks like it can go in independently? Or does it depend on having bitops.h working in practice? However, one very strong suggestion... > diff --git a/xen/arch/riscv/include/asm/mm.h b/xen/arch/riscv/include/asm/mm.h > index 07c7a0abba..cc4a07a71c 100644 > --- a/xen/arch/riscv/include/asm/mm.h > +++ b/xen/arch/riscv/include/asm/mm.h > @@ -3,11 +3,246 @@ > <snip> > +/* PDX of the first page in the frame table. */ > +extern unsigned long frametable_base_pdx; > + > +/* Convert between machine frame numbers and page-info structures. */ > +#define mfn_to_page(mfn) \ > + (frame_table + (mfn_to_pdx(mfn) - frametable_base_pdx)) > +#define page_to_mfn(pg) \ > + pdx_to_mfn((unsigned long)((pg) - frame_table) + frametable_base_pdx) Do yourself a favour and not introduce frametable_base_pdx to begin with. Every RISC-V board I can find has things starting from 0 in physical address space, with RAM starting immediately after. Taking the microchip board as an example, RAM actually starts at 0x8000000, which means that having frametable_base_pdx and assuming it does get set to 0x8000 (which isn't even a certainty, given that I think you'll need struct pages covering the PLICs), then what you are trading off is: * Saving 32k of virtual address space only (no need to even allocate memory for this range of the framtable), by * Having an extra memory load and add/sub in every page <-> mfn conversion, which is a screaming hotpath all over Xen. It's a terribly short-sighted tradeoff. 32k of VA space might be worth saving in a 32bit build (I personally wouldn't - especially as there's no need to share Xen's VA space with guests, given no PV guests on ARM/RISC-V), but it's absolutely not at all in an a 64bit build with TB of VA space available. Even if we do find a board with the first interesting thing in the frametable starting sufficiently away from 0 that it might be worth considering this slide, then it should still be Kconfig-able in a similar way to PDX_COMPRESSION. You don't want to be penalising everyone because of a theoretical/weird corner case. ~Andrew
On Thu, 2024-05-30 at 18:23 +0100, Andrew Cooper wrote: > On 29/05/2024 8:55 pm, Oleksii Kurochko wrote: > > Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com> > > Acked-by: Jan Beulich <jbeulich@suse.com> > > This patch looks like it can go in independently? Or does it depend > on > having bitops.h working in practice? > > However, one very strong suggestion... > > > > diff --git a/xen/arch/riscv/include/asm/mm.h > > b/xen/arch/riscv/include/asm/mm.h > > index 07c7a0abba..cc4a07a71c 100644 > > --- a/xen/arch/riscv/include/asm/mm.h > > +++ b/xen/arch/riscv/include/asm/mm.h > > @@ -3,11 +3,246 @@ > > <snip> > > +/* PDX of the first page in the frame table. */ > > +extern unsigned long frametable_base_pdx; > > + > > +/* Convert between machine frame numbers and page-info structures. > > */ > > +#define > > mfn_to_page(mfn) \ > > + (frame_table + (mfn_to_pdx(mfn) - frametable_base_pdx)) > > +#define > > page_to_mfn(pg) \ > > + pdx_to_mfn((unsigned long)((pg) - frame_table) + > > frametable_base_pdx) > > Do yourself a favour and not introduce frametable_base_pdx to begin > with. > > Every RISC-V board I can find has things starting from 0 in physical > address space, with RAM starting immediately after. I checked Linux kernel and grep there: [ok@fedora linux-aia]$ grep -Rni "memory@" arch/riscv/boot/dts/ -- exclude "*.tmp" -I arch/riscv/boot/dts/starfive/jh7110-starfive-visionfive-2.dtsi:33: memory@40000000 { arch/riscv/boot/dts/starfive/jh7100-common.dtsi:28: memory@80000000 { arch/riscv/boot/dts/microchip/mpfs-sev-kit.dts:49: ddrc_cache: memory@1000000000 { arch/riscv/boot/dts/microchip/mpfs-m100pfsevp.dts:33: ddrc_cache_lo: memory@80000000 { arch/riscv/boot/dts/microchip/mpfs-m100pfsevp.dts:37: ddrc_cache_hi: memory@1040000000 { arch/riscv/boot/dts/microchip/mpfs-tysom-m.dts:34: ddrc_cache_lo: memory@80000000 { arch/riscv/boot/dts/microchip/mpfs-tysom-m.dts:40: ddrc_cache_hi: memory@1000000000 { arch/riscv/boot/dts/microchip/mpfs-polarberry.dts:22: ddrc_cache_lo: memory@80000000 { arch/riscv/boot/dts/microchip/mpfs-polarberry.dts:27: ddrc_cache_hi: memory@1000000000 { arch/riscv/boot/dts/microchip/mpfs-icicle-kit.dts:57: ddrc_cache_lo: memory@80000000 { arch/riscv/boot/dts/microchip/mpfs-icicle-kit.dts:63: ddrc_cache_hi: memory@1040000000 { arch/riscv/boot/dts/thead/th1520-beaglev-ahead.dts:32: memory@0 { arch/riscv/boot/dts/thead/th1520-lichee-module-4a.dtsi:14: memory@0 { arch/riscv/boot/dts/sophgo/cv1800b-milkv-duo.dts:26: memory@80000000 { arch/riscv/boot/dts/sophgo/cv1812h.dtsi:12: memory@80000000 { arch/riscv/boot/dts/sifive/hifive-unmatched-a00.dts:26: memory@80000000 { arch/riscv/boot/dts/sifive/hifive-unleashed-a00.dts:25: memory@80000000 { arch/riscv/boot/dts/canaan/k210.dtsi:82: sram: memory@80000000 { And based on that majority of supported by Linux kernel boards has RAM starting not from 0 in physical address space. Am I confusing something? > > Taking the microchip board as an example, RAM actually starts at > 0x8000000, Today we had conversation with the guy from SiFive in xen-devel channel and he mentioned that they are using "starfive visionfive2 and sifive unleashed platforms." which based on the grep above has RAM not at 0 address. Also, QEMU uses 0x8000000. > which means that having frametable_base_pdx and assuming it > does get set to 0x8000 (which isn't even a certainty, given that I > think > you'll need struct pages covering the PLICs), then what you are > trading > off is: > > * Saving 32k of virtual address space only (no need to even allocate > memory for this range of the framtable), by > * Having an extra memory load and add/sub in every page <-> mfn > conversion, which is a screaming hotpath all over Xen. > > It's a terribly short-sighted tradeoff. > > 32k of VA space might be worth saving in a 32bit build (I personally > wouldn't - especially as there's no need to share Xen's VA space with > guests, given no PV guests on ARM/RISC-V), but it's absolutely not at > all in an a 64bit build with TB of VA space available. > > Even if we do find a board with the first interesting thing in the > frametable starting sufficiently away from 0 that it might be worth > considering this slide, then it should still be Kconfig-able in a > similar way to PDX_COMPRESSION. I found your tradeoffs reasonable, but I don't understand how it will work if RAM does not start from 0, as the frametable address and RAM address are linked. I tried to look at the PDX_COMPRESSION config and couldn't find any "slide" there. Could you please clarify this for me? If to use this "slide" would it help to avoid the mentioned above tradeoffs? One more question: if we decide to go without frametable_base_pdx, would it be sufficient to simply remove mentions of it from the code ( at least, for now )? ~ Oleksii > > You don't want to be penalising everyone because of a > theoretical/weird > corner case. > > ~Andrew
On 30.05.2024 19:23, Andrew Cooper wrote: > On 29/05/2024 8:55 pm, Oleksii Kurochko wrote: >> Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com> >> Acked-by: Jan Beulich <jbeulich@suse.com> > > This patch looks like it can go in independently? Or does it depend on > having bitops.h working in practice? The latter, iirc. In fact I had already tried at least twice to pull this ahead. Jan
On Thu, 2024-05-30 at 18:23 +0100, Andrew Cooper wrote: > On 29/05/2024 8:55 pm, Oleksii Kurochko wrote: > > Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com> > > Acked-by: Jan Beulich <jbeulich@suse.com> > > This patch looks like it can go in independently? Or does it depend > on > having bitops.h working in practice? > > However, one very strong suggestion... > > > > diff --git a/xen/arch/riscv/include/asm/mm.h > > b/xen/arch/riscv/include/asm/mm.h > > index 07c7a0abba..cc4a07a71c 100644 > > --- a/xen/arch/riscv/include/asm/mm.h > > +++ b/xen/arch/riscv/include/asm/mm.h > > @@ -3,11 +3,246 @@ > > <snip> > > +/* PDX of the first page in the frame table. */ > > +extern unsigned long frametable_base_pdx; > > + > > +/* Convert between machine frame numbers and page-info structures. > > */ > > +#define > > mfn_to_page(mfn) \ > > + (frame_table + (mfn_to_pdx(mfn) - frametable_base_pdx)) > > +#define > > page_to_mfn(pg) \ > > + pdx_to_mfn((unsigned long)((pg) - frame_table) + > > frametable_base_pdx) > > Do yourself a favour and not introduce frametable_base_pdx to begin > with. To drop frametable_base_pdx if the following changes will be enough: diff --git a/xen/arch/riscv/include/asm/mm.h b/xen/arch/riscv/include/asm/mm.h index cc4a07a71c..fdac7e0646 100644 --- a/xen/arch/riscv/include/asm/mm.h +++ b/xen/arch/riscv/include/asm/mm.h @@ -107,14 +107,11 @@ struct page_info #define frame_table ((struct page_info *)FRAMETABLE_VIRT_START) -/* PDX of the first page in the frame table. */ -extern unsigned long frametable_base_pdx; - /* Convert between machine frame numbers and page-info structures. */ #define mfn_to_page(mfn) \ - (frame_table + (mfn_to_pdx(mfn) - frametable_base_pdx)) + (frame_table + mfn - FRAMETABLE_BASE_OFFSET)) #define page_to_mfn(pg) \ - pdx_to_mfn((unsigned long)((pg) - frame_table) + frametable_base_pdx) + ((unsigned long)((pg) - frame_table) + FRAMETABLE_BASE_OFFSET) static inline void *page_to_virt(const struct page_info *pg) { diff --git a/xen/arch/riscv/mm.c b/xen/arch/riscv/mm.c index 9c0fd80588..8f6dbdc699 100644 --- a/xen/arch/riscv/mm.c +++ b/xen/arch/riscv/mm.c @@ -15,7 +15,7 @@ #include <asm/page.h> #include <asm/processor.h> -unsigned long __ro_after_init frametable_base_pdx; unsigned long __ro_after_init frametable_virt_end; struct mmu_desc { > > Every RISC-V board I can find has things starting from 0 in physical > address space, with RAM starting immediately after. > > Taking the microchip board as an example, RAM actually starts at > 0x8000000, which means that having frametable_base_pdx and assuming > it > does get set to 0x8000 (which isn't even a certainty, given that I > think > you'll need struct pages covering the PLICs), then what you are > trading > off is: > > * Saving 32k of virtual address space only (no need to even allocate > memory for this range of the framtable), by > * Having an extra memory load and add/sub in every page <-> mfn > conversion, which is a screaming hotpath all over Xen. Are you referring here to `mfn_to_pdx()` used in `mfn_to_page()` and `pdx_to_mfn()` in `page_to_mfn()`? My expectation was that when CONFIG_PDX_COMPRESSION is disabled then this macros doesn't do anything: /* pdx<->pfn == identity */ #define pdx_to_pfn(x) (x) #define pfn_to_pdx(x) (x) > > It's a terribly short-sighted tradeoff. > > 32k of VA space might be worth saving in a 32bit build (I personally > wouldn't - especially as there's no need to share Xen's VA space with > guests, given no PV guests on ARM/RISC-V), but it's absolutely not at > all in an a 64bit build with TB of VA space available. Why 32k? If RAM is started at 0x8000_0000 then we have to cover 0x80000 entries, and the size of it is 0x8_0000 = 524288 * 64 = 33554432, so it is 32 Mb. Am I confusing something? P.S.: Should I map this start 32 mb? Or it is enough to have a slide ( FRAMETABLE_BASE_OFFSET ). ~ Oleksii > > Even if we do find a board with the first interesting thing in the > frametable starting sufficiently away from 0 that it might be worth > considering this slide, then it should still be Kconfig-able in a > similar way to PDX_COMPRESSION. > > You don't want to be penalising everyone because of a > theoretical/weird > corner case. > > ~Andrew
On 30/05/2024 7:22 pm, Oleksii K. wrote: > On Thu, 2024-05-30 at 18:23 +0100, Andrew Cooper wrote: >> On 29/05/2024 8:55 pm, Oleksii Kurochko wrote: >>> Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com> >>> Acked-by: Jan Beulich <jbeulich@suse.com> >> This patch looks like it can go in independently? Or does it depend >> on >> having bitops.h working in practice? >> >> However, one very strong suggestion... >> >> >>> diff --git a/xen/arch/riscv/include/asm/mm.h >>> b/xen/arch/riscv/include/asm/mm.h >>> index 07c7a0abba..cc4a07a71c 100644 >>> --- a/xen/arch/riscv/include/asm/mm.h >>> +++ b/xen/arch/riscv/include/asm/mm.h >>> @@ -3,11 +3,246 @@ >>> <snip> >>> +/* PDX of the first page in the frame table. */ >>> +extern unsigned long frametable_base_pdx; >>> + >>> +/* Convert between machine frame numbers and page-info structures. >>> */ >>> +#define >>> mfn_to_page(mfn) \ >>> + (frame_table + (mfn_to_pdx(mfn) - frametable_base_pdx)) >>> +#define >>> page_to_mfn(pg) \ >>> + pdx_to_mfn((unsigned long)((pg) - frame_table) + >>> frametable_base_pdx) >> Do yourself a favour and not introduce frametable_base_pdx to begin >> with. >> >> Every RISC-V board I can find has things starting from 0 in physical >> address space, with RAM starting immediately after. > I checked Linux kernel and grep there: > [ok@fedora linux-aia]$ grep -Rni "memory@" arch/riscv/boot/dts/ -- > exclude "*.tmp" -I > arch/riscv/boot/dts/starfive/jh7110-starfive-visionfive-2.dtsi:33: > memory@40000000 { > arch/riscv/boot/dts/starfive/jh7100-common.dtsi:28: memory@80000000 > { > arch/riscv/boot/dts/microchip/mpfs-sev-kit.dts:49: ddrc_cache: > memory@1000000000 { > arch/riscv/boot/dts/microchip/mpfs-m100pfsevp.dts:33: ddrc_cache_lo: > memory@80000000 { > arch/riscv/boot/dts/microchip/mpfs-m100pfsevp.dts:37: ddrc_cache_hi: > memory@1040000000 { > arch/riscv/boot/dts/microchip/mpfs-tysom-m.dts:34: ddrc_cache_lo: > memory@80000000 { > arch/riscv/boot/dts/microchip/mpfs-tysom-m.dts:40: ddrc_cache_hi: > memory@1000000000 { > arch/riscv/boot/dts/microchip/mpfs-polarberry.dts:22: ddrc_cache_lo: > memory@80000000 { > arch/riscv/boot/dts/microchip/mpfs-polarberry.dts:27: ddrc_cache_hi: > memory@1000000000 { > arch/riscv/boot/dts/microchip/mpfs-icicle-kit.dts:57: ddrc_cache_lo: > memory@80000000 { > arch/riscv/boot/dts/microchip/mpfs-icicle-kit.dts:63: ddrc_cache_hi: > memory@1040000000 { > arch/riscv/boot/dts/thead/th1520-beaglev-ahead.dts:32: memory@0 { > arch/riscv/boot/dts/thead/th1520-lichee-module-4a.dtsi:14: > memory@0 { > arch/riscv/boot/dts/sophgo/cv1800b-milkv-duo.dts:26: memory@80000000 > { > arch/riscv/boot/dts/sophgo/cv1812h.dtsi:12: memory@80000000 { > arch/riscv/boot/dts/sifive/hifive-unmatched-a00.dts:26: memory@80000000 > { > arch/riscv/boot/dts/sifive/hifive-unleashed-a00.dts:25: memory@80000000 > { > arch/riscv/boot/dts/canaan/k210.dtsi:82: sram: memory@80000000 { > > And based on that majority of supported by Linux kernel boards has RAM > starting not from 0 in physical address space. Am I confusing > something? > >> Taking the microchip board as an example, RAM actually starts at >> 0x8000000, > Today we had conversation with the guy from SiFive in xen-devel channel > and he mentioned that they are using "starfive visionfive2 and sifive > unleashed platforms." which based on the grep above has RAM not at 0 > address. > > Also, QEMU uses 0x8000000. > >> which means that having frametable_base_pdx and assuming it >> does get set to 0x8000 (which isn't even a certainty, given that I >> think >> you'll need struct pages covering the PLICs), then what you are >> trading >> off is: >> >> * Saving 32k of virtual address space only (no need to even allocate >> memory for this range of the framtable), by >> * Having an extra memory load and add/sub in every page <-> mfn >> conversion, which is a screaming hotpath all over Xen. >> >> It's a terribly short-sighted tradeoff. >> >> 32k of VA space might be worth saving in a 32bit build (I personally >> wouldn't - especially as there's no need to share Xen's VA space with >> guests, given no PV guests on ARM/RISC-V), but it's absolutely not at >> all in an a 64bit build with TB of VA space available. >> >> Even if we do find a board with the first interesting thing in the >> frametable starting sufficiently away from 0 that it might be worth >> considering this slide, then it should still be Kconfig-able in a >> similar way to PDX_COMPRESSION. > I found your tradeoffs reasonable, but I don't understand how it will > work if RAM does not start from 0, as the frametable address and RAM > address are linked. > I tried to look at the PDX_COMPRESSION config and couldn't find any > "slide" there. Could you please clarify this for me? > If to use this "slide" would it help to avoid the mentioned above > tradeoffs? > > One more question: if we decide to go without frametable_base_pdx, > would it be sufficient to simply remove mentions of it from the code ( > at least, for now )? There is a relationship between system/host physical addresses (what Xen called maddr/mfn), and the frametable. The frametable has one entry per mfn. In the most simple case, there's a 1:1 relationship. i.e. frametable[0] = maddr(0), frametable[1] = maddr(4k) etc. This is very simple, and very easy to calculate (page_to_mfn()/mfn_to_page()). The frametable is one big array. It starts at a compile-time fixed address, and needs to be long enough to cover everything interesting in memory. Therefore it potentially takes a large amount of virtual address space. However, only interesting maddrs need to have data in the frametable, so it's fine for the backing RAM to be sparsely allocated/mapped in the frametable virtual addresses. For 64bit, that's really all you need, because there's always far more virtual address space than physical RAM in the system, even when you're looking at TB-scale giant servers. For 32bit, virtual address space is a limited resource. (Also to an extent, 64bit x86 with PV guests because we give 98% of the virtual address space to the guest kernel to use.) There are two tricks to reduce the virtual address space used, but they both cost performance in fastpaths. 1) PDX Compression. PDX compression makes a non-linear mfn <-> maddr mapping. This is for a usecase where you've got multiple RAM banks which are separated by a large distance (and evenly spaced), then you can "compress" a single range of 0's out of the middle of the system/host physical address. The cost is that all page <-> mfn conversions need to read two masks and a shift-count from variables in memory, to split/shift/recombine the address bits. 2) A slide, which is frametable_base_pdx in this context. When there's a big gap between 0 and the start of something interesting, you could chop out that range by just subtracting base_pdx. What qualifies as "big" is subjective, but Qemu starting at 128M certainly does not qualify as big enough to warrant frametable_base_pdx. This is less expensive that PDX compression. It only adds one memory read to the fastpath, but it also doesn't save as much virtual address space as PDX compression. When virtual address space is a major constraint (32 bit builds), both of these techniques are worth doing. But when there's no constraint on virtual address space (64 bit builds), there's no reason to use either; and the performance will definitely improve as a result. ~Andrew
On Tue, 2024-06-11 at 16:53 +0100, Andrew Cooper wrote: > On 30/05/2024 7:22 pm, Oleksii K. wrote: > > On Thu, 2024-05-30 at 18:23 +0100, Andrew Cooper wrote: > > > On 29/05/2024 8:55 pm, Oleksii Kurochko wrote: > > > > Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com> > > > > Acked-by: Jan Beulich <jbeulich@suse.com> > > > This patch looks like it can go in independently? Or does it > > > depend > > > on > > > having bitops.h working in practice? > > > > > > However, one very strong suggestion... > > > > > > > > > > diff --git a/xen/arch/riscv/include/asm/mm.h > > > > b/xen/arch/riscv/include/asm/mm.h > > > > index 07c7a0abba..cc4a07a71c 100644 > > > > --- a/xen/arch/riscv/include/asm/mm.h > > > > +++ b/xen/arch/riscv/include/asm/mm.h > > > > @@ -3,11 +3,246 @@ > > > > <snip> > > > > +/* PDX of the first page in the frame table. */ > > > > +extern unsigned long frametable_base_pdx; > > > > + > > > > +/* Convert between machine frame numbers and page-info > > > > structures. > > > > */ > > > > +#define > > > > mfn_to_page(mfn) \ > > > > + (frame_table + (mfn_to_pdx(mfn) - frametable_base_pdx)) > > > > +#define > > > > page_to_mfn(pg) \ > > > > + pdx_to_mfn((unsigned long)((pg) - frame_table) + > > > > frametable_base_pdx) > > > Do yourself a favour and not introduce frametable_base_pdx to > > > begin > > > with. > > > > > > Every RISC-V board I can find has things starting from 0 in > > > physical > > > address space, with RAM starting immediately after. > > I checked Linux kernel and grep there: > > [ok@fedora linux-aia]$ grep -Rni "memory@" arch/riscv/boot/dts/ > > -- > > exclude "*.tmp" -I > > arch/riscv/boot/dts/starfive/jh7110-starfive-visionfive- > > 2.dtsi:33: > > memory@40000000 { > > arch/riscv/boot/dts/starfive/jh7100-common.dtsi:28: > > memory@80000000 > > { > > arch/riscv/boot/dts/microchip/mpfs-sev-kit.dts:49: > > ddrc_cache: > > memory@1000000000 { > > arch/riscv/boot/dts/microchip/mpfs-m100pfsevp.dts:33: > > ddrc_cache_lo: > > memory@80000000 { > > arch/riscv/boot/dts/microchip/mpfs-m100pfsevp.dts:37: > > ddrc_cache_hi: > > memory@1040000000 { > > arch/riscv/boot/dts/microchip/mpfs-tysom-m.dts:34: > > ddrc_cache_lo: > > memory@80000000 { > > arch/riscv/boot/dts/microchip/mpfs-tysom-m.dts:40: > > ddrc_cache_hi: > > memory@1000000000 { > > arch/riscv/boot/dts/microchip/mpfs-polarberry.dts:22: > > ddrc_cache_lo: > > memory@80000000 { > > arch/riscv/boot/dts/microchip/mpfs-polarberry.dts:27: > > ddrc_cache_hi: > > memory@1000000000 { > > arch/riscv/boot/dts/microchip/mpfs-icicle-kit.dts:57: > > ddrc_cache_lo: > > memory@80000000 { > > arch/riscv/boot/dts/microchip/mpfs-icicle-kit.dts:63: > > ddrc_cache_hi: > > memory@1040000000 { > > arch/riscv/boot/dts/thead/th1520-beaglev-ahead.dts:32: memory@0 > > { > > arch/riscv/boot/dts/thead/th1520-lichee-module-4a.dtsi:14: > > memory@0 { > > arch/riscv/boot/dts/sophgo/cv1800b-milkv-duo.dts:26: > > memory@80000000 > > { > > arch/riscv/boot/dts/sophgo/cv1812h.dtsi:12: memory@80000000 > > { > > arch/riscv/boot/dts/sifive/hifive-unmatched-a00.dts:26: > > memory@80000000 > > { > > arch/riscv/boot/dts/sifive/hifive-unleashed-a00.dts:25: > > memory@80000000 > > { > > arch/riscv/boot/dts/canaan/k210.dtsi:82: sram: > > memory@80000000 { > > > > And based on that majority of supported by Linux kernel boards has > > RAM > > starting not from 0 in physical address space. Am I confusing > > something? > > > > > Taking the microchip board as an example, RAM actually starts at > > > 0x8000000, > > Today we had conversation with the guy from SiFive in xen-devel > > channel > > and he mentioned that they are using "starfive visionfive2 and > > sifive > > unleashed platforms." which based on the grep above has RAM not at > > 0 > > address. > > > > Also, QEMU uses 0x8000000. > > > > > which means that having frametable_base_pdx and assuming it > > > does get set to 0x8000 (which isn't even a certainty, given that > > > I > > > think > > > you'll need struct pages covering the PLICs), then what you are > > > trading > > > off is: > > > > > > * Saving 32k of virtual address space only (no need to even > > > allocate > > > memory for this range of the framtable), by > > > * Having an extra memory load and add/sub in every page <-> mfn > > > conversion, which is a screaming hotpath all over Xen. > > > > > > It's a terribly short-sighted tradeoff. > > > > > > 32k of VA space might be worth saving in a 32bit build (I > > > personally > > > wouldn't - especially as there's no need to share Xen's VA space > > > with > > > guests, given no PV guests on ARM/RISC-V), but it's absolutely > > > not at > > > all in an a 64bit build with TB of VA space available. > > > > > > Even if we do find a board with the first interesting thing in > > > the > > > frametable starting sufficiently away from 0 that it might be > > > worth > > > considering this slide, then it should still be Kconfig-able in a > > > similar way to PDX_COMPRESSION. > > I found your tradeoffs reasonable, but I don't understand how it > > will > > work if RAM does not start from 0, as the frametable address and > > RAM > > address are linked. > > I tried to look at the PDX_COMPRESSION config and couldn't find any > > "slide" there. Could you please clarify this for me? > > If to use this "slide" would it help to avoid the mentioned above > > tradeoffs? > > > > One more question: if we decide to go without frametable_base_pdx, > > would it be sufficient to simply remove mentions of it from the > > code ( > > at least, for now )? > > There is a relationship between system/host physical addresses (what > Xen > called maddr/mfn), and the frametable. The frametable has one entry > per > mfn. > > In the most simple case, there's a 1:1 relationship. i.e. > frametable[0] > = maddr(0), frametable[1] = maddr(4k) etc. This is very simple, and > very easy to calculate (page_to_mfn()/mfn_to_page()). > > The frametable is one big array. It starts at a compile-time fixed > address, and needs to be long enough to cover everything interesting > in > memory. Therefore it potentially takes a large amount of virtual > address space. > > However, only interesting maddrs need to have data in the frametable, > so > it's fine for the backing RAM to be sparsely allocated/mapped in the > frametable virtual addresses. > > For 64bit, that's really all you need, because there's always far > more > virtual address space than physical RAM in the system, even when > you're > looking at TB-scale giant servers. > > > For 32bit, virtual address space is a limited resource. (Also to an > extent, 64bit x86 with PV guests because we give 98% of the virtual > address space to the guest kernel to use.) > > There are two tricks to reduce the virtual address space used, but > they > both cost performance in fastpaths. > > 1) PDX Compression. > > PDX compression makes a non-linear mfn <-> maddr mapping. This is > for a > usecase where you've got multiple RAM banks which are separated by a > large distance (and evenly spaced), then you can "compress" a single > range of 0's out of the middle of the system/host physical address. > > The cost is that all page <-> mfn conversions need to read two masks > and > a shift-count from variables in memory, to split/shift/recombine the > address bits. > > 2) A slide, which is frametable_base_pdx in this context. > > When there's a big gap between 0 and the start of something > interesting, > you could chop out that range by just subtracting base_pdx. What > qualifies as "big" is subjective, but Qemu starting at 128M certainly > does not qualify as big enough to warrant frametable_base_pdx. > > This is less expensive that PDX compression. It only adds one memory > read to the fastpath, but it also doesn't save as much virtual > address > space as PDX compression. > > > When virtual address space is a major constraint (32 bit builds), > both > of these techniques are worth doing. But when there's no constraint > on > virtual address space (64 bit builds), there's no reason to use > either; > and the performance will definitely improve as a result. Thanks for such good explanation. For RISC-V we have PDX config disabled as I haven't seen multiple RAM banks at boards which has hypervisor extension. Thereby mfn_to_pdx() and pdx_to_mfn() do nothing. The same for frametable_base_pdx, in case of PDX disabled, it just an offset ( or a slide ). IIUUC, you meant that it make sense to map frametable not to the address of where RAM is started, but to 0, so then we don't need this +-frametable_base_pdx. The price for that is loosing of VA space for the range from 0 to RAM start address. Right now, we are trying to support 3 boards with the following RAM address: 1. 0x8000_0000 - QEMU and SiFive board 2. 0x40_0000_0000 - Microchip board So if we mapping frametable to 0 ( not RAM start ) we will loose: 1. 0x8000_0 ( amount of page entries to cover range [0, 0x8000_0000] ) * 64 ( size of struct page_info ) = 32 MB 2. 0x40_0000_0 ( amount of page entries to cover range [0, 0x40_0000_0000] ) * 64 ( size of struct page_info ) = 4 Gb In terms of available virtual address space for RV-64 we can consider both options as acceptable. [OPTION 1] If we accepting of loosing 4 Gb of VA then we could implement mfn_to_page() and page_to_mfn() in the following way: ``` diff --git a/xen/arch/riscv/include/asm/mm.h b/xen/arch/riscv/include/asm/mm.h index cc4a07a71c..fdac7e0646 100644 --- a/xen/arch/riscv/include/asm/mm.h +++ b/xen/arch/riscv/include/asm/mm.h @@ -107,14 +107,11 @@ struct page_info #define frame_table ((struct page_info *)FRAMETABLE_VIRT_START) -/* PDX of the first page in the frame table. */ -extern unsigned long frametable_base_pdx; - /* Convert between machine frame numbers and page-info structures. */ #define mfn_to_page(mfn) \ - (frame_table + (mfn_to_pdx(mfn) - frametable_base_pdx)) + (frame_table + mfn)) #define page_to_mfn(pg) \ - pdx_to_mfn((unsigned long)((pg) - frame_table) + frametable_base_pdx) + ((unsigned long)((pg) - frame_table)) static inline void *page_to_virt(const struct page_info *pg) { diff --git a/xen/arch/riscv/mm.c b/xen/arch/riscv/mm.c index 9c0fd80588..8f6dbdc699 100644 --- a/xen/arch/riscv/mm.c +++ b/xen/arch/riscv/mm.c @@ -15,7 +15,7 @@ #include <asm/page.h> #include <asm/processor.h> -unsigned long __ro_after_init frametable_base_pdx; unsigned long __ro_after_init frametable_virt_end; struct mmu_desc { ``` [OPTION 2] If we are not accepting of loosing 4 Gb of VA I think that there is no sense to rework that in the following way: ``` diff --git a/xen/arch/riscv/include/asm/mm.h b/xen/arch/riscv/include/asm/mm.h index cc4a07a71c..fdac7e0646 100644 --- a/xen/arch/riscv/include/asm/mm.h +++ b/xen/arch/riscv/include/asm/mm.h @@ -107,14 +107,11 @@ struct page_info #define frame_table ((struct page_info *)FRAMETABLE_VIRT_START) -/* PDX of the first page in the frame table. */ -extern unsigned long frametable_base_pdx; - /* Convert between machine frame numbers and page-info structures. */ #define mfn_to_page(mfn) \ - (frame_table + (mfn_to_pdx(mfn) - frametable_base_pdx)) + (frame_table + mfn - FRAMETABLE_BASE_OFFSET)) #define page_to_mfn(pg) \ - pdx_to_mfn((unsigned long)((pg) - frame_table) + frametable_base_pdx) + ((unsigned long)((pg) - frame_table) + FRAMETABLE_BASE_OFFSET) static inline void *page_to_virt(const struct page_info *pg) { diff --git a/xen/arch/riscv/mm.c b/xen/arch/riscv/mm.c index 9c0fd80588..8f6dbdc699 100644 --- a/xen/arch/riscv/mm.c +++ b/xen/arch/riscv/mm.c @@ -15,7 +15,7 @@ #include <asm/page.h> #include <asm/processor.h> -unsigned long __ro_after_init frametable_base_pdx; unsigned long __ro_after_init frametable_virt_end; struct mmu_desc { ``` And I am not sure that there is any sense in option 2 as basically it the same to have the following macros definition with PDX disabled: ``` /* Convert between machine frame numbers and page-info structures. */ #define mfn_to_page(mfn) \ (frame_table + (mfn_to_pdx(mfn) - frametable_base_pdx)) #define page_to_mfn(pg) \ pdx_to_mfn((unsigned long)((pg) - frame_table) + frametable_base_pdx) ``` The only sense of option 2 is the case when FRAMETABLE_BASE_OFFSET is equal to 0, so the compiler will generate the code without additional sub/add of FRAMETABLE_BASE_OFFSET. Could you please clarify if my understanding is correct? Should we still change the definition of mfn_to_page() and page_to_mfn() with the fact that PDX is disabled? If yes, OPTION_1 is okay or I am missing something? Thanks in advance. ~ Oleksii
On 11/06/2024 7:23 pm, Oleksii K. wrote: > On Tue, 2024-06-11 at 16:53 +0100, Andrew Cooper wrote: >> On 30/05/2024 7:22 pm, Oleksii K. wrote: >>> On Thu, 2024-05-30 at 18:23 +0100, Andrew Cooper wrote: >>>> On 29/05/2024 8:55 pm, Oleksii Kurochko wrote: >>>>> Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com> >>>>> Acked-by: Jan Beulich <jbeulich@suse.com> >>>> This patch looks like it can go in independently? Or does it >>>> depend >>>> on >>>> having bitops.h working in practice? >>>> >>>> However, one very strong suggestion... >>>> >>>> >>>>> diff --git a/xen/arch/riscv/include/asm/mm.h >>>>> b/xen/arch/riscv/include/asm/mm.h >>>>> index 07c7a0abba..cc4a07a71c 100644 >>>>> --- a/xen/arch/riscv/include/asm/mm.h >>>>> +++ b/xen/arch/riscv/include/asm/mm.h >>>>> @@ -3,11 +3,246 @@ >>>>> <snip> >>>>> +/* PDX of the first page in the frame table. */ >>>>> +extern unsigned long frametable_base_pdx; >>>>> + >>>>> +/* Convert between machine frame numbers and page-info >>>>> structures. >>>>> */ >>>>> +#define >>>>> mfn_to_page(mfn) \ >>>>> + (frame_table + (mfn_to_pdx(mfn) - frametable_base_pdx)) >>>>> +#define >>>>> page_to_mfn(pg) \ >>>>> + pdx_to_mfn((unsigned long)((pg) - frame_table) + >>>>> frametable_base_pdx) >>>> Do yourself a favour and not introduce frametable_base_pdx to >>>> begin >>>> with. >>>> >>>> Every RISC-V board I can find has things starting from 0 in >>>> physical >>>> address space, with RAM starting immediately after. >>> I checked Linux kernel and grep there: >>> [ok@fedora linux-aia]$ grep -Rni "memory@" arch/riscv/boot/dts/ >>> -- >>> exclude "*.tmp" -I >>> arch/riscv/boot/dts/starfive/jh7110-starfive-visionfive- >>> 2.dtsi:33: >>> memory@40000000 { >>> arch/riscv/boot/dts/starfive/jh7100-common.dtsi:28: >>> memory@80000000 >>> { >>> arch/riscv/boot/dts/microchip/mpfs-sev-kit.dts:49: >>> ddrc_cache: >>> memory@1000000000 { >>> arch/riscv/boot/dts/microchip/mpfs-m100pfsevp.dts:33: >>> ddrc_cache_lo: >>> memory@80000000 { >>> arch/riscv/boot/dts/microchip/mpfs-m100pfsevp.dts:37: >>> ddrc_cache_hi: >>> memory@1040000000 { >>> arch/riscv/boot/dts/microchip/mpfs-tysom-m.dts:34: >>> ddrc_cache_lo: >>> memory@80000000 { >>> arch/riscv/boot/dts/microchip/mpfs-tysom-m.dts:40: >>> ddrc_cache_hi: >>> memory@1000000000 { >>> arch/riscv/boot/dts/microchip/mpfs-polarberry.dts:22: >>> ddrc_cache_lo: >>> memory@80000000 { >>> arch/riscv/boot/dts/microchip/mpfs-polarberry.dts:27: >>> ddrc_cache_hi: >>> memory@1000000000 { >>> arch/riscv/boot/dts/microchip/mpfs-icicle-kit.dts:57: >>> ddrc_cache_lo: >>> memory@80000000 { >>> arch/riscv/boot/dts/microchip/mpfs-icicle-kit.dts:63: >>> ddrc_cache_hi: >>> memory@1040000000 { >>> arch/riscv/boot/dts/thead/th1520-beaglev-ahead.dts:32: memory@0 >>> { >>> arch/riscv/boot/dts/thead/th1520-lichee-module-4a.dtsi:14: >>> memory@0 { >>> arch/riscv/boot/dts/sophgo/cv1800b-milkv-duo.dts:26: >>> memory@80000000 >>> { >>> arch/riscv/boot/dts/sophgo/cv1812h.dtsi:12: memory@80000000 >>> { >>> arch/riscv/boot/dts/sifive/hifive-unmatched-a00.dts:26: >>> memory@80000000 >>> { >>> arch/riscv/boot/dts/sifive/hifive-unleashed-a00.dts:25: >>> memory@80000000 >>> { >>> arch/riscv/boot/dts/canaan/k210.dtsi:82: sram: >>> memory@80000000 { >>> >>> And based on that majority of supported by Linux kernel boards has >>> RAM >>> starting not from 0 in physical address space. Am I confusing >>> something? >>> >>>> Taking the microchip board as an example, RAM actually starts at >>>> 0x8000000, >>> Today we had conversation with the guy from SiFive in xen-devel >>> channel >>> and he mentioned that they are using "starfive visionfive2 and >>> sifive >>> unleashed platforms." which based on the grep above has RAM not at >>> 0 >>> address. >>> >>> Also, QEMU uses 0x8000000. >>> >>>> which means that having frametable_base_pdx and assuming it >>>> does get set to 0x8000 (which isn't even a certainty, given that >>>> I >>>> think >>>> you'll need struct pages covering the PLICs), then what you are >>>> trading >>>> off is: >>>> >>>> * Saving 32k of virtual address space only (no need to even >>>> allocate >>>> memory for this range of the framtable), by >>>> * Having an extra memory load and add/sub in every page <-> mfn >>>> conversion, which is a screaming hotpath all over Xen. >>>> >>>> It's a terribly short-sighted tradeoff. >>>> >>>> 32k of VA space might be worth saving in a 32bit build (I >>>> personally >>>> wouldn't - especially as there's no need to share Xen's VA space >>>> with >>>> guests, given no PV guests on ARM/RISC-V), but it's absolutely >>>> not at >>>> all in an a 64bit build with TB of VA space available. >>>> >>>> Even if we do find a board with the first interesting thing in >>>> the >>>> frametable starting sufficiently away from 0 that it might be >>>> worth >>>> considering this slide, then it should still be Kconfig-able in a >>>> similar way to PDX_COMPRESSION. >>> I found your tradeoffs reasonable, but I don't understand how it >>> will >>> work if RAM does not start from 0, as the frametable address and >>> RAM >>> address are linked. >>> I tried to look at the PDX_COMPRESSION config and couldn't find any >>> "slide" there. Could you please clarify this for me? >>> If to use this "slide" would it help to avoid the mentioned above >>> tradeoffs? >>> >>> One more question: if we decide to go without frametable_base_pdx, >>> would it be sufficient to simply remove mentions of it from the >>> code ( >>> at least, for now )? >> There is a relationship between system/host physical addresses (what >> Xen >> called maddr/mfn), and the frametable. The frametable has one entry >> per >> mfn. >> >> In the most simple case, there's a 1:1 relationship. i.e. >> frametable[0] >> = maddr(0), frametable[1] = maddr(4k) etc. This is very simple, and >> very easy to calculate (page_to_mfn()/mfn_to_page()). >> >> The frametable is one big array. It starts at a compile-time fixed >> address, and needs to be long enough to cover everything interesting >> in >> memory. Therefore it potentially takes a large amount of virtual >> address space. >> >> However, only interesting maddrs need to have data in the frametable, >> so >> it's fine for the backing RAM to be sparsely allocated/mapped in the >> frametable virtual addresses. >> >> For 64bit, that's really all you need, because there's always far >> more >> virtual address space than physical RAM in the system, even when >> you're >> looking at TB-scale giant servers. >> >> >> For 32bit, virtual address space is a limited resource. (Also to an >> extent, 64bit x86 with PV guests because we give 98% of the virtual >> address space to the guest kernel to use.) >> >> There are two tricks to reduce the virtual address space used, but >> they >> both cost performance in fastpaths. >> >> 1) PDX Compression. >> >> PDX compression makes a non-linear mfn <-> maddr mapping. This is >> for a >> usecase where you've got multiple RAM banks which are separated by a >> large distance (and evenly spaced), then you can "compress" a single >> range of 0's out of the middle of the system/host physical address. >> >> The cost is that all page <-> mfn conversions need to read two masks >> and >> a shift-count from variables in memory, to split/shift/recombine the >> address bits. >> >> 2) A slide, which is frametable_base_pdx in this context. >> >> When there's a big gap between 0 and the start of something >> interesting, >> you could chop out that range by just subtracting base_pdx. What >> qualifies as "big" is subjective, but Qemu starting at 128M certainly >> does not qualify as big enough to warrant frametable_base_pdx. >> >> This is less expensive that PDX compression. It only adds one memory >> read to the fastpath, but it also doesn't save as much virtual >> address >> space as PDX compression. >> >> >> When virtual address space is a major constraint (32 bit builds), >> both >> of these techniques are worth doing. But when there's no constraint >> on >> virtual address space (64 bit builds), there's no reason to use >> either; >> and the performance will definitely improve as a result. > Thanks for such good explanation. > > For RISC-V we have PDX config disabled as I haven't seen multiple RAM > banks at boards which has hypervisor extension. Thereby mfn_to_pdx() > and pdx_to_mfn() do nothing. The same for frametable_base_pdx, in case > of PDX disabled, it just an offset ( or a slide ). > > IIUUC, you meant that it make sense to map frametable not to the > address of where RAM is started, but to 0, so then we don't need this > +-frametable_base_pdx. The price for that is loosing of VA space for > the range from 0 to RAM start address. > > Right now, we are trying to support 3 boards with the following RAM > address: > 1. 0x8000_0000 - QEMU and SiFive board > 2. 0x40_0000_0000 - Microchip board > > So if we mapping frametable to 0 ( not RAM start ) we will loose: > 1. 0x8000_0 ( amount of page entries to cover range [0, 0x8000_0000] ) > * 64 ( size of struct page_info ) = 32 MB > 2. 0x40_0000_0 ( amount of page entries to cover range [0, > 0x40_0000_0000] ) * 64 ( size of struct page_info ) = 4 Gb > > In terms of available virtual address space for RV-64 we can consider > both options as acceptable. For Qemu and SiFive, 32M is definitely not worth worrying about. I personally wouldn't worry about Microchip either. That's 4G of 1T VA space (assuming Sv39). > [OPTION 1] If we accepting of loosing 4 Gb of VA then we could > implement mfn_to_page() and page_to_mfn() in the following way: > ``` > diff --git a/xen/arch/riscv/include/asm/mm.h > b/xen/arch/riscv/include/asm/mm.h > index cc4a07a71c..fdac7e0646 100644 > --- a/xen/arch/riscv/include/asm/mm.h > +++ b/xen/arch/riscv/include/asm/mm.h > @@ -107,14 +107,11 @@ struct page_info > > #define frame_table ((struct page_info *)FRAMETABLE_VIRT_START) > > -/* PDX of the first page in the frame table. */ > -extern unsigned long frametable_base_pdx; > - > /* Convert between machine frame numbers and page-info structures. > */ > #define mfn_to_page(mfn) > \ > - (frame_table + (mfn_to_pdx(mfn) - frametable_base_pdx)) > + (frame_table + mfn)) > #define page_to_mfn(pg) > \ > - pdx_to_mfn((unsigned long)((pg) - frame_table) + > frametable_base_pdx) > + ((unsigned long)((pg) - frame_table)) > > static inline void *page_to_virt(const struct page_info *pg) > { > diff --git a/xen/arch/riscv/mm.c b/xen/arch/riscv/mm.c > index 9c0fd80588..8f6dbdc699 100644 > --- a/xen/arch/riscv/mm.c > +++ b/xen/arch/riscv/mm.c > @@ -15,7 +15,7 @@ > #include <asm/page.h> > #include <asm/processor.h> > > -unsigned long __ro_after_init frametable_base_pdx; > unsigned long __ro_after_init frametable_virt_end; > > struct mmu_desc { > ``` I firmly recommend option 1, especially at this point. If specific boards really have a problem with losing 4G of VA space, then option 2 can be added easily at a later point. That said, I'd think carefully about doing option 2. Even subtracting a constant - which is far better than subtracting a variable - is still extra overhead in fastpaths. Option 2 needs careful consideration on a board-by-board case. ~Andrew
On Fri, 2024-06-14 at 10:47 +0100, Andrew Cooper wrote: > On 11/06/2024 7:23 pm, Oleksii K. wrote: > > On Tue, 2024-06-11 at 16:53 +0100, Andrew Cooper wrote: > > > On 30/05/2024 7:22 pm, Oleksii K. wrote: > > > > On Thu, 2024-05-30 at 18:23 +0100, Andrew Cooper wrote: > > > > > On 29/05/2024 8:55 pm, Oleksii Kurochko wrote: > > > > > > Signed-off-by: Oleksii Kurochko > > > > > > <oleksii.kurochko@gmail.com> > > > > > > Acked-by: Jan Beulich <jbeulich@suse.com> > > > > > This patch looks like it can go in independently? Or does it > > > > > depend > > > > > on > > > > > having bitops.h working in practice? > > > > > > > > > > However, one very strong suggestion... > > > > > > > > > > > > > > > > diff --git a/xen/arch/riscv/include/asm/mm.h > > > > > > b/xen/arch/riscv/include/asm/mm.h > > > > > > index 07c7a0abba..cc4a07a71c 100644 > > > > > > --- a/xen/arch/riscv/include/asm/mm.h > > > > > > +++ b/xen/arch/riscv/include/asm/mm.h > > > > > > @@ -3,11 +3,246 @@ > > > > > > <snip> > > > > > > +/* PDX of the first page in the frame table. */ > > > > > > +extern unsigned long frametable_base_pdx; > > > > > > + > > > > > > +/* Convert between machine frame numbers and page-info > > > > > > structures. > > > > > > */ > > > > > > +#define > > > > > > mfn_to_page(mfn) > > > > > > \ > > > > > > + (frame_table + (mfn_to_pdx(mfn) - > > > > > > frametable_base_pdx)) > > > > > > +#define > > > > > > page_to_mfn(pg) > > > > > > \ > > > > > > + pdx_to_mfn((unsigned long)((pg) - frame_table) + > > > > > > frametable_base_pdx) > > > > > Do yourself a favour and not introduce frametable_base_pdx to > > > > > begin > > > > > with. > > > > > > > > > > Every RISC-V board I can find has things starting from 0 in > > > > > physical > > > > > address space, with RAM starting immediately after. > > > > I checked Linux kernel and grep there: > > > > [ok@fedora linux-aia]$ grep -Rni "memory@" > > > > arch/riscv/boot/dts/ > > > > -- > > > > exclude "*.tmp" -I > > > > arch/riscv/boot/dts/starfive/jh7110-starfive-visionfive- > > > > 2.dtsi:33: > > > > memory@40000000 { > > > > arch/riscv/boot/dts/starfive/jh7100-common.dtsi:28: > > > > memory@80000000 > > > > { > > > > arch/riscv/boot/dts/microchip/mpfs-sev-kit.dts:49: > > > > ddrc_cache: > > > > memory@1000000000 { > > > > arch/riscv/boot/dts/microchip/mpfs-m100pfsevp.dts:33: > > > > ddrc_cache_lo: > > > > memory@80000000 { > > > > arch/riscv/boot/dts/microchip/mpfs-m100pfsevp.dts:37: > > > > ddrc_cache_hi: > > > > memory@1040000000 { > > > > arch/riscv/boot/dts/microchip/mpfs-tysom-m.dts:34: > > > > ddrc_cache_lo: > > > > memory@80000000 { > > > > arch/riscv/boot/dts/microchip/mpfs-tysom-m.dts:40: > > > > ddrc_cache_hi: > > > > memory@1000000000 { > > > > arch/riscv/boot/dts/microchip/mpfs-polarberry.dts:22: > > > > ddrc_cache_lo: > > > > memory@80000000 { > > > > arch/riscv/boot/dts/microchip/mpfs-polarberry.dts:27: > > > > ddrc_cache_hi: > > > > memory@1000000000 { > > > > arch/riscv/boot/dts/microchip/mpfs-icicle-kit.dts:57: > > > > ddrc_cache_lo: > > > > memory@80000000 { > > > > arch/riscv/boot/dts/microchip/mpfs-icicle-kit.dts:63: > > > > ddrc_cache_hi: > > > > memory@1040000000 { > > > > arch/riscv/boot/dts/thead/th1520-beaglev-ahead.dts:32: > > > > memory@0 > > > > { > > > > arch/riscv/boot/dts/thead/th1520-lichee-module- > > > > 4a.dtsi:14: > > > > memory@0 { > > > > arch/riscv/boot/dts/sophgo/cv1800b-milkv-duo.dts:26: > > > > memory@80000000 > > > > { > > > > arch/riscv/boot/dts/sophgo/cv1812h.dtsi:12: > > > > memory@80000000 > > > > { > > > > arch/riscv/boot/dts/sifive/hifive-unmatched-a00.dts:26: > > > > memory@80000000 > > > > { > > > > arch/riscv/boot/dts/sifive/hifive-unleashed-a00.dts:25: > > > > memory@80000000 > > > > { > > > > arch/riscv/boot/dts/canaan/k210.dtsi:82: sram: > > > > memory@80000000 { > > > > > > > > And based on that majority of supported by Linux kernel boards > > > > has > > > > RAM > > > > starting not from 0 in physical address space. Am I confusing > > > > something? > > > > > > > > > Taking the microchip board as an example, RAM actually starts > > > > > at > > > > > 0x8000000, > > > > Today we had conversation with the guy from SiFive in xen-devel > > > > channel > > > > and he mentioned that they are using "starfive visionfive2 and > > > > sifive > > > > unleashed platforms." which based on the grep above has RAM not > > > > at > > > > 0 > > > > address. > > > > > > > > Also, QEMU uses 0x8000000. > > > > > > > > > which means that having frametable_base_pdx and assuming it > > > > > does get set to 0x8000 (which isn't even a certainty, given > > > > > that > > > > > I > > > > > think > > > > > you'll need struct pages covering the PLICs), then what you > > > > > are > > > > > trading > > > > > off is: > > > > > > > > > > * Saving 32k of virtual address space only (no need to even > > > > > allocate > > > > > memory for this range of the framtable), by > > > > > * Having an extra memory load and add/sub in every page <-> > > > > > mfn > > > > > conversion, which is a screaming hotpath all over Xen. > > > > > > > > > > It's a terribly short-sighted tradeoff. > > > > > > > > > > 32k of VA space might be worth saving in a 32bit build (I > > > > > personally > > > > > wouldn't - especially as there's no need to share Xen's VA > > > > > space > > > > > with > > > > > guests, given no PV guests on ARM/RISC-V), but it's > > > > > absolutely > > > > > not at > > > > > all in an a 64bit build with TB of VA space available. > > > > > > > > > > Even if we do find a board with the first interesting thing > > > > > in > > > > > the > > > > > frametable starting sufficiently away from 0 that it might be > > > > > worth > > > > > considering this slide, then it should still be Kconfig-able > > > > > in a > > > > > similar way to PDX_COMPRESSION. > > > > I found your tradeoffs reasonable, but I don't understand how > > > > it > > > > will > > > > work if RAM does not start from 0, as the frametable address > > > > and > > > > RAM > > > > address are linked. > > > > I tried to look at the PDX_COMPRESSION config and couldn't find > > > > any > > > > "slide" there. Could you please clarify this for me? > > > > If to use this "slide" would it help to avoid the mentioned > > > > above > > > > tradeoffs? > > > > > > > > One more question: if we decide to go without > > > > frametable_base_pdx, > > > > would it be sufficient to simply remove mentions of it from the > > > > code ( > > > > at least, for now )? > > > There is a relationship between system/host physical addresses > > > (what > > > Xen > > > called maddr/mfn), and the frametable. The frametable has one > > > entry > > > per > > > mfn. > > > > > > In the most simple case, there's a 1:1 relationship. i.e. > > > frametable[0] > > > = maddr(0), frametable[1] = maddr(4k) etc. This is very simple, > > > and > > > very easy to calculate (page_to_mfn()/mfn_to_page()). > > > > > > The frametable is one big array. It starts at a compile-time > > > fixed > > > address, and needs to be long enough to cover everything > > > interesting > > > in > > > memory. Therefore it potentially takes a large amount of virtual > > > address space. > > > > > > However, only interesting maddrs need to have data in the > > > frametable, > > > so > > > it's fine for the backing RAM to be sparsely allocated/mapped in > > > the > > > frametable virtual addresses. > > > > > > For 64bit, that's really all you need, because there's always far > > > more > > > virtual address space than physical RAM in the system, even when > > > you're > > > looking at TB-scale giant servers. > > > > > > > > > For 32bit, virtual address space is a limited resource. (Also to > > > an > > > extent, 64bit x86 with PV guests because we give 98% of the > > > virtual > > > address space to the guest kernel to use.) > > > > > > There are two tricks to reduce the virtual address space used, > > > but > > > they > > > both cost performance in fastpaths. > > > > > > 1) PDX Compression. > > > > > > PDX compression makes a non-linear mfn <-> maddr mapping. This > > > is > > > for a > > > usecase where you've got multiple RAM banks which are separated > > > by a > > > large distance (and evenly spaced), then you can "compress" a > > > single > > > range of 0's out of the middle of the system/host physical > > > address. > > > > > > The cost is that all page <-> mfn conversions need to read two > > > masks > > > and > > > a shift-count from variables in memory, to split/shift/recombine > > > the > > > address bits. > > > > > > 2) A slide, which is frametable_base_pdx in this context. > > > > > > When there's a big gap between 0 and the start of something > > > interesting, > > > you could chop out that range by just subtracting base_pdx. What > > > qualifies as "big" is subjective, but Qemu starting at 128M > > > certainly > > > does not qualify as big enough to warrant frametable_base_pdx. > > > > > > This is less expensive that PDX compression. It only adds one > > > memory > > > read to the fastpath, but it also doesn't save as much virtual > > > address > > > space as PDX compression. > > > > > > > > > When virtual address space is a major constraint (32 bit builds), > > > both > > > of these techniques are worth doing. But when there's no > > > constraint > > > on > > > virtual address space (64 bit builds), there's no reason to use > > > either; > > > and the performance will definitely improve as a result. > > Thanks for such good explanation. > > > > For RISC-V we have PDX config disabled as I haven't seen multiple > > RAM > > banks at boards which has hypervisor extension. Thereby > > mfn_to_pdx() > > and pdx_to_mfn() do nothing. The same for frametable_base_pdx, in > > case > > of PDX disabled, it just an offset ( or a slide ). > > > > IIUUC, you meant that it make sense to map frametable not to the > > address of where RAM is started, but to 0, so then we don't need > > this > > +-frametable_base_pdx. The price for that is loosing of VA space > > for > > the range from 0 to RAM start address. > > > > Right now, we are trying to support 3 boards with the following RAM > > address: > > 1. 0x8000_0000 - QEMU and SiFive board > > 2. 0x40_0000_0000 - Microchip board > > > > So if we mapping frametable to 0 ( not RAM start ) we will loose: > > 1. 0x8000_0 ( amount of page entries to cover range [0, > > 0x8000_0000] ) > > * 64 ( size of struct page_info ) = 32 MB > > 2. 0x40_0000_0 ( amount of page entries to cover range [0, > > 0x40_0000_0000] ) * 64 ( size of struct page_info ) = 4 Gb > > > > In terms of available virtual address space for RV-64 we can > > consider > > both options as acceptable. > > For Qemu and SiFive, 32M is definitely not worth worrying about. > > I personally wouldn't worry about Microchip either. That's 4G of 1T > VA > space (assuming Sv39). > > > [OPTION 1] If we accepting of loosing 4 Gb of VA then we could > > implement mfn_to_page() and page_to_mfn() in the following way: > > ``` > > diff --git a/xen/arch/riscv/include/asm/mm.h > > b/xen/arch/riscv/include/asm/mm.h > > index cc4a07a71c..fdac7e0646 100644 > > --- a/xen/arch/riscv/include/asm/mm.h > > +++ b/xen/arch/riscv/include/asm/mm.h > > @@ -107,14 +107,11 @@ struct page_info > > > > #define frame_table ((struct page_info *)FRAMETABLE_VIRT_START) > > > > -/* PDX of the first page in the frame table. */ > > -extern unsigned long frametable_base_pdx; > > - > > /* Convert between machine frame numbers and page-info > > structures. > > */ > > #define > > mfn_to_page(mfn) > > \ > > - (frame_table + (mfn_to_pdx(mfn) - frametable_base_pdx)) > > + (frame_table + mfn)) > > #define > > page_to_mfn(pg) > > \ > > - pdx_to_mfn((unsigned long)((pg) - frame_table) + > > frametable_base_pdx) > > + ((unsigned long)((pg) - frame_table)) > > > > static inline void *page_to_virt(const struct page_info *pg) > > { > > diff --git a/xen/arch/riscv/mm.c b/xen/arch/riscv/mm.c > > index 9c0fd80588..8f6dbdc699 100644 > > --- a/xen/arch/riscv/mm.c > > +++ b/xen/arch/riscv/mm.c > > @@ -15,7 +15,7 @@ > > #include <asm/page.h> > > #include <asm/processor.h> > > > > -unsigned long __ro_after_init frametable_base_pdx; > > unsigned long __ro_after_init frametable_virt_end; > > > > struct mmu_desc { > > ``` > > I firmly recommend option 1, especially at this point. Jan, as you gave your Acked before, don't you mind to define mfn_to_page() and page_to_mfn as mentioned above ( Option 1 )? ~ Oleksi > > If specific boards really have a problem with losing 4G of VA space, > then option 2 can be added easily at a later point. > > That said, I'd think carefully about doing option 2. Even > subtracting a > constant - which is far better than subtracting a variable - is still > extra overhead in fastpaths. Option 2 needs careful consideration on > a > board-by-board case. > > ~Andrew
On 18.06.2024 10:21, Oleksii K. wrote: > On Fri, 2024-06-14 at 10:47 +0100, Andrew Cooper wrote: >> On 11/06/2024 7:23 pm, Oleksii K. wrote: >>> [OPTION 1] If we accepting of loosing 4 Gb of VA then we could >>> implement mfn_to_page() and page_to_mfn() in the following way: >>> ``` >>> diff --git a/xen/arch/riscv/include/asm/mm.h >>> b/xen/arch/riscv/include/asm/mm.h >>> index cc4a07a71c..fdac7e0646 100644 >>> --- a/xen/arch/riscv/include/asm/mm.h >>> +++ b/xen/arch/riscv/include/asm/mm.h >>> @@ -107,14 +107,11 @@ struct page_info >>> >>> #define frame_table ((struct page_info *)FRAMETABLE_VIRT_START) >>> >>> -/* PDX of the first page in the frame table. */ >>> -extern unsigned long frametable_base_pdx; >>> - >>> /* Convert between machine frame numbers and page-info >>> structures. >>> */ >>> #define >>> mfn_to_page(mfn) >>> \ >>> - (frame_table + (mfn_to_pdx(mfn) - frametable_base_pdx)) >>> + (frame_table + mfn)) >>> #define >>> page_to_mfn(pg) >>> \ >>> - pdx_to_mfn((unsigned long)((pg) - frame_table) + >>> frametable_base_pdx) >>> + ((unsigned long)((pg) - frame_table)) >>> >>> static inline void *page_to_virt(const struct page_info *pg) >>> { >>> diff --git a/xen/arch/riscv/mm.c b/xen/arch/riscv/mm.c >>> index 9c0fd80588..8f6dbdc699 100644 >>> --- a/xen/arch/riscv/mm.c >>> +++ b/xen/arch/riscv/mm.c >>> @@ -15,7 +15,7 @@ >>> #include <asm/page.h> >>> #include <asm/processor.h> >>> >>> -unsigned long __ro_after_init frametable_base_pdx; >>> unsigned long __ro_after_init frametable_virt_end; >>> >>> struct mmu_desc { >>> ``` >> >> I firmly recommend option 1, especially at this point. > Jan, as you gave your Acked before, don't you mind to define > mfn_to_page() and page_to_mfn as mentioned above ( Option 1 )? No, I don't mind. And please feel free to keep my ack if no other significant changes are made. Jan
diff --git a/xen/arch/riscv/include/asm/mm.h b/xen/arch/riscv/include/asm/mm.h index 07c7a0abba..cc4a07a71c 100644 --- a/xen/arch/riscv/include/asm/mm.h +++ b/xen/arch/riscv/include/asm/mm.h @@ -3,11 +3,246 @@ #ifndef _ASM_RISCV_MM_H #define _ASM_RISCV_MM_H +#include <public/xen.h> +#include <xen/bug.h> +#include <xen/mm-frame.h> +#include <xen/pdx.h> +#include <xen/types.h> + #include <asm/page-bits.h> #define pfn_to_paddr(pfn) ((paddr_t)(pfn) << PAGE_SHIFT) #define paddr_to_pfn(pa) ((unsigned long)((pa) >> PAGE_SHIFT)) +#define paddr_to_pdx(pa) mfn_to_pdx(maddr_to_mfn(pa)) +#define gfn_to_gaddr(gfn) pfn_to_paddr(gfn_x(gfn)) +#define gaddr_to_gfn(ga) _gfn(paddr_to_pfn(ga)) +#define mfn_to_maddr(mfn) pfn_to_paddr(mfn_x(mfn)) +#define maddr_to_mfn(ma) _mfn(paddr_to_pfn(ma)) +#define vmap_to_mfn(va) maddr_to_mfn(virt_to_maddr((vaddr_t)(va))) +#define vmap_to_page(va) mfn_to_page(vmap_to_mfn(va)) + +static inline void *maddr_to_virt(paddr_t ma) +{ + BUG_ON("unimplemented"); + return NULL; +} + +#define virt_to_maddr(va) ({ BUG_ON("unimplemented"); 0; }) + +/* Convert between Xen-heap virtual addresses and machine frame numbers. */ +#define __virt_to_mfn(va) mfn_x(maddr_to_mfn(virt_to_maddr(va))) +#define __mfn_to_virt(mfn) maddr_to_virt(mfn_to_maddr(_mfn(mfn))) + +/* + * We define non-underscored wrappers for above conversion functions. + * These are overriden in various source files while underscored version + * remain intact. + */ +#define virt_to_mfn(va) __virt_to_mfn(va) +#define mfn_to_virt(mfn) __mfn_to_virt(mfn) + +struct page_info +{ + /* Each frame can be threaded onto a doubly-linked list. */ + struct page_list_entry list; + + /* Reference count and various PGC_xxx flags and fields. */ + unsigned long count_info; + + /* Context-dependent fields follow... */ + union { + /* Page is in use: ((count_info & PGC_count_mask) != 0). */ + struct { + /* Type reference count and various PGT_xxx flags and fields. */ + unsigned long type_info; + } inuse; + + /* Page is on a free list: ((count_info & PGC_count_mask) == 0). */ + union { + struct { + /* + * Index of the first *possibly* unscrubbed page in the buddy. + * One more bit than maximum possible order to accommodate + * INVALID_DIRTY_IDX. + */ +#define INVALID_DIRTY_IDX ((1UL << (MAX_ORDER + 1)) - 1) + unsigned long first_dirty:MAX_ORDER + 1; + + /* Do TLBs need flushing for safety before next page use? */ + bool need_tlbflush:1; + +#define BUDDY_NOT_SCRUBBING 0 +#define BUDDY_SCRUBBING 1 +#define BUDDY_SCRUB_ABORT 2 + unsigned long scrub_state:2; + }; + + unsigned long val; + } free; + } u; + + union { + /* Page is in use */ + struct { + /* Owner of this page (NULL if page is anonymous). */ + struct domain *domain; + } inuse; + + /* Page is on a free list. */ + struct { + /* Order-size of the free chunk this page is the head of. */ + unsigned int order; + } free; + } v; + + union { + /* + * Timestamp from 'TLB clock', used to avoid extra safety flushes. + * Only valid for: a) free pages, and b) pages with zero type count + */ + uint32_t tlbflush_timestamp; + }; +}; + +#define frame_table ((struct page_info *)FRAMETABLE_VIRT_START) + +/* PDX of the first page in the frame table. */ +extern unsigned long frametable_base_pdx; + +/* Convert between machine frame numbers and page-info structures. */ +#define mfn_to_page(mfn) \ + (frame_table + (mfn_to_pdx(mfn) - frametable_base_pdx)) +#define page_to_mfn(pg) \ + pdx_to_mfn((unsigned long)((pg) - frame_table) + frametable_base_pdx) + +static inline void *page_to_virt(const struct page_info *pg) +{ + return mfn_to_virt(mfn_x(page_to_mfn(pg))); +} + +/* Convert between Xen-heap virtual addresses and page-info structures. */ +static inline struct page_info *virt_to_page(const void *v) +{ + BUG_ON("unimplemented"); + return NULL; +} + +/* + * Common code requires get_page_type and put_page_type. + * We don't care about typecounts so we just do the minimum to make it + * happy. + */ +static inline int get_page_type(struct page_info *page, unsigned long type) +{ + return 1; +} + +static inline void put_page_type(struct page_info *page) +{ +} + +static inline void put_page_and_type(struct page_info *page) +{ + put_page_type(page); + put_page(page); +} + +/* + * RISC-V does not have an M2P, but common code expects a handful of + * M2P-related defines and functions. Provide dummy versions of these. + */ +#define INVALID_M2P_ENTRY (~0UL) +#define SHARED_M2P_ENTRY (~0UL - 1UL) +#define SHARED_M2P(_e) ((_e) == SHARED_M2P_ENTRY) + +#define set_gpfn_from_mfn(mfn, pfn) do { (void)(mfn), (void)(pfn); } while (0) +#define mfn_to_gfn(d, mfn) ((void)(d), _gfn(mfn_x(mfn))) + +#define PDX_GROUP_SHIFT (PAGE_SHIFT + VPN_BITS) + +static inline unsigned long domain_get_maximum_gpfn(struct domain *d) +{ + BUG_ON("unimplemented"); + return 0; +} + +static inline long arch_memory_op(int op, XEN_GUEST_HANDLE_PARAM(void) arg) +{ + BUG_ON("unimplemented"); + return 0; +} + +/* + * On RISCV, all the RAM is currently direct mapped in Xen. + * Hence return always true. + */ +static inline bool arch_mfns_in_directmap(unsigned long mfn, unsigned long nr) +{ + return true; +} + +#define PG_shift(idx) (BITS_PER_LONG - (idx)) +#define PG_mask(x, idx) (x ## UL << PG_shift(idx)) + +#define PGT_none PG_mask(0, 1) /* no special uses of this page */ +#define PGT_writable_page PG_mask(1, 1) /* has writable mappings? */ +#define PGT_type_mask PG_mask(1, 1) /* Bits 31 or 63. */ + +/* Count of uses of this frame as its current type. */ +#define PGT_count_width PG_shift(2) +#define PGT_count_mask ((1UL << PGT_count_width) - 1) + +/* + * Page needs to be scrubbed. Since this bit can only be set on a page that is + * free (i.e. in PGC_state_free) we can reuse PGC_allocated bit. + */ +#define _PGC_need_scrub _PGC_allocated +#define PGC_need_scrub PGC_allocated + +/* Cleared when the owning guest 'frees' this page. */ +#define _PGC_allocated PG_shift(1) +#define PGC_allocated PG_mask(1, 1) +/* Page is Xen heap? */ +#define _PGC_xen_heap PG_shift(2) +#define PGC_xen_heap PG_mask(1, 2) +/* Page is broken? */ +#define _PGC_broken PG_shift(7) +#define PGC_broken PG_mask(1, 7) +/* Mutually-exclusive page states: { inuse, offlining, offlined, free }. */ +#define PGC_state PG_mask(3, 9) +#define PGC_state_inuse PG_mask(0, 9) +#define PGC_state_offlining PG_mask(1, 9) +#define PGC_state_offlined PG_mask(2, 9) +#define PGC_state_free PG_mask(3, 9) +#define page_state_is(pg, st) (((pg)->count_info&PGC_state) == PGC_state_##st) + +/* Count of references to this frame. */ +#define PGC_count_width PG_shift(9) +#define PGC_count_mask ((1UL << PGC_count_width) - 1) + +#define _PGC_extra PG_shift(10) +#define PGC_extra PG_mask(1, 10) + +#define is_xen_heap_page(page) ((page)->count_info & PGC_xen_heap) +#define is_xen_heap_mfn(mfn) \ + (mfn_valid(mfn) && is_xen_heap_page(mfn_to_page(mfn))) + +#define is_xen_fixed_mfn(mfn) \ + ((mfn_to_maddr(mfn) >= virt_to_maddr((vaddr_t)_start)) && \ + (mfn_to_maddr(mfn) <= virt_to_maddr((vaddr_t)_end - 1))) + +#define page_get_owner(p) (p)->v.inuse.domain +#define page_set_owner(p, d) ((p)->v.inuse.domain = (d)) + +/* TODO: implement */ +#define mfn_valid(mfn) ({ (void)(mfn); 0; }) + +#define domain_set_alloc_bitsize(d) ((void)(d)) +#define domain_clamp_alloc_bitsize(d, b) ((void)(d), (b)) + +#define PFN_ORDER(pg) ((pg)->v.free.order) + extern unsigned char cpu0_boot_stack[]; void setup_initial_pagetables(void); @@ -20,4 +255,9 @@ unsigned long calc_phys_offset(void); void turn_on_mmu(unsigned long ra); +static inline unsigned int arch_get_dma_bitsize(void) +{ + return 32; /* TODO */ +} + #endif /* _ASM_RISCV_MM_H */ diff --git a/xen/arch/riscv/mm.c b/xen/arch/riscv/mm.c index 053f043a3d..fe3a43be20 100644 --- a/xen/arch/riscv/mm.c +++ b/xen/arch/riscv/mm.c @@ -5,12 +5,12 @@ #include <xen/init.h> #include <xen/kernel.h> #include <xen/macros.h> +#include <xen/mm.h> #include <xen/pfn.h> #include <asm/early_printk.h> #include <asm/csr.h> #include <asm/current.h> -#include <asm/mm.h> #include <asm/page.h> #include <asm/processor.h> diff --git a/xen/arch/riscv/setup.c b/xen/arch/riscv/setup.c index 6593f601c1..98a94c4c48 100644 --- a/xen/arch/riscv/setup.c +++ b/xen/arch/riscv/setup.c @@ -2,9 +2,9 @@ #include <xen/compile.h> #include <xen/init.h> +#include <xen/mm.h> #include <asm/early_printk.h> -#include <asm/mm.h> /* Xen stack for bringing up the first CPU. */ unsigned char __initdata cpu0_boot_stack[STACK_SIZE]