[RFC] xen/memory: Introduce a hypercall to provide unallocated space

Message ID 1627489110-25633-1-git-send-email-olekstysh@gmail.com (mailing list archive)
State New, archived
Series [RFC] xen/memory: Introduce a hypercall to provide unallocated space

Commit Message

Oleksandr Tyshchenko July 28, 2021, 4:18 p.m. UTC
From: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>

Add the XENMEM_get_unallocated_space hypercall, whose purpose is to
query the hypervisor for regions of guest physical address space
which are unused and can be used to create grant/foreign mappings,
instead of wasting real pages from the domain memory for
establishing these mappings. The problem with the current Linux
behaviour on Xen on Arm is that if we want to map some guest memory
regions in advance, or to cache mappings in the backend,
we might run out of memory in the host (see XSA-300).
This, of course, depends on both the host and guest memory sizes.

The "unallocated space" can't be figured out precisely by
the domain on Arm without hypervisor involvement:
- not all device I/O regions are known by the time the domain starts
  creating grant/foreign mappings
- Dom0 is not aware of the memory regions used for the identity
  mappings needed for the PV drivers to work
In both cases we might end up re-using these regions by
mistake. So the hypervisor, which maintains the P2M for the domain,
is in the best position to provide the "unallocated space".

The arch code is in charge of finding these regions and filling
in the corresponding array in the new helper arch_get_unallocated_space().

This patch implements the common and Arm parts; the x86-specific bits
are left unimplemented for now.

Signed-off-by: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
---
Current patch is based on the latest staging branch:
73c932d tools/libxc: use uint32_t for pirq in xc_domain_irq_permission
and also available at:
https://github.com/otyshchenko1/xen/commits/map_opt_ml1

The corresponding Linux changes can be found at:
https://github.com/otyshchenko1/linux/commits/map_opt_ml1
I am going to publish them soon.
---
 tools/flask/policy/modules/dom0.te  |  2 +-
 xen/arch/arm/mm.c                   | 18 ++++++++++++
 xen/common/memory.c                 | 56 +++++++++++++++++++++++++++++++++++++
 xen/include/asm-arm/mm.h            |  4 +++
 xen/include/asm-x86/mm.h            |  8 ++++++
 xen/include/public/memory.h         | 37 +++++++++++++++++++++++-
 xen/include/xsm/dummy.h             |  6 ++++
 xen/include/xsm/xsm.h               |  6 ++++
 xen/xsm/dummy.c                     |  1 +
 xen/xsm/flask/hooks.c               |  6 ++++
 xen/xsm/flask/policy/access_vectors |  2 ++
 11 files changed, 144 insertions(+), 2 deletions(-)
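
For orientation, the guest-visible interface added to xen/include/public/memory.h
presumably has roughly the following shape; this is a sketch inferred from the
names visible in the diffstat and in the Arm hunk quoted in the replies below
(xen_unallocated_region, start_gpfn, nr_regions), so the exact field names,
types and sub-op number are assumptions rather than the literal patch contents:

    /* Sketch only -- plain stdint types used instead of the Xen typedefs. */
    #include <stdint.h>

    struct xen_unallocated_region {
        uint64_t start_gpfn;   /* first GPFN of an unused region */
        uint64_t nr_gpfns;     /* length of the region in GPFNs (assumed name) */
    };

    struct xen_get_unallocated_space {
        uint16_t domid;        /* IN: domain to query */
        uint16_t pad;
        uint32_t nr_regions;   /* IN: buffer capacity, OUT: regions filled in */
        /*
         * OUT: array of xen_unallocated_region entries; in the real header
         * this would be a XEN_GUEST_HANDLE, accompanied by a new XENMEM_*
         * sub-op number.
         */
        struct xen_unallocated_region *buffer;
    };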

Comments

Oleksandr Tyshchenko July 28, 2021, 5:06 p.m. UTC | #1
Hello all


On 28.07.21 19:18, Oleksandr Tyshchenko wrote:
> From: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
>
> Add XENMEM_get_unallocated_space hypercall which purpose is to
> query hypervisor to find regions of guest physical address space
> which are unused and can be used to create grant/foreign mappings
> instead of wasting real pages from the domain memory for
> establishing these mappings. The problem with the current Linux
> on Xen on Arm behaviour is if we want to map some guest memory
> regions in advance or to perform cache mappings in the backend
> we might run out of memory in the host (see XSA-300).
> This of course, depends on the both host and guest memory sizes.
>
> The "unallocated space" can't be figured out precisely by
> the domain on Arm without hypervisor involvement:
> - not all device I/O regions are known by the time domain starts
>    creating grant/foreign mappings
> - the Dom0 is not aware of memory regions used for the identity
>    mappings needed for the PV drivers to work
> In both cases we might end up re-using these regions by
> a mistake. So, the hypervisor which maintains the P2M for the domain
> is in the best position to provide "unallocated space".
>
> The arch code is in charge of finding these regions and filling
> in corresponding array in new helper arch_get_unallocated_space().
>
> This patch implements common and Arm parts, the x86 specific bits
> are left uniplemented for now.
>
> Signed-off-by: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
> ---
> Current patch is based on the latest staging branch:
> 73c932d tools/libxc: use uint32_t for pirq in xc_domain_irq_permission
> and also available at:
> https://github.com/otyshchenko1/xen/commits/map_opt_ml1
>
> The corresponding Linux changes you can find at:
> https://github.com/otyshchenko1/linux/commits/map_opt_ml1
> I am going to publish them soon.

The Linux changes are already pushed. You can find them at:

1. Small prereq patch

https://lore.kernel.org/lkml/1627490656-1267-1-git-send-email-olekstysh@gmail.com/

2. Main patch

https://lore.kernel.org/lkml/1627490656-1267-2-git-send-email-olekstysh@gmail.com/
Andrew Cooper July 28, 2021, 5:19 p.m. UTC | #2
On 28/07/2021 17:18, Oleksandr Tyshchenko wrote:
> From: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
>
> Add XENMEM_get_unallocated_space hypercall which purpose is to
> query hypervisor to find regions of guest physical address space
> which are unused and can be used to create grant/foreign mappings
> instead of wasting real pages from the domain memory for
> establishing these mappings. The problem with the current Linux
> on Xen on Arm behaviour is if we want to map some guest memory
> regions in advance or to perform cache mappings in the backend
> we might run out of memory in the host (see XSA-300).
> This of course, depends on the both host and guest memory sizes.
>
> The "unallocated space" can't be figured out precisely by
> the domain on Arm without hypervisor involvement:
> - not all device I/O regions are known by the time domain starts
>   creating grant/foreign mappings
> - the Dom0 is not aware of memory regions used for the identity
>   mappings needed for the PV drivers to work
> In both cases we might end up re-using these regions by
> a mistake. So, the hypervisor which maintains the P2M for the domain
> is in the best position to provide "unallocated space".

I'm afraid this does not improve the situation.

If a guest follows the advice from XENMEM_get_unallocated_space, and
subsequently a new IO or identity region appears, everything will
explode, because the "safe area" wasn't actually safe.

The safe range *must* be chosen by the toolstack, because nothing else
can do it safely or correctly.

Once a safe range (or ranges) has been chosen, any subsequent action
which overlaps with the ranges must be rejected, as it will violate the
guarantees provided.

Furthermore, the ranges should be made available to the guest via normal
memory map means.  On x86, this is via the E820 table, and on ARM I
presume the DTB.  There is no need for a new hypercall.

>
> The arch code is in charge of finding these regions and filling
> in corresponding array in new helper arch_get_unallocated_space().
>
> This patch implements common and Arm parts, the x86 specific bits
> are left uniplemented for now.

Leaving x86 for now is fine by me.

> diff --git a/xen/arch/arm/mm.c b/xen/arch/arm/mm.c
> index 0e07335..8a70fe0 100644
> --- a/xen/arch/arm/mm.c
> +++ b/xen/arch/arm/mm.c
> @@ -1635,6 +1635,24 @@ unsigned long get_upper_mfn_bound(void)
>      return max_page - 1;
>  }
>  
> +#define GPFN_ALIGN_TO_GB(x) (((x) + (1UL << 18) - 1) & (~((1UL << 18) - 1)))
> +
> +int arch_get_unallocated_space(struct domain *d,
> +                               struct xen_unallocated_region *regions,
> +                               unsigned int *nr_regions)
> +{
> +    /*
> +     * Everything not mapped into P2M could be treated as unused. Taking into
> +     * the account that we have a lot of unallocated space above max_mapped_gfn
> +     * and in order to simplify things, just provide a single 8GB memory
> +     * region aligned to 1GB boundary for now.
> +     */
> +    regions->start_gpfn = GPFN_ALIGN_TO_GB(domain_get_maximum_gpfn(d) + 1);

To drive home my earlier point, this is racy and cannot be fixed in Xen.

You don't (and cannot) hold the p2m lock until the guest has acted upon
the information provided, so it is guaranteed stale by the time the
guest can inspect the results.  In the concurrent meantime, a wide range
of operations can change max_gpfn.

The only way to get a value the guest can use is to choose it beforehand,
and have Xen enforce that the nominated range(s) remain safe.
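
To illustrate the "choose beforehand and enforce" idea (this is not code from
the patch or from Xen; the structure and helper names here are hypothetical),
every later operation that adds entries to the P2M would have to be gated by
an overlap check of this shape:

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical per-domain record of the nominated safe range, in GPFNs. */
    struct safe_range {
        uint64_t start;
        uint64_t size;
    };

    /*
     * Reject any new mapping [gpfn, gpfn + nr) that overlaps the safe range,
     * so the guarantee given to the guest at boot cannot be violated later.
     */
    static bool overlaps_safe_range(const struct safe_range *r,
                                    uint64_t gpfn, uint64_t nr)
    {
        return gpfn < r->start + r->size && gpfn + nr > r->start;
    }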

~Andrew
Julien Grall July 28, 2021, 5:27 p.m. UTC | #3
Hi Andrew,

On 28/07/2021 18:19, Andrew Cooper wrote:
> On 28/07/2021 17:18, Oleksandr Tyshchenko wrote:
>> From: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
>>
>> Add XENMEM_get_unallocated_space hypercall which purpose is to
>> query hypervisor to find regions of guest physical address space
>> which are unused and can be used to create grant/foreign mappings
>> instead of wasting real pages from the domain memory for
>> establishing these mappings. The problem with the current Linux
>> on Xen on Arm behaviour is if we want to map some guest memory
>> regions in advance or to perform cache mappings in the backend
>> we might run out of memory in the host (see XSA-300).
>> This of course, depends on the both host and guest memory sizes.
>>
>> The "unallocated space" can't be figured out precisely by
>> the domain on Arm without hypervisor involvement:
>> - not all device I/O regions are known by the time domain starts
>>    creating grant/foreign mappings
>> - the Dom0 is not aware of memory regions used for the identity
>>    mappings needed for the PV drivers to work
>> In both cases we might end up re-using these regions by
>> a mistake. So, the hypervisor which maintains the P2M for the domain
>> is in the best position to provide "unallocated space".
> 
> I'm afraid this does not improve the situation.
> 
> If a guest follows the advice from XENMEM_get_unallocated_space, and
> subsequently a new IO or identity region appears, everything will
> explode, because the "safe area" wasn't actually safe.
> 
> The safe range *must* be chosen by the toolstack, because nothing else
> can do it safely or correctly.

The problem is: how do you size it? In particular, a backend may map 
the same page multiple times (for instance if the page is granted twice).

> 
> Once a safe range (or ranges) has been chosen, any subsequent action
> which overlaps with the ranges must be rejected, as it will violate the
> guarantees provided.
> 
> Furthermore, the ranges should be made available to the guest via normal
> memory map means.  On x86, this is via the E820 table, and on ARM I
> presume the DTB.  There is no need for a new hypercall.

Device-Tree only works if you have a guest using it. How about ACPI?

To me the hypercall solution at least:
   1) Avoids having to define the region in every single firmware table
   2) Allows the range to be easily extended after the guest has started

Cheers,
Andrew Cooper July 28, 2021, 7 p.m. UTC | #4
On 28/07/2021 18:27, Julien Grall wrote:
> Hi Andrew,
>
> On 28/07/2021 18:19, Andrew Cooper wrote:
>> On 28/07/2021 17:18, Oleksandr Tyshchenko wrote:
>>> From: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
>>>
>>> Add XENMEM_get_unallocated_space hypercall which purpose is to
>>> query hypervisor to find regions of guest physical address space
>>> which are unused and can be used to create grant/foreign mappings
>>> instead of wasting real pages from the domain memory for
>>> establishing these mappings. The problem with the current Linux
>>> on Xen on Arm behaviour is if we want to map some guest memory
>>> regions in advance or to perform cache mappings in the backend
>>> we might run out of memory in the host (see XSA-300).
>>> This of course, depends on the both host and guest memory sizes.
>>>
>>> The "unallocated space" can't be figured out precisely by
>>> the domain on Arm without hypervisor involvement:
>>> - not all device I/O regions are known by the time domain starts
>>>    creating grant/foreign mappings
>>> - the Dom0 is not aware of memory regions used for the identity
>>>    mappings needed for the PV drivers to work
>>> In both cases we might end up re-using these regions by
>>> a mistake. So, the hypervisor which maintains the P2M for the domain
>>> is in the best position to provide "unallocated space".
>>
>> I'm afraid this does not improve the situation.
>>
>> If a guest follows the advice from XENMEM_get_unallocated_space, and
>> subsequently a new IO or identity region appears, everything will
>> explode, because the "safe area" wasn't actually safe.
>>
>> The safe range *must* be chosen by the toolstack, because nothing else
>> can do it safely or correctly.
>
> The problem is how do you size it? In particular, a backend may map
> multiple time the same page (for instance if the page is granted twice).

The number of mapped grants is limited by the size of the maptrack table
in Xen, which is a toolstack input to the domaincreate hypercall. 
Therefore, the amount of space required is known and bounded.

There are a handful of other frames required in the current ABI (shared
info, vcpu info, etc).

The areas where things do become fuzzy are things like foreign mappings,
acquire_resource, etc. for the control domain, which are effectively
unbounded from the domain's point of view.

For those, it's entirely fine to say "here is 128G of safe mapping space" or
so.  Even the quantity of mappings dom0 can make is limited by the shadow
memory pool and the number of pagetables Xen is willing to expend on the
second-stage translation tables.
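
To make the "known and bounded" part concrete, here is a back-of-the-envelope
sizing sketch; the maptrack frame count and the entries-per-frame figure are
illustrative assumptions, not Xen's actual constants:

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint64_t max_maptrack_frames = 1024;   /* illustrative toolstack input */
        uint64_t entries_per_frame = 4096 / 8; /* assumed 8-byte maptrack entries */
        /* Each tracked grant mapping occupies at most one 4K guest frame. */
        uint64_t grant_space = max_maptrack_frames * entries_per_frame * 4096;

        printf("worst-case grant mapping space: %llu MiB\n",
               (unsigned long long)(grant_space >> 20));
        return 0;
    }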

>
>>
>> Once a safe range (or ranges) has been chosen, any subsequent action
>> which overlaps with the ranges must be rejected, as it will violate the
>> guarantees provided.
>>
>> Furthermore, the ranges should be made available to the guest via normal
>> memory map means.  On x86, this is via the E820 table, and on ARM I
>> presume the DTB.  There is no need for a new hypercall.
>
> Device-Tree only works if you have a guest using it. How about ACPI?

ACPI inherits E820 from x86 (it's a trivial format), and UEFI was also
based on it.

But whichever...  All firmware interfaces have a memory map.

> To me the hypercall solution at least:
>   1) Avoid to have to define the region on every single firmware table

There is only ever one.

>   2) Allow to easily extend after the guest run

The safe ranges can't be changed (safely).  This is the same problem as
needing to know things like your PCI apertures ahead of time, or where
the DIMM hotplug regions are.

Having the guest physmap be actually dynamic is the cause of so many
bugs (inc. security) and misfeatures in Xen.  Guests cannot and do not
cope with things being fully dynamic, because that's not how real
hardware works.  What you get is layers and layers of breakage on top of
each other, rather than a working system.

The size of the mapping space is a limit, just like maxphysaddr, or the PCI
apertures, or MMCFG space, etc.  You can make it large by default (as it
doesn't consume resources when not being used), but no guest OS is going
to tolerate it morphing at runtime.

~Andrew
Julien Grall July 28, 2021, 7:53 p.m. UTC | #5
On 28/07/2021 20:00, Andrew Cooper wrote:
> On 28/07/2021 18:27, Julien Grall wrote:
>> Hi Andrew,
>>
>> On 28/07/2021 18:19, Andrew Cooper wrote:
>>> On 28/07/2021 17:18, Oleksandr Tyshchenko wrote:
>>>> From: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
>>>>
>>>> Add XENMEM_get_unallocated_space hypercall which purpose is to
>>>> query hypervisor to find regions of guest physical address space
>>>> which are unused and can be used to create grant/foreign mappings
>>>> instead of wasting real pages from the domain memory for
>>>> establishing these mappings. The problem with the current Linux
>>>> on Xen on Arm behaviour is if we want to map some guest memory
>>>> regions in advance or to perform cache mappings in the backend
>>>> we might run out of memory in the host (see XSA-300).
>>>> This of course, depends on the both host and guest memory sizes.
>>>>
>>>> The "unallocated space" can't be figured out precisely by
>>>> the domain on Arm without hypervisor involvement:
>>>> - not all device I/O regions are known by the time domain starts
>>>>     creating grant/foreign mappings
>>>> - the Dom0 is not aware of memory regions used for the identity
>>>>     mappings needed for the PV drivers to work
>>>> In both cases we might end up re-using these regions by
>>>> a mistake. So, the hypervisor which maintains the P2M for the domain
>>>> is in the best position to provide "unallocated space".
>>>
>>> I'm afraid this does not improve the situation.
>>>
>>> If a guest follows the advice from XENMEM_get_unallocated_space, and
>>> subsequently a new IO or identity region appears, everything will
>>> explode, because the "safe area" wasn't actually safe.
>>>
>>> The safe range *must* be chosen by the toolstack, because nothing else
>>> can do it safely or correctly.
>>
>> The problem is how do you size it? In particular, a backend may map
>> multiple time the same page (for instance if the page is granted twice).
> 
> The number of mapped grants is limited by the size of the maptrack table
> in Xen, which is a toolstack input to the domaincreate hypercall.
> Therefore, the amount of space required is known and bounded.
> 
> There are a handful of other frames required in the current ABI (shared
> info, vcpu info, etc).
> 
> The areas where things do become fuzzy is things like foreign mappings,
> acquire_resource, etc for the control domain, which are effectively
> unbounded from the domain's point of view.
> 
> For those, its entirely fine to say "here 128G of safe mapping space" or
> so.  Even the quantity of mapping dom0 can make is limited by the shadow
> memory pool and the number of pagetables Xen is willing to expend on the
> second stage translation tables.

FWIW, on Arm, we don't have shadow memory pool.

Anyway, it should be possible to give a 128GB "safe range" and let Xen 
deal with it.

> 
>>
>>>
>>> Once a safe range (or ranges) has been chosen, any subsequent action
>>> which overlaps with the ranges must be rejected, as it will violate the
>>> guarantees provided.
>>>
>>> Furthermore, the ranges should be made available to the guest via normal
>>> memory map means.  On x86, this is via the E820 table, and on ARM I
>>> presume the DTB.  There is no need for a new hypercall.
>>
>> Device-Tree only works if you have a guest using it. How about ACPI?
> 
> ACPI inherits E820 from x86 (its a trivial format), and UEFI was also
> based on it.
> 
> But whichever...  All firmware interfaces have a memory map.

This will be the UEFI memory map. However, I am a bit confused about how we 
can tell the OS that the region will be used for grant/foreign mappings. Is 
it possible to reserve a new type?

> 
>> To me the hypercall solution at least:
>>    1) Avoid to have to define the region on every single firmware table
> 
> There is only ever one.

Why? I could foresee an interest in using the host memory map, and therefore 
we may need to find a few holes to use as safe regions.

> 
>>    2) Allow to easily extend after the guest run
> 
> The safe ranges can't be changed (safely).  This is the same problem as
> needing to know things like your PCI apertures ahead of time, or where
> the DIMM hotplug regions are.
> 
> Having the guest physmap be actually dynamic is the cause of so many
> bugs (inc security) and misfeatures in Xen.  Guests cannot and do no
> cope with things being fully dynamic, because that's not how real
> hardware works.  What you get is layers and layers of breakage on top of
> each other, rather than a working system.

I would not call it "fully dynamic". Xen can easily know whether a 
region has ever been allocated before. As long as the region has never been 
allocated, it should be fine. In fact...

> 
> The size of mapping space is a limit, just like maxphysaddr, or the PCI
> apertures, or MMCFG space, etc.  You can make it large by default (as it
> doesn't consume resource when not being used), but any guest OS isn't
> going to tolerate it morphing during runtime.
... I believe the OS may not be aware of the hotplug region until it is 
actually used.

Anyway, I suggested this approach to Oleksandr a few months ago (see 
[1]), which BTW you were CCed on ;). My take was that Xen knows about the 
unallocated space and therefore can make an informed decision without 
having to allocate an insanely large region.

Now, if you think that's fine (IIRC Stefano had a preference for that as 
well), then we can use the firmware table (assuming we can solve the 
ACPI question).

At the end of the day, this is not really the interesting bit of the 
problem. What matters is the OS part, where hopefully Linux will be able 
to use the RAM normally. We may even be able to fix XSA-300!

Cheers,

[1] <YJ3jlGSxs60Io+dp@mattapan.m5p.com>
Oleksandr Tyshchenko July 30, 2021, 4:13 p.m. UTC | #6
Hi Andrew, Julien.


@Andrew, I think the arguments you provided in your first answer (why 
the proposed solution is potentially racy and not safe) are valid 
and reasonable.
Thanks for the feedback.


On 28.07.21 22:53, Julien Grall wrote:
>
>
> On 28/07/2021 20:00, Andrew Cooper wrote:
>> On 28/07/2021 18:27, Julien Grall wrote:
>>> Hi Andrew,
>>>
>>> On 28/07/2021 18:19, Andrew Cooper wrote:
>>>> On 28/07/2021 17:18, Oleksandr Tyshchenko wrote:
>>>>> From: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
>>>>>
>>>>> Add XENMEM_get_unallocated_space hypercall which purpose is to
>>>>> query hypervisor to find regions of guest physical address space
>>>>> which are unused and can be used to create grant/foreign mappings
>>>>> instead of wasting real pages from the domain memory for
>>>>> establishing these mappings. The problem with the current Linux
>>>>> on Xen on Arm behaviour is if we want to map some guest memory
>>>>> regions in advance or to perform cache mappings in the backend
>>>>> we might run out of memory in the host (see XSA-300).
>>>>> This of course, depends on the both host and guest memory sizes.
>>>>>
>>>>> The "unallocated space" can't be figured out precisely by
>>>>> the domain on Arm without hypervisor involvement:
>>>>> - not all device I/O regions are known by the time domain starts
>>>>>     creating grant/foreign mappings
>>>>> - the Dom0 is not aware of memory regions used for the identity
>>>>>     mappings needed for the PV drivers to work
>>>>> In both cases we might end up re-using these regions by
>>>>> a mistake. So, the hypervisor which maintains the P2M for the domain
>>>>> is in the best position to provide "unallocated space".
>>>>
>>>> I'm afraid this does not improve the situation.
>>>>
>>>> If a guest follows the advice from XENMEM_get_unallocated_space, and
>>>> subsequently a new IO or identity region appears, everything will
>>>> explode, because the "safe area" wasn't actually safe.
>>>>
>>>> The safe range *must* be chosen by the toolstack, because nothing else
>>>> can do it safely or correctly.
>>>
>>> The problem is how do you size it? In particular, a backend may map
>>> multiple time the same page (for instance if the page is granted 
>>> twice).
>>
>> The number of mapped grants is limited by the size of the maptrack table
>> in Xen, which is a toolstack input to the domaincreate hypercall.
>> Therefore, the amount of space required is known and bounded.
>>
>> There are a handful of other frames required in the current ABI (shared
>> info, vcpu info, etc).
>>
>> The areas where things do become fuzzy is things like foreign mappings,
>> acquire_resource, etc for the control domain, which are effectively
>> unbounded from the domain's point of view.
>>
>> For those, its entirely fine to say "here 128G of safe mapping space" or
>> so.  Even the quantity of mapping dom0 can make is limited by the shadow
>> memory pool and the number of pagetables Xen is willing to expend on the
>> second stage translation tables.
>
> FWIW, on Arm, we don't have shadow memory pool.
>
> Anyway, it should be possible to give a 128GB "safe range" and let Xen 
> deal with it.
>
>>
>>>
>>>>
>>>> Once a safe range (or ranges) has been chosen, any subsequent action
>>>> which overlaps with the ranges must be rejected, as it will violate 
>>>> the
>>>> guarantees provided.
>>>>
>>>> Furthermore, the ranges should be made available to the guest via 
>>>> normal
>>>> memory map means.  On x86, this is via the E820 table, and on ARM I
>>>> presume the DTB.  There is no need for a new hypercall.
>>>
>>> Device-Tree only works if you have a guest using it. How about ACPI?
>>
>> ACPI inherits E820 from x86 (its a trivial format), and UEFI was also
>> based on it.
>>
>> But whichever...  All firmware interfaces have a memory map.
>
> This will be UEFI memory map. However, I am a bit confused how we can 
> tell the OS the region will be used for grant/foreign mapping. Is it 
> possible to reserved a new type?
>
>>
>>> To me the hypercall solution at least:
>>>    1) Avoid to have to define the region on every single firmware table
>>
>> There is only ever one.
>
> Why? I could forsee an interest to use the host memory map and 
> therefore we may need to find a few holes for safe regions to use.
>
>>
>>>    2) Allow to easily extend after the guest run
>>
>> The safe ranges can't be changed (safely).  This is the same problem as
>> needing to know things like your PCI apertures ahead of time, or where
>> the DIMM hotplug regions are.
>>
>> Having the guest physmap be actually dynamic is the cause of so many
>> bugs (inc security) and misfeatures in Xen.  Guests cannot and do no
>> cope with things being fully dynamic, because that's not how real
>> hardware works.  What you get is layers and layers of breakage on top of
>> each other, rather than a working system.
>
> I would not call it "fully dynamic". Xen can easily know whether a 
> region has ever be allocated before. So long the region has never been 
> allocated, then it should be fine. In fact...
>
>>
>> The size of mapping space is a limit, just like maxphysaddr, or the PCI
>> apertures, or MMCFG space, etc.  You can make it large by default (as it
>> doesn't consume resource when not being used), but any guest OS isn't
>> going to tolerate it morphing during runtime.
> ... I believe the OS may be not aware of the hotplug region until it 
> is actually used.
>
> Anyway, I suggested this approach a few months ago to Oleksandr (see 
> [1]) which BTW you were CCed on ;). My take was that Xen knows about 
> the unallocated space and therefore can make an informed decision 
> without having to allocate insanely large region.
>
> Now if you think that's fine (IIRC Stefano had a preference for that 
> as well). Then we can use the firmware table (assuming we can solve 
> the ACPI question).


Well, if the new hypercall, and what is more the "query the hypervisor at 
runtime to find unused space" model itself, is not welcome, I am OK with 
that; let's try to create a working system.
May we please find common ground to move this forward (at least on Arm 
for now; the foreign mapping is the most important question).

I got the proposed idea in general, but I haven't connected all the dots 
yet; some points need clarification.

1. The safe range must be defined/allocated in advance and must remain 
constant at runtime. The safe range must be chosen by the toolstack.
[For the initial implementation we can start with some large value 
(128GB) as discussed above.]

Questions:

- Do we need to inform Xen about that range (via the domain_create hypercall, etc.)?
- What will be in charge of guaranteeing the safety of that range at runtime (rejecting new mapping requests with possible overlaps, etc.): Xen, the toolstack, or both?
- Where should that range be located in the guest address space? Should the range be the same for all domains (like GUEST_GNTTAB_BASE(SIZE), for example), or should it be calculated from the actual guest_ram_base(size) of a particular domain?
- What about a safe range that Dom0 can use itself? Xen should choose it for Dom0 the same way the toolstack chooses it for other domains, correct?

2. The safe range must be provided to the domain using a firmware table.
[We can start with the DTB and leave ACPI unimplemented for now, 
assuming we will be able to solve the open questions as discussed above.]

Questions:

- Do we need to distinguish between foreign and grant mappings on the domain side at all? Can the same safe range be used for all types of mapping?
- How will the domain recognize which range can be used for foreign/grant mappings? Do we need to introduce new device-tree bindings for describing the range, or is it possible to re-use the current bindings (expand the "reg" property under the "hypervisor" node, which currently contains the gnttab region)?


>
> At the end of the day, this is not really the interesting bit of the 
> problem. What matter if the OS part where hopefully Linux will be able 
> to use normally the RAM. We may even be able to fix XSA-300!
>
> Cheers,
>
> [1] <YJ3jlGSxs60Io+dp@mattapan.m5p.com>
>
Stefano Stabellini July 30, 2021, 11:57 p.m. UTC | #7
On Fri, 30 Jul 2021, Oleksandr wrote:
> Hi Andrew, Julien.
> 
> 
> @Andrew, I think that arguments you provided in your first answer (why the
> proposed solution is a potentially racy and not a safe) are valid and
> reasonable.
> Thanks for the feedback.
> 
> 
> On 28.07.21 22:53, Julien Grall wrote:
> > 
> > 
> > On 28/07/2021 20:00, Andrew Cooper wrote:
> > > On 28/07/2021 18:27, Julien Grall wrote:
> > > > Hi Andrew,
> > > > 
> > > > On 28/07/2021 18:19, Andrew Cooper wrote:
> > > > > On 28/07/2021 17:18, Oleksandr Tyshchenko wrote:
> > > > > > From: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
> > > > > > 
> > > > > > Add XENMEM_get_unallocated_space hypercall which purpose is to
> > > > > > query hypervisor to find regions of guest physical address space
> > > > > > which are unused and can be used to create grant/foreign mappings
> > > > > > instead of wasting real pages from the domain memory for
> > > > > > establishing these mappings. The problem with the current Linux
> > > > > > on Xen on Arm behaviour is if we want to map some guest memory
> > > > > > regions in advance or to perform cache mappings in the backend
> > > > > > we might run out of memory in the host (see XSA-300).
> > > > > > This of course, depends on the both host and guest memory sizes.
> > > > > > 
> > > > > > The "unallocated space" can't be figured out precisely by
> > > > > > the domain on Arm without hypervisor involvement:
> > > > > > - not all device I/O regions are known by the time domain starts
> > > > > >     creating grant/foreign mappings
> > > > > > - the Dom0 is not aware of memory regions used for the identity
> > > > > >     mappings needed for the PV drivers to work
> > > > > > In both cases we might end up re-using these regions by
> > > > > > a mistake. So, the hypervisor which maintains the P2M for the domain
> > > > > > is in the best position to provide "unallocated space".
> > > > > 
> > > > > I'm afraid this does not improve the situation.
> > > > > 
> > > > > If a guest follows the advice from XENMEM_get_unallocated_space, and
> > > > > subsequently a new IO or identity region appears, everything will
> > > > > explode, because the "safe area" wasn't actually safe.
> > > > > 
> > > > > The safe range *must* be chosen by the toolstack, because nothing else
> > > > > can do it safely or correctly.
> > > > 
> > > > The problem is how do you size it? In particular, a backend may map
> > > > multiple time the same page (for instance if the page is granted twice).
> > > 
> > > The number of mapped grants is limited by the size of the maptrack table
> > > in Xen, which is a toolstack input to the domaincreate hypercall.
> > > Therefore, the amount of space required is known and bounded.
> > > 
> > > There are a handful of other frames required in the current ABI (shared
> > > info, vcpu info, etc).
> > > 
> > > The areas where things do become fuzzy is things like foreign mappings,
> > > acquire_resource, etc for the control domain, which are effectively
> > > unbounded from the domain's point of view.
> > > 
> > > For those, its entirely fine to say "here 128G of safe mapping space" or
> > > so.  Even the quantity of mapping dom0 can make is limited by the shadow
> > > memory pool and the number of pagetables Xen is willing to expend on the
> > > second stage translation tables.
> > 
> > FWIW, on Arm, we don't have shadow memory pool.
> > 
> > Anyway, it should be possible to give a 128GB "safe range" and let Xen deal
> > with it.
> > 
> > > 
> > > > 
> > > > > 
> > > > > Once a safe range (or ranges) has been chosen, any subsequent action
> > > > > which overlaps with the ranges must be rejected, as it will violate
> > > > > the
> > > > > guarantees provided.
> > > > > 
> > > > > Furthermore, the ranges should be made available to the guest via
> > > > > normal
> > > > > memory map means.  On x86, this is via the E820 table, and on ARM I
> > > > > presume the DTB.  There is no need for a new hypercall.
> > > > 
> > > > Device-Tree only works if you have a guest using it. How about ACPI?
> > > 
> > > ACPI inherits E820 from x86 (its a trivial format), and UEFI was also
> > > based on it.
> > > 
> > > But whichever...  All firmware interfaces have a memory map.
> > 
> > This will be UEFI memory map. However, I am a bit confused how we can tell
> > the OS the region will be used for grant/foreign mapping. Is it possible to
> > reserved a new type?
> > 
> > > 
> > > > To me the hypercall solution at least:
> > > >    1) Avoid to have to define the region on every single firmware table
> > > 
> > > There is only ever one.
> > 
> > Why? I could forsee an interest to use the host memory map and therefore we
> > may need to find a few holes for safe regions to use.
> > 
> > > 
> > > >    2) Allow to easily extend after the guest run
> > > 
> > > The safe ranges can't be changed (safely).  This is the same problem as
> > > needing to know things like your PCI apertures ahead of time, or where
> > > the DIMM hotplug regions are.
> > > 
> > > Having the guest physmap be actually dynamic is the cause of so many
> > > bugs (inc security) and misfeatures in Xen.  Guests cannot and do no
> > > cope with things being fully dynamic, because that's not how real
> > > hardware works.  What you get is layers and layers of breakage on top of
> > > each other, rather than a working system.
> > 
> > I would not call it "fully dynamic". Xen can easily know whether a region
> > has ever be allocated before. So long the region has never been allocated,
> > then it should be fine. In fact...
> > 
> > > 
> > > The size of mapping space is a limit, just like maxphysaddr, or the PCI
> > > apertures, or MMCFG space, etc.  You can make it large by default (as it
> > > doesn't consume resource when not being used), but any guest OS isn't
> > > going to tolerate it morphing during runtime.
> > ... I believe the OS may be not aware of the hotplug region until it is
> > actually used.
> > 
> > Anyway, I suggested this approach a few months ago to Oleksandr (see [1])
> > which BTW you were CCed on ;). My take was that Xen knows about the
> > unallocated space and therefore can make an informed decision without having
> > to allocate insanely large region.
> > 
> > Now if you think that's fine (IIRC Stefano had a preference for that as
> > well). Then we can use the firmware table (assuming we can solve the ACPI
> > question).
> 
> 
> Well, if new hypercall and, what is more, "the querying hypervisor at runtime
> to find unused space" model itself is not welcome, I am ok, let's try to
> create a working system,
> may we please find a common ground to move this forward (at least on Arm for
> now, the foreign mapping is the most important question).
> 
> I got the proposed idea in general, but I haven't connected all dots yet, some
> points need clarification.
> 
> 1. The safe range must be defined/allocated in advance and must remain const
> during the runtime. The safe range must be chosen by the toolstack.
> [For the initial implementation we can start with some large value (128GB) as
> discussed above]
> 
> Questions:
> 
> - Do we need to inform Xen about that range (via domain_create hypercall,
> etc)?
> - What will be in charge of guaranteeing the safety of that range at runtime
> (reject new mapping requests with possible overlaps, etc), Xen, toolstack or
> both?
> - Where that range should be located in guest address space, should that range
> be the same for all domains (how GUEST_GNTTAB_BASE(SIZE) for example)
> or it should be calculated based on actual guest_ram_base(size) for a
> particular domain?
> - What about a safe range the Dom0 can use itself? Xen should choose it for
> Dom0 the same way how toolstack chooses it for other domains, correct?

Brainstorming an idea to answer some of these questions.

Xen needs to be able to choose a safe range without the toolstack's help
at least for dom0 and dom0less guests. Could we just have Xen choose the
safe range for all domains?

Could it be just as simple as choosing a range 1GB-aligned above
whatever is the highest address previously allocated for memory, MMIO,
whatever?

We could also have the toolstack provide the info but I wonder if we can
get away without it.
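
A minimal sketch of that rule, assuming hypothetical helpers for the highest
guest-physical address already used for RAM, MMIO and the like (none of these
names exist in Xen today):

    #include <stdint.h>

    #define GB(n)           ((uint64_t)(n) << 30)
    #define ALIGN_UP(x, a)  (((x) + (a) - 1) & ~((uint64_t)(a) - 1))

    /* Hypothetical: highest guest-physical address allocated so far. */
    uint64_t highest_allocated_gaddr(void);

    static uint64_t choose_safe_range_base(void)
    {
        return ALIGN_UP(highest_allocated_gaddr() + 1, GB(1));
    }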


> 2. The safe range must be provided to domain using the firmware table.
> [We can start with the DTB and leave ACPI unimplemented for now, assuming we
> will be able to solve open questions as discussed above]
>
> Questions:
> 
> - Do we need distinguish between foreign and grant mappings at the domain side
> at all? Can the same safe range be used for all types of mapping?

This is a good question. I can't think of a reason why we would need to
distinguish between the two from Xen's point of view. Would it make
the Linux implementation easier if we distinguished them?


> - How will domain recognize what range can be used for foreign/grant mappings?
> Do we need to introduce new device-tree bindings for describing the range
> or it is possible to re-use current bindings (expand the "reg" property under
> "hypervisor" node, currently it contains gnttab region)?

Do we need a struct page* for the range? We needed it in the past and it
is the reason why we currently use ballooned-out pages: they have a
proper struct page* associated with them. pte_special() caused problems.

So, continuing with the assumption that we need a struct page*, the
range needs to be advertised as "memory" to the DomU (part of the
/memory node). But we can reserve the range for foreign/grant-mapping
use by adding a reserved-memory binding for it. If we go down this
route, we should add a new binding, as I don't think we should reuse
reserved-memory/xen,shared-memory.txt. It would be very simple to add;
just follow the model of xen,shared-memory.txt. (Note that just because
the range is part of the DomU /memory node, it doesn't mean it needs to
be backed by real memory.)

If we don't need a struct page* then we can do something different using
bus ranges and/or MMIO regions.
Oleksandr Tyshchenko Aug. 2, 2021, 7:12 p.m. UTC | #8
Hi Stefano,


Thank you for the comments and ideas.


On 31.07.21 02:57, Stefano Stabellini wrote:
> On Fri, 30 Jul 2021, Oleksandr wrote:
>> Hi Andrew, Julien.
>>
>>
>> @Andrew, I think that arguments you provided in your first answer (why the
>> proposed solution is a potentially racy and not a safe) are valid and
>> reasonable.
>> Thanks for the feedback.
>>
>>
>> On 28.07.21 22:53, Julien Grall wrote:
>>>
>>> On 28/07/2021 20:00, Andrew Cooper wrote:
>>>> On 28/07/2021 18:27, Julien Grall wrote:
>>>>> Hi Andrew,
>>>>>
>>>>> On 28/07/2021 18:19, Andrew Cooper wrote:
>>>>>> On 28/07/2021 17:18, Oleksandr Tyshchenko wrote:
>>>>>>> From: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
>>>>>>>
>>>>>>> Add XENMEM_get_unallocated_space hypercall which purpose is to
>>>>>>> query hypervisor to find regions of guest physical address space
>>>>>>> which are unused and can be used to create grant/foreign mappings
>>>>>>> instead of wasting real pages from the domain memory for
>>>>>>> establishing these mappings. The problem with the current Linux
>>>>>>> on Xen on Arm behaviour is if we want to map some guest memory
>>>>>>> regions in advance or to perform cache mappings in the backend
>>>>>>> we might run out of memory in the host (see XSA-300).
>>>>>>> This of course, depends on the both host and guest memory sizes.
>>>>>>>
>>>>>>> The "unallocated space" can't be figured out precisely by
>>>>>>> the domain on Arm without hypervisor involvement:
>>>>>>> - not all device I/O regions are known by the time domain starts
>>>>>>>      creating grant/foreign mappings
>>>>>>> - the Dom0 is not aware of memory regions used for the identity
>>>>>>>      mappings needed for the PV drivers to work
>>>>>>> In both cases we might end up re-using these regions by
>>>>>>> a mistake. So, the hypervisor which maintains the P2M for the domain
>>>>>>> is in the best position to provide "unallocated space".
>>>>>> I'm afraid this does not improve the situation.
>>>>>>
>>>>>> If a guest follows the advice from XENMEM_get_unallocated_space, and
>>>>>> subsequently a new IO or identity region appears, everything will
>>>>>> explode, because the "safe area" wasn't actually safe.
>>>>>>
>>>>>> The safe range *must* be chosen by the toolstack, because nothing else
>>>>>> can do it safely or correctly.
>>>>> The problem is how do you size it? In particular, a backend may map
>>>>> multiple time the same page (for instance if the page is granted twice).
>>>> The number of mapped grants is limited by the size of the maptrack table
>>>> in Xen, which is a toolstack input to the domaincreate hypercall.
>>>> Therefore, the amount of space required is known and bounded.
>>>>
>>>> There are a handful of other frames required in the current ABI (shared
>>>> info, vcpu info, etc).
>>>>
>>>> The areas where things do become fuzzy is things like foreign mappings,
>>>> acquire_resource, etc for the control domain, which are effectively
>>>> unbounded from the domain's point of view.
>>>>
>>>> For those, its entirely fine to say "here 128G of safe mapping space" or
>>>> so.  Even the quantity of mapping dom0 can make is limited by the shadow
>>>> memory pool and the number of pagetables Xen is willing to expend on the
>>>> second stage translation tables.
>>> FWIW, on Arm, we don't have shadow memory pool.
>>>
>>> Anyway, it should be possible to give a 128GB "safe range" and let Xen deal
>>> with it.
>>>
>>>>>> Once a safe range (or ranges) has been chosen, any subsequent action
>>>>>> which overlaps with the ranges must be rejected, as it will violate
>>>>>> the
>>>>>> guarantees provided.
>>>>>>
>>>>>> Furthermore, the ranges should be made available to the guest via
>>>>>> normal
>>>>>> memory map means.  On x86, this is via the E820 table, and on ARM I
>>>>>> presume the DTB.  There is no need for a new hypercall.
>>>>> Device-Tree only works if you have a guest using it. How about ACPI?
>>>> ACPI inherits E820 from x86 (its a trivial format), and UEFI was also
>>>> based on it.
>>>>
>>>> But whichever...  All firmware interfaces have a memory map.
>>> This will be UEFI memory map. However, I am a bit confused how we can tell
>>> the OS the region will be used for grant/foreign mapping. Is it possible to
>>> reserved a new type?
>>>
>>>>> To me the hypercall solution at least:
>>>>>     1) Avoid to have to define the region on every single firmware table
>>>> There is only ever one.
>>> Why? I could forsee an interest to use the host memory map and therefore we
>>> may need to find a few holes for safe regions to use.
>>>
>>>>>     2) Allow to easily extend after the guest run
>>>> The safe ranges can't be changed (safely).  This is the same problem as
>>>> needing to know things like your PCI apertures ahead of time, or where
>>>> the DIMM hotplug regions are.
>>>>
>>>> Having the guest physmap be actually dynamic is the cause of so many
>>>> bugs (inc security) and misfeatures in Xen.  Guests cannot and do no
>>>> cope with things being fully dynamic, because that's not how real
>>>> hardware works.  What you get is layers and layers of breakage on top of
>>>> each other, rather than a working system.
>>> I would not call it "fully dynamic". Xen can easily know whether a region
>>> has ever be allocated before. So long the region has never been allocated,
>>> then it should be fine. In fact...
>>>
>>>> The size of mapping space is a limit, just like maxphysaddr, or the PCI
>>>> apertures, or MMCFG space, etc.  You can make it large by default (as it
>>>> doesn't consume resource when not being used), but any guest OS isn't
>>>> going to tolerate it morphing during runtime.
>>> ... I believe the OS may be not aware of the hotplug region until it is
>>> actually used.
>>>
>>> Anyway, I suggested this approach a few months ago to Oleksandr (see [1])
>>> which BTW you were CCed on ;). My take was that Xen knows about the
>>> unallocated space and therefore can make an informed decision without having
>>> to allocate insanely large region.
>>>
>>> Now if you think that's fine (IIRC Stefano had a preference for that as
>>> well). Then we can use the firmware table (assuming we can solve the ACPI
>>> question).
>>
>> Well, if new hypercall and, what is more, "the querying hypervisor at runtime
>> to find unused space" model itself is not welcome, I am ok, let's try to
>> create a working system,
>> may we please find a common ground to move this forward (at least on Arm for
>> now, the foreign mapping is the most important question).
>>
>> I got the proposed idea in general, but I haven't connected all dots yet, some
>> points need clarification.
>>
>> 1. The safe range must be defined/allocated in advance and must remain const
>> during the runtime. The safe range must be chosen by the toolstack.
>> [For the initial implementation we can start with some large value (128GB) as
>> discussed above]
>>
>> Questions:
>>
>> - Do we need to inform Xen about that range (via domain_create hypercall,
>> etc)?
>> - What will be in charge of guaranteeing the safety of that range at runtime
>> (reject new mapping requests with possible overlaps, etc), Xen, toolstack or
>> both?
>> - Where that range should be located in guest address space, should that range
>> be the same for all domains (how GUEST_GNTTAB_BASE(SIZE) for example)
>> or it should be calculated based on actual guest_ram_base(size) for a
>> particular domain?
>> - What about a safe range the Dom0 can use itself? Xen should choose it for
>> Dom0 the same way how toolstack chooses it for other domains, correct?
> Brainstorming an idea to answer some of these questions.
>
> Xen needs to be able to choose a safe range without the toolstack's help
> at least for dom0 and dom0less guests.

Completely agree.


> Could we just have Xen choose the
> safe range for all domains?

As I understand it, the region info (base, size) needs to be known to whoever 
generates the firmware tables (DTB) for the domain, in order to insert that 
range.
So, if Xen chooses that range, the toolstack would have to query it anyway 
(either via a new domctl or by extending an existing one whose context 
would fit).
This adds some complexity.


>
> Could it be just as simple as choosing a range 1GB-aligned above
> whatever is the highest address previously allocated for memory, MMIO,
> whatever?
Probably we could choose a range above the maximum mapped gpfn at domain 
creation time.
Currently we only deal with platform devices on Arm and all I/O is known 
in advance, so no new I/O is expected at runtime.
But what do we do when hotplug comes into play (and new, conflicting I/O 
might just appear)? On the other hand, we might face the same problem 
if we used a predefined static range. With both options (a calculated or 
a predefined range) we would need to deny any new conflicting mapping.


>
> We could also have the toolstack provide the info but I wonder if we can
> get away without it.

I may not see all the pitfalls here, but it seems that having the same 
predefined range for all domains would simplify things.
On Arm we have the following:

#define GUEST_RAM1_BASE   xen_mk_ullong(0x0200000000) /* 1016GB of RAM @ 8GB */
#define GUEST_RAM1_SIZE   xen_mk_ullong(0xfe00000000)

Now I am wondering: can we reduce GUEST_RAM1_SIZE by 128GB, for example, 
in order to reserve a range at the end? Would this be acceptable?
So both Xen and the toolstack would see that definition and would be able 
to insert the range into the generated DTB, and Xen's P2M code would be in 
charge of keeping that range in a safe state.
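
Roughly, that idea could look like the following tweak to the guest memory
layout in the public arch-arm.h header; this is a sketch under the assumption
that 128GB is carved off the top of the existing GUEST_RAM1 bank, and the
GUEST_SAFE_RANGE_* names are invented here, not part of any existing header:

    /* Existing bank: RAM @ 8GB. */
    #define GUEST_RAM1_BASE        xen_mk_ullong(0x0200000000)

    /* Shrink RAM1 by 128GB and expose the tail as the "safe range". */
    #define GUEST_SAFE_RANGE_SIZE  xen_mk_ullong(0x2000000000) /* 128GB */
    #define GUEST_RAM1_SIZE        (xen_mk_ullong(0xfe00000000) - GUEST_SAFE_RANGE_SIZE)
    #define GUEST_SAFE_RANGE_BASE  (GUEST_RAM1_BASE + GUEST_RAM1_SIZE)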

>> 2. The safe range must be provided to domain using the firmware table.
>> [We can start with the DTB and leave ACPI unimplemented for now, assuming we
>> will be able to solve open questions as discussed above]
>>
>> Questions:
>>
>> - Do we need distinguish between foreign and grant mappings at the domain side
>> at all? Can the same safe range be used for all types of mapping?
> This is a good question. I can't think of a reason why we would need to
> distinguish between the two from Xen's point of view. Would it make
> the Linux implementation easier if we distinguished them?
I may be mistaken, but I don't think it would make the Linux implementation 
easier; both foreign and grant mappings result in the same action, which is 
xen_alloc_unpopulated_pages() (or alloc_xenballooned_pages()).
For both mappings the problem of wasting real domain memory on Arm is 
present at the moment, although with foreign mappings
for the device emulator use case it becomes more evident.



>
>
>> - How will domain recognize what range can be used for foreign/grant mappings?
>> Do we need to introduce new device-tree bindings for describing the range
>> or it is possible to re-use current bindings (expand the "reg" property under
>> "hypervisor" node, currently it contains gnttab region)?
> Do we need a struct page* for the range? We needed it in the past and it
> is the reason why we currenly use ballooned-out pages: they have a
> proper struct page* associated with them. pte_special() caused problems.
> So, continuing with the assumption that we need a struct page*, then the
> range needs to be advertised as "memory" to the DomU (part to of the
> /memory node). But we can reserve the range for foreign/grant-mapping
> use by adding a reserved-memory binding for it. If we go down this
> route, we should add a new binding as I don't think we should reuse
> reserved-memory/xen,shared-memory.txt. It would be very simply to add,
> just follow the model of xen,shared-memory.txt. (Note that just because
> the range is part of the DomU /memory node, it doesn't mean it needs to
> be backed by real memory.)
>
> If we don't need a struct page* then we can do something different using
> bus ranges and/or MMIO regions.

Yes, we need a struct page* for each pfn in the range. But I think 
this can be achieved even without describing the range as part of the 
memory/reserved-memory nodes. Fortunately, the Linux ZONE_DEVICE facility 
can help with that; in fact, "unpopulated-alloc" is based on this 
feature, please see drivers/xen/unpopulated-alloc.c:fill_list()
and 
https://lore.kernel.org/lkml/1627490656-1267-2-git-send-email-olekstysh@gmail.com/ 
for details.
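
For reference, the core of that approach in drivers/xen/unpopulated-alloc.c is
roughly the following shape (a simplified approximation written from memory,
not the exact upstream code): a chunk of guest-physical space is handed to
ZONE_DEVICE via memremap_pages(), which creates a struct page for every pfn
without any RAM backing the range.

    #include <linux/memremap.h>
    #include <linux/mm.h>
    #include <linux/slab.h>

    /* Simplified fill_list()-style population of unpopulated pages. */
    static int sketch_fill_unpopulated(phys_addr_t base, unsigned int nr_pages)
    {
        struct dev_pagemap *pgmap;
        void *vaddr;

        pgmap = kzalloc(sizeof(*pgmap), GFP_KERNEL);
        if (!pgmap)
            return -ENOMEM;

        pgmap->type = MEMORY_DEVICE_GENERIC;
        pgmap->range = (struct range) {
            .start = base,
            .end   = base + (u64)nr_pages * PAGE_SIZE - 1,
        };
        pgmap->nr_range = 1;

        /* Creates struct pages for the range without backing RAM. */
        vaddr = memremap_pages(pgmap, NUMA_NO_NODE);
        if (IS_ERR(vaddr)) {
            kfree(pgmap);
            return PTR_ERR(vaddr);
        }

        /* The new pages can then be handed out by xen_alloc_unpopulated_pages(). */
        return 0;
    }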
Jan Beulich Aug. 3, 2021, 12:49 p.m. UTC | #9
On 28.07.2021 21:53, Julien Grall wrote:
> On 28/07/2021 20:00, Andrew Cooper wrote:
>> On 28/07/2021 18:27, Julien Grall wrote:
>>> On 28/07/2021 18:19, Andrew Cooper wrote:
>>>> On 28/07/2021 17:18, Oleksandr Tyshchenko wrote:
>>>>> From: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
>>>>>
>>>>> Add XENMEM_get_unallocated_space hypercall which purpose is to
>>>>> query hypervisor to find regions of guest physical address space
>>>>> which are unused and can be used to create grant/foreign mappings
>>>>> instead of wasting real pages from the domain memory for
>>>>> establishing these mappings. The problem with the current Linux
>>>>> on Xen on Arm behaviour is if we want to map some guest memory
>>>>> regions in advance or to perform cache mappings in the backend
>>>>> we might run out of memory in the host (see XSA-300).
>>>>> This of course, depends on the both host and guest memory sizes.
>>>>>
>>>>> The "unallocated space" can't be figured out precisely by
>>>>> the domain on Arm without hypervisor involvement:
>>>>> - not all device I/O regions are known by the time domain starts
>>>>>     creating grant/foreign mappings
>>>>> - the Dom0 is not aware of memory regions used for the identity
>>>>>     mappings needed for the PV drivers to work
>>>>> In both cases we might end up re-using these regions by
>>>>> a mistake. So, the hypervisor which maintains the P2M for the domain
>>>>> is in the best position to provide "unallocated space".
>>>>
>>>> I'm afraid this does not improve the situation.
>>>>
>>>> If a guest follows the advice from XENMEM_get_unallocated_space, and
>>>> subsequently a new IO or identity region appears, everything will
>>>> explode, because the "safe area" wasn't actually safe.
>>>>
>>>> The safe range *must* be chosen by the toolstack, because nothing else
>>>> can do it safely or correctly.
>>>
>>> The problem is how do you size it? In particular, a backend may map
>>> multiple time the same page (for instance if the page is granted twice).
>>
>> The number of mapped grants is limited by the size of the maptrack table
>> in Xen, which is a toolstack input to the domaincreate hypercall.
>> Therefore, the amount of space required is known and bounded.
>>
>> There are a handful of other frames required in the current ABI (shared
>> info, vcpu info, etc).
>>
>> The areas where things do become fuzzy is things like foreign mappings,
>> acquire_resource, etc for the control domain, which are effectively
>> unbounded from the domain's point of view.
>>
>> For those, its entirely fine to say "here 128G of safe mapping space" or
>> so.  Even the quantity of mapping dom0 can make is limited by the shadow
>> memory pool and the number of pagetables Xen is willing to expend on the
>> second stage translation tables.
> 
> FWIW, on Arm, we don't have shadow memory pool.

Where do you take 2nd level page table memory from then?

>>>> Once a safe range (or ranges) has been chosen, any subsequent action
>>>> which overlaps with the ranges must be rejected, as it will violate the
>>>> guarantees provided.
>>>>
>>>> Furthermore, the ranges should be made available to the guest via normal
>>>> memory map means.  On x86, this is via the E820 table, and on ARM I
>>>> presume the DTB.  There is no need for a new hypercall.
>>>
>>> Device-Tree only works if you have a guest using it. How about ACPI?
>>
>> ACPI inherits E820 from x86 (its a trivial format), and UEFI was also
>> based on it.
>>
>> But whichever...  All firmware interfaces have a memory map.
> 
> This will be UEFI memory map. However, I am a bit confused how we can 
> tell the OS the region will be used for grant/foreign mapping. Is it 
> possible to reserved a new type?

As with just about any non-abandoned specification, it is in principle
possible to define/reserve new types. The question is how practical that is,
i.e. in particular how long it may take to get to the point where
we have a firmly reserved type. Short of this, I wonder whether you,
Andrew, were thinking of re-using an existing type (in which case the
question of disambiguation arises).

As a result I wonder whether a "middle" approach wouldn't be better:
Have the range be determined up front (by tool stack or Xen), but
communicate it to the guest by PV means (hypercall, shared info,
start info, or yet some other table).

Jan
Julien Grall Aug. 3, 2021, 12:53 p.m. UTC | #10
Hi Jan,

On 03/08/2021 13:49, Jan Beulich wrote:
> On 28.07.2021 21:53, Julien Grall wrote:
>> On 28/07/2021 20:00, Andrew Cooper wrote:
>>> On 28/07/2021 18:27, Julien Grall wrote:
>>>> On 28/07/2021 18:19, Andrew Cooper wrote:
>>>>> On 28/07/2021 17:18, Oleksandr Tyshchenko wrote:
>>>>>> From: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
>>>>>>
>>>>>> Add XENMEM_get_unallocated_space hypercall which purpose is to
>>>>>> query hypervisor to find regions of guest physical address space
>>>>>> which are unused and can be used to create grant/foreign mappings
>>>>>> instead of wasting real pages from the domain memory for
>>>>>> establishing these mappings. The problem with the current Linux
>>>>>> on Xen on Arm behaviour is if we want to map some guest memory
>>>>>> regions in advance or to perform cache mappings in the backend
>>>>>> we might run out of memory in the host (see XSA-300).
>>>>>> This of course, depends on the both host and guest memory sizes.
>>>>>>
>>>>>> The "unallocated space" can't be figured out precisely by
>>>>>> the domain on Arm without hypervisor involvement:
>>>>>> - not all device I/O regions are known by the time domain starts
>>>>>>      creating grant/foreign mappings
>>>>>> - the Dom0 is not aware of memory regions used for the identity
>>>>>>      mappings needed for the PV drivers to work
>>>>>> In both cases we might end up re-using these regions by
>>>>>> a mistake. So, the hypervisor which maintains the P2M for the domain
>>>>>> is in the best position to provide "unallocated space".
>>>>>
>>>>> I'm afraid this does not improve the situation.
>>>>>
>>>>> If a guest follows the advice from XENMEM_get_unallocated_space, and
>>>>> subsequently a new IO or identity region appears, everything will
>>>>> explode, because the "safe area" wasn't actually safe.
>>>>>
>>>>> The safe range *must* be chosen by the toolstack, because nothing else
>>>>> can do it safely or correctly.
>>>>
>>>> The problem is how do you size it? In particular, a backend may map
>>>> the same page multiple times (for instance if the page is granted twice).
>>>
>>> The number of mapped grants is limited by the size of the maptrack table
>>> in Xen, which is a toolstack input to the domaincreate hypercall.
>>> Therefore, the amount of space required is known and bounded.
>>>
>>> There are a handful of other frames required in the current ABI (shared
>>> info, vcpu info, etc).
>>>
>>> The areas where things do become fuzzy is things like foreign mappings,
>>> acquire_resource, etc for the control domain, which are effectively
>>> unbounded from the domain's point of view.
>>>
>>> For those, it's entirely fine to say "here is 128G of safe mapping space" or
>>> so.  Even the quantity of mapping dom0 can make is limited by the shadow
>>> memory pool and the number of pagetables Xen is willing to expend on the
>>> second stage translation tables.
>>
>> FWIW, on Arm, we don't have shadow memory pool.
> 
> Where do you take 2nd level page table memory from then?

 From xenheap directly. And yes, I know this may lead to other issues. 
But that's a known missing features that so far no-one tackled.

Cheers,
Jan Beulich Aug. 3, 2021, 12:53 p.m. UTC | #11
On 30.07.2021 18:13, Oleksandr wrote:
> Well, if new hypercall and, what is more, "the querying hypervisor at 
> runtime to find unused space" model itself is not welcome, I am ok, 
> let's try to create a working system,
> may we please find a common ground to move this forward (at least on Arm 
> for now, the foreign mapping is the most important question).
> 
> I got the proposed idea in general, but I haven't connected all dots 
> yet, some points need clarification.
> 
> 1. The safe range must be defined/allocated in advance and must remain 
> const during the runtime. The safe range must be chosen by the toolstack.
> [For the initial implementation we can start with some large value 
> (128GB) as discussed above]
> 
> Questions:
> 
> - Do we need to inform Xen about that range (via domain_create 
> hypercall, etc)?
> - What will be in charge of guaranteeing the safety of that range at 
> runtime (reject new mapping requests with possible overlaps, etc), Xen, 
> toolstack or both?

Well, what other entity than Xen could enforce this? (By implication,
the answer to the earlier question can imo only be "yes", unless it's
Xen itself to establish the region.)

> - Where that range should be located in guest address space, should that 
> range be the same for all domains (how GUEST_GNTTAB_BASE(SIZE) for example)
> or it should be calculated based on actual guest_ram_base(size) for a 
> particular domain?

The default size may well be fixed or amount-of-memory-dependent, but
I think there will need to be a way to enlarge the region for guests
with particular needs.

> - What about a safe range the Dom0 can use itself? Xen should choose it 
> for Dom0 the same way how toolstack chooses it for other domains, correct?
> 
> 2. The safe range must be provided to domain using the firmware table.
> [We can start with the DTB and leave ACPI unimplemented for now, 
> assuming we will be able to solve open questions as discussed above]
> 
> Questions:
> 
> - Do we need distinguish between foreign and grant mappings at the 
> domain side at all? Can the same safe range be used for all types of 
> mapping?

Like Stefano I don't think so.

Jan
Oleksandr Tyshchenko Aug. 4, 2021, 7:18 p.m. UTC | #12
On 03.08.21 15:53, Jan Beulich wrote:

Hi, Jan

Thank you for the input.

> On 30.07.2021 18:13, Oleksandr wrote:
>> Well, if new hypercall and, what is more, "the querying hypervisor at
>> runtime to find unused space" model itself is not welcome, I am ok,
>> let's try to create a working system,
>> may we please find a common ground to move this forward (at least on Arm
>> for now, the foreign mapping is the most important question).
>>
>> I got the proposed idea in general, but I haven't connected all dots
>> yet, some points need clarification.
>>
>> 1. The safe range must be defined/allocated in advance and must remain
>> const during the runtime. The safe range must be chosen by the toolstack.
>> [For the initial implementation we can start with some large value
>> (128GB) as discussed above]
>>
>> Questions:
>>
>> - Do we need to inform Xen about that range (via domain_create
>> hypercall, etc)?
>> - What will be in charge of guaranteeing the safety of that range at
>> runtime (reject new mapping requests with possible overlaps, etc), Xen,
>> toolstack or both?
> Well, what other entity than Xen could enforce this? (By implication,
> the answer to the earlier question can imo only be "yes", unless it's
> Xen itself to establish the region.)

Indeed, agree.


>
>> - Where that range should be located in guest address space, should that
>> range be the same for all domains (how GUEST_GNTTAB_BASE(SIZE) for example)
>> or it should be calculated based on actual guest_ram_base(size) for a
>> particular domain?
> The default size may well be fixed or amount-of-memory-dependent, but
> I think there will need to be a way to enlarge the region for guests
> with particular needs.
Well, but why couldn't we just make a large chunk by default which would
satisfy all guests, as it was mentioned earlier in this thread "as it 
doesn't consume resource when not being used"
to avoid an extra configuration option, etc?


>
>> - What about a safe range the Dom0 can use itself? Xen should choose it
>> for Dom0 the same way how toolstack chooses it for other domains, correct?
>>
>> 2. The safe range must be provided to domain using the firmware table.
>> [We can start with the DTB and leave ACPI unimplemented for now,
>> assuming we will be able to solve open questions as discussed above]
>>
>> Questions:
>>
>> - Do we need distinguish between foreign and grant mappings at the
>> domain side at all? Can the same safe range be used for all types of
>> mapping?
> Like Stefano I don't think so.

Agree.
Oleksandr Tyshchenko Aug. 4, 2021, 8:56 p.m. UTC | #13
Hi Julien, Stefano.


On 02.08.21 22:12, Oleksandr wrote:
>
> Hi Stefano,
>
>
> Thank you for the comments and ideas.
>
>
> On 31.07.21 02:57, Stefano Stabellini wrote:
>> On Fri, 30 Jul 2021, Oleksandr wrote:
>>> Hi Andrew, Julien.
>>>
>>>
>>> @Andrew, I think that arguments you provided in your first answer 
>>> (why the
>>> proposed solution is a potentially racy and not a safe) are valid and
>>> reasonable.
>>> Thanks for the feedback.
>>>
>>>
>>> On 28.07.21 22:53, Julien Grall wrote:
>>>>
>>>> On 28/07/2021 20:00, Andrew Cooper wrote:
>>>>> On 28/07/2021 18:27, Julien Grall wrote:
>>>>>> Hi Andrew,
>>>>>>
>>>>>> On 28/07/2021 18:19, Andrew Cooper wrote:
>>>>>>> On 28/07/2021 17:18, Oleksandr Tyshchenko wrote:
>>>>>>>> From: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
>>>>>>>>
>>>>>>>> Add XENMEM_get_unallocated_space hypercall which purpose is to
>>>>>>>> query hypervisor to find regions of guest physical address space
>>>>>>>> which are unused and can be used to create grant/foreign mappings
>>>>>>>> instead of wasting real pages from the domain memory for
>>>>>>>> establishing these mappings. The problem with the current Linux
>>>>>>>> on Xen on Arm behaviour is if we want to map some guest memory
>>>>>>>> regions in advance or to perform cache mappings in the backend
>>>>>>>> we might run out of memory in the host (see XSA-300).
>>>>>>>> This of course, depends on the both host and guest memory sizes.
>>>>>>>>
>>>>>>>> The "unallocated space" can't be figured out precisely by
>>>>>>>> the domain on Arm without hypervisor involvement:
>>>>>>>> - not all device I/O regions are known by the time domain starts
>>>>>>>>      creating grant/foreign mappings
>>>>>>>> - the Dom0 is not aware of memory regions used for the identity
>>>>>>>>      mappings needed for the PV drivers to work
>>>>>>>> In both cases we might end up re-using these regions by
>>>>>>>> a mistake. So, the hypervisor which maintains the P2M for the 
>>>>>>>> domain
>>>>>>>> is in the best position to provide "unallocated space".
>>>>>>> I'm afraid this does not improve the situation.
>>>>>>>
>>>>>>> If a guest follows the advice from XENMEM_get_unallocated_space, 
>>>>>>> and
>>>>>>> subsequently a new IO or identity region appears, everything will
>>>>>>> explode, because the "safe area" wasn't actually safe.
>>>>>>>
>>>>>>> The safe range *must* be chosen by the toolstack, because 
>>>>>>> nothing else
>>>>>>> can do it safely or correctly.
>>>>>> The problem is how do you size it? In particular, a backend may map
>>>>>> the same page multiple times (for instance if the page is granted 
>>>>>> twice).
>>>>> The number of mapped grants is limited by the size of the maptrack 
>>>>> table
>>>>> in Xen, which is a toolstack input to the domaincreate hypercall.
>>>>> Therefore, the amount of space required is known and bounded.
>>>>>
>>>>> There are a handful of other frames required in the current ABI 
>>>>> (shared
>>>>> info, vcpu info, etc).
>>>>>
>>>>> The areas where things do become fuzzy is things like foreign 
>>>>> mappings,
>>>>> acquire_resource, etc for the control domain, which are effectively
>>>>> unbounded from the domain's point of view.
>>>>>
>>>>> For those, it's entirely fine to say "here is 128G of safe mapping 
>>>>> space" or
>>>>> so.  Even the quantity of mapping dom0 can make is limited by the 
>>>>> shadow
>>>>> memory pool and the number of pagetables Xen is willing to expend 
>>>>> on the
>>>>> second stage translation tables.
>>>> FWIW, on Arm, we don't have shadow memory pool.
>>>>
>>>> Anyway, it should be possible to give a 128GB "safe range" and let 
>>>> Xen deal
>>>> with it.
>>>>
>>>>>>> Once a safe range (or ranges) has been chosen, any subsequent 
>>>>>>> action
>>>>>>> which overlaps with the ranges must be rejected, as it will violate
>>>>>>> the
>>>>>>> guarantees provided.
>>>>>>>
>>>>>>> Furthermore, the ranges should be made available to the guest via
>>>>>>> normal
>>>>>>> memory map means.  On x86, this is via the E820 table, and on ARM I
>>>>>>> presume the DTB.  There is no need for a new hypercall.
>>>>>> Device-Tree only works if you have a guest using it. How about ACPI?
>>>>> ACPI inherits E820 from x86 (it's a trivial format), and UEFI was also
>>>>> based on it.
>>>>>
>>>>> But whichever...  All firmware interfaces have a memory map.
>>>> This will be UEFI memory map. However, I am a bit confused how we 
>>>> can tell
>>>> the OS the region will be used for grant/foreign mapping. Is it 
>>>> possible to
>>>> reserve a new type?
>>>>
>>>>>> To me the hypercall solution at least:
>>>>>>     1) Avoid to have to define the region on every single 
>>>>>> firmware table
>>>>> There is only ever one.
>>>> Why? I could foresee an interest to use the host memory map and 
>>>> therefore we
>>>> may need to find a few holes for safe regions to use.
>>>>
>>>>>>     2) Allow to easily extend after the guest run
>>>>> The safe ranges can't be changed (safely).  This is the same 
>>>>> problem as
>>>>> needing to know things like your PCI apertures ahead of time, or 
>>>>> where
>>>>> the DIMM hotplug regions are.
>>>>>
>>>>> Having the guest physmap be actually dynamic is the cause of so many
>>>>> bugs (inc security) and misfeatures in Xen.  Guests cannot and do not
>>>>> cope with things being fully dynamic, because that's not how real
>>>>> hardware works.  What you get is layers and layers of breakage on 
>>>>> top of
>>>>> each other, rather than a working system.
>>>> I would not call it "fully dynamic". Xen can easily know whether a 
>>>> region
>>>> has ever been allocated before. So long as the region has never been 
>>>> allocated,
>>>> then it should be fine. In fact...
>>>>
>>>>> The size of mapping space is a limit, just like maxphysaddr, or 
>>>>> the PCI
>>>>> apertures, or MMCFG space, etc.  You can make it large by default 
>>>>> (as it
>>>>> doesn't consume resource when not being used), but any guest OS isn't
>>>>> going to tolerate it morphing during runtime.
>>>> ... I believe the OS may not be aware of the hotplug region until 
>>>> it is
>>>> actually used.
>>>>
>>>> Anyway, I suggested this approach a few months ago to Oleksandr 
>>>> (see [1])
>>>> which BTW you were CCed on ;). My take was that Xen knows about the
>>>> unallocated space and therefore can make an informed decision 
>>>> without having
>>>> to allocate insanely large region.
>>>>
>>>> Now if you think that's fine (IIRC Stefano had a preference for 
>>>> that as
>>>> well). Then we can use the firmware table (assuming we can solve 
>>>> the ACPI
>>>> question).
>>>
>>> Well, if new hypercall and, what is more, "the querying hypervisor 
>>> at runtime
>>> to find unused space" model itself is not welcome, I am ok, let's 
>>> try to
>>> create a working system,
>>> may we please find a common ground to move this forward (at least on 
>>> Arm for
>>> now, the foreign mapping is the most important question).
>>>
>>> I got the proposed idea in general, but I haven't connected all dots 
>>> yet, some
>>> points need clarification.
>>>
>>> 1. The safe range must be defined/allocated in advance and must 
>>> remain const
>>> during the runtime. The safe range must be chosen by the toolstack.
>>> [For the initial implementation we can start with some large value 
>>> (128GB) as
>>> discussed above]
>>>
>>> Questions:
>>>
>>> - Do we need to inform Xen about that range (via domain_create 
>>> hypercall,
>>> etc)?
>>> - What will be in charge of guaranteeing the safety of that range at 
>>> runtime
>>> (reject new mapping requests with possible overlaps, etc), Xen, 
>>> toolstack or
>>> both?
>>> - Where that range should be located in guest address space, should 
>>> that range
>>> be the same for all domains (how GUEST_GNTTAB_BASE(SIZE) for example)
>>> or it should be calculated based on actual guest_ram_base(size) for a
>>> particular domain?
>>> - What about a safe range the Dom0 can use itself? Xen should choose 
>>> it for
>>> Dom0 the same way how toolstack chooses it for other domains, correct?
>> Brainstorming an idea to answer some of these questions.
>>
>> Xen needs to be able to choose a safe range without the toolstack's help
>> at least for dom0 and dom0less guests.
>
> Completely agree.
>
>
>> Could we just have Xen choose the
>> safe range for all domains?
>
> As I understand, the region info (base, size) should be known to 
> whoever generates firmware tables (DTB) for the domain in order to 
> insert that range.
> So, in case Xen chooses that range, it should be queried by the 
> toolstack anyway (either by a new domctl or by extending an existing 
> one whose context would fit).
> This adds some complexity.
>
>
>>
>> Could it be just as simple as choosing a range 1GB-aligned above
>> whatever is the highest address previously allocated for memory, MMIO,
>> whatever?
> Probably, we could choose a range above max mapped gpfn at the domain 
> creation time.
> Currently, we are dealing with platform devices only on Arm and all IO 
> are known in advance, so no new IO are expected at runtime.
> But, what to do when hotplug comes into play (and new conflicting IO 
> might just appear)? On the other side, we might be faced with the same 
> problem,
> if we used a predefined static range. With both options (calculated 
> and predefined range) we would need to deny any new conflicting mapping.
>
>
>>
>> We could also have the toolstack provide the info but I wonder if we can
>> get away without it.
>
> I may not see all pitfalls here, but it seems that having the same 
> predefined range for all domains would simplify things.
> On Arm we have the following:
>
> #define GUEST_RAM1_BASE   xen_mk_ullong(0x0200000000) /* 1016GB of RAM 
> @ 8GB */
> #define GUEST_RAM1_SIZE   xen_mk_ullong(0xfe00000000)
>
> Now I am wondering, can we limit the GUEST_RAM1_SIZE by 128GB for 
> example in order to reserve a range at the end, would this be acceptable?
> So, both Xen and toolstack would see that definition and would be able 
> to insert a range to the generated DTB, Xen's P2M code would be in 
> change of keeping that range in a safe state.
>
>>> 2. The safe range must be provided to domain using the firmware table.
>>> [We can start with the DTB and leave ACPI unimplemented for now, 
>>> assuming we
>>> will be able to solve open questions as discussed above]
>>>
>>> Questions:
>>>
>>> - Do we need distinguish between foreign and grant mappings at the 
>>> domain side
>>> at all? Can the same safe range be used for all types of mapping?
>> This is a good question. I can't think of a reason why we would need to
>> distinguish between the two from Xen's point of view. Would it make
>> the Linux implementation easier if we distinguished them?
> I may be mistaken, but I don't think it would make the Linux 
> implementation easier; both foreign and grant mappings result in the 
> same action, which is xen_alloc_unpopulated_pages() (or 
> alloc_xenballooned_pages()).
> For both mappings the problem of wasting real domain memory on Arm 
> is real at the moment, although with foreign mapping
> for the device emulator use case it becomes more evident.
>
>
>
>>
>>
>>> - How will domain recognize what range can be used for foreign/grant 
>>> mappings?
>>> Do we need to introduce new device-tree bindings for describing the 
>>> range
>>> or it is possible to re-use current bindings (expand the "reg" 
>>> property under
>>> "hypervisor" node, currently it contains gnttab region)?
>> Do we need a struct page* for the range? We needed it in the past and it
>> is the reason why we currenly use ballooned-out pages: they have a
>> proper struct page* associated with them. pte_special() caused problems.
>> So, continuing with the assumption that we need a struct page*, then the
>> range needs to be advertised as "memory" to the DomU (part to of the
>> /memory node). But we can reserve the range for foreign/grant-mapping
>> use by adding a reserved-memory binding for it. If we go down this
>> route, we should add a new binding as I don't think we should reuse
>> reserved-memory/xen,shared-memory.txt. It would be very simply to add,
>> just follow the model of xen,shared-memory.txt. (Note that just because
>> the range is part of the DomU /memory node, it doesn't mean it needs to
>> be backed by real memory.)
>>
>> If we don't need a struct page* then we can do something different using
>> bus ranges and/or MMIO regions.
>
> Yes, we need struct page* for each pfn in the range. But, I think, 
> this can be achieved even without describing the range as a part of 
> memory/reserved-memory nodes. Fortunately, the Linux ZONE_DEVICE facility
> can help with that, actually the "unpopulated-alloc" is based on this 
> feature, please see drivers/xen/unpopulated-alloc.c:fill_list()
> and 
> https://lore.kernel.org/lkml/1627490656-1267-2-git-send-email-olekstysh@gmail.com/ 
> for details.
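
For illustration only, a heavily simplified sketch of that ZONE_DEVICE idea
(this is not the actual drivers/xen/unpopulated-alloc.c code; the function
name and error handling below are made up):

#include <linux/err.h>
#include <linux/memremap.h>
#include <linux/numa.h>
#include <linux/range.h>
#include <linux/slab.h>
#include <linux/types.h>

/* Ask the core MM to create struct page for a host-RAM-free region. */
static int sketch_add_unpopulated_range(u64 base, u64 size)
{
	struct dev_pagemap *pgmap = kzalloc(sizeof(*pgmap), GFP_KERNEL);

	if (!pgmap)
		return -ENOMEM;

	pgmap->type = MEMORY_DEVICE_GENERIC;
	pgmap->range = (struct range) {
		.start = base,
		.end = base + size - 1,
	};
	pgmap->nr_range = 1;

	/* memremap_pages() instantiates struct page for every pfn in range. */
	return PTR_ERR_OR_ZERO(memremap_pages(pgmap, NUMA_NO_NODE));
}

The struct pages obtained this way can then back grant/foreign mappings
without taking them from the domain's RAM allocation.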


I have done some experiments with Xen and toolstack according to the 
discussion above. So, I re-used DTB to pass a safe range to the domain. 
For the range I borrowed some space from the second RAM bank.

-#define GUEST_RAM1_BASE   xen_mk_ullong(0x0200000000) /* 1016GB of RAM 
@ 8GB */
-#define GUEST_RAM1_SIZE   xen_mk_ullong(0xfe00000000)
+#define GUEST_RAM1_BASE   xen_mk_ullong(0x0200000000) /* 888GB of RAM @ 
8GB */
+#define GUEST_RAM1_SIZE   xen_mk_ullong(0xDE00000000)
+
+#define GUEST_SAFE_RANGE_BASE   xen_mk_ullong(0xDE00000000) /* 128GB */
+#define GUEST_SAFE_RANGE_SIZE   xen_mk_ullong(0x0200000000)

While the possible new DT bindings have not been agreed yet, I re-used 
existing "reg" property under the hypervisor node to pass safe range as 
a second region,
https://elixir.bootlin.com/linux/v5.14-rc4/source/Documentation/devicetree/bindings/arm/xen.txt#L10:


--- a/tools/libs/light/libxl_arm.c
+++ b/tools/libs/light/libxl_arm.c
@@ -735,9 +735,11 @@ static int make_hypervisor_node(libxl__gc *gc, void 
*fdt,
                                "xen,xen");
      if (res) return res;

-    /* reg 0 is grant table space */
+    /* reg 0 is grant table space, reg 1 is safe range */
      res = fdt_property_regs(gc, fdt, GUEST_ROOT_ADDRESS_CELLS, 
GUEST_ROOT_SIZE_CELLS,
-                            1,GUEST_GNTTAB_BASE, GUEST_GNTTAB_SIZE);
+                            2,
+                            GUEST_GNTTAB_BASE, GUEST_GNTTAB_SIZE,
+                            GUEST_SAFE_RANGE_BASE, GUEST_SAFE_RANGE_SIZE);
      if (res) return res;

      /*


/* Resulting hypervisor node */

  hypervisor {
                 interrupts = <0x01 0x0f 0xf08>;
                 interrupt-parent = <0xfde8>;
                 compatible = "xen,xen-4.16\0xen,xen";
                 reg = <0x00 0x38000000 0x00 0x1000000 0xde 0x00 0x02 0x00>;
  };


I did much the same for Xen itself to insert a range for Dom0. The 
Linux-side change is just to retrieve the range from the DTB instead of 
issuing a hypercall.
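
For completeness, that retrieval can be as simple as the sketch below
(plain of_* API against the "xen,xen" node; the function name is made up
and error handling is minimal):

#include <linux/errno.h>
#include <linux/ioport.h>
#include <linux/of.h>
#include <linux/of_address.h>

static int sketch_get_safe_range(struct resource *res)
{
	struct device_node *xen_node;
	int ret;

	xen_node = of_find_compatible_node(NULL, NULL, "xen,xen");
	if (!xen_node)
		return -ENODEV;

	/* reg index 0 is the grant table space, index 1 the safe range. */
	ret = of_address_to_resource(xen_node, 1, res);
	of_node_put(xen_node);

	return ret;
}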

Sorry, I might miss some important bits here, but from what I wrote 
about the "reg" purpose, it seems it could be suitable for us, why 
actually not? Why do we need yet another binding?
I noticed, Linux on Arm doesn't use it at all, probably it is used by 
other OSes, I don't know.

Now, I am wondering, would it be possible to update/clarify the current 
"reg" purpose and use it to pass a safe unallocated space for any Xen 
specific mappings (grant, foreign, whatever) instead of just for the 
grant table region. In case, it is not allowed for any reason 
(compatibility PoV, etc), would it be possible to extend a property by 
passing an extra range separately, something similar to how I described 
above?
Julien Grall Aug. 4, 2021, 10 p.m. UTC | #14
On 04/08/2021 21:56, Oleksandr wrote:
> 
> Hi Julien, Stefano.

Hi Oleksandr,

> 
> On 02.08.21 22:12, Oleksandr wrote:
> I have done some experiments with Xen and toolstack according to the 
> discussion above. So, I re-used DTB to pass a safe range to the domain. 
> For the range I borrowed some space from the second RAM bank.
> 
> -#define GUEST_RAM1_BASE   xen_mk_ullong(0x0200000000) /* 1016GB of RAM 
> @ 8GB */
> -#define GUEST_RAM1_SIZE   xen_mk_ullong(0xfe00000000)
> +#define GUEST_RAM1_BASE   xen_mk_ullong(0x0200000000) /* 888GB of RAM @ 
> 8GB */
> +#define GUEST_RAM1_SIZE   xen_mk_ullong(0xDE00000000)
> +

I am a bit split with reducing the amount of RAM. On one hand large 
guest is not unheard on the server side (at least in the x86 world). On 
the other hand, I am not aware of anyone using Xen on Arm in such setup.

So technically this will be a regression, but it may be OK.

Regarding the range, this will be a problem as Xen configures the number 
of the IPA bits based on the PA bits. The lowest possible address space 
size on 64-bit is 4GB.

 From my understanding, this is because the number of IPA bits supported 
is constrained by the PA bits. So the position and the size of the region
would need to depend on the P2M configuration.

For simplicity, this could be the last few X bytes of the supported 
address space.
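
As a rough sketch of what I mean (the size and helper name below are made
up, and any overlap checks against RAM/MMIO are left out):

#include <xen/errno.h>
#include <xen/types.h>

#define SAFE_RANGE_SIZE   (128ULL << 30)    /* assumed default, 128GB */

/* Carve the safe range out of the top of the guest's IPA space. */
static int sketch_safe_range(unsigned int ipa_bits,
                             uint64_t *base, uint64_t *size)
{
    uint64_t end = 1ULL << ipa_bits;

    if ( end <= SAFE_RANGE_SIZE )
        return -ENOSPC;     /* no room: don't advertise a range at all */

    *base = end - SAFE_RANGE_SIZE;
    *size = SAFE_RANGE_SIZE;

    return 0;
}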

For 32-bit domain, we also need to make sure the address is usable for 
domain short page tables (not too long ago Debian was shipping the 
kernel with them rather than LPAE). I haven't yet checked what's the 
limit here.

> +#define GUEST_SAFE_RANGE_BASE   xen_mk_ullong(0xDE00000000) /* 128GB */
> +#define GUEST_SAFE_RANGE_SIZE   xen_mk_ullong(0x0200000000)
> 
> While the possible new DT bindings have not been agreed yet, I re-used 
> existing "reg" property under the hypervisor node to pass safe range as 
> a second region,
> https://elixir.bootlin.com/linux/v5.14-rc4/source/Documentation/devicetree/bindings/arm/xen.txt#L10: 

So a single region works for a guest today, but for dom0 we will need 
multiple regions because it may be difficult to find enough 
contiguous space for a single region.

That said, as dom0 is mapped 1:1 (including some guest mapping), there 
is also the question where to allocate the safe region. For grant table, 
we so far re-use the Xen address space because it is assumed that space 
will always be bigger than the grant table.

I am not sure yet where we could allocate the safe regions. Stefano, do 
you have any ideas?

> 
> 
> 
> --- a/tools/libs/light/libxl_arm.c
> +++ b/tools/libs/light/libxl_arm.c
> @@ -735,9 +735,11 @@ static int make_hypervisor_node(libxl__gc *gc, void 
> *fdt,
>                                 "xen,xen");
>       if (res) return res;
> 
> -    /* reg 0 is grant table space */
> +    /* reg 0 is grant table space, reg 1 is safe range */
>       res = fdt_property_regs(gc, fdt, GUEST_ROOT_ADDRESS_CELLS, 
> GUEST_ROOT_SIZE_CELLS,
> -                            1,GUEST_GNTTAB_BASE, GUEST_GNTTAB_SIZE);
> +                            2,
> +                            GUEST_GNTTAB_BASE, GUEST_GNTTAB_SIZE,
> +                            GUEST_SAFE_RANGE_BASE, GUEST_SAFE_RANGE_SIZE);
>       if (res) return res;
> 
>       /*
> 
> 
> /* Resulting hypervisor node */
> 
>   hypervisor {
>                  interrupts = <0x01 0x0f 0xf08>;
>                  interrupt-parent = <0xfde8>;
>                  compatible = "xen,xen-4.16\0xen,xen";
>                  reg = <0x00 0x38000000 0x00 0x1000000 0xde 0x00 0x02 
> 0x00>;
>   };
> 
> 
> I did much the same for Xen itself to insert a range for Dom0. The 
> Linux-side change is just to retrieve the range from the DTB instead of 
> issuing a hypercall.
> 
> Sorry, I might miss some important bits here, but from what I wrote 
> about the "reg" purpose, it seems it could be suitable for us, why 
> actually not? Why do we need yet another binding?
> I noticed, Linux on Arm doesn't use it at all, probably it is used by 
> other OSes, I don't know.

Linux used the range until 4.7. This was dropped by commit 
3cf4095d7446efde28b48c26050b9db6f0bcb004 so the same code can be used by 
ACPI and DT. However, looking at this now, I think this was a bad 
decision because it means we are shattering superpages.

So ideally we should switch back the region to use the safe address 
space once this is in place.

> 
> Now, I am wondering, would it be possible to update/clarify the current 
> "reg" purpose and use it to pass a safe unallocated space for any Xen 
> specific mappings (grant, foreign, whatever) instead of just for the 
> grant table region. In case, it is not allowed for any reason 
> (compatibility PoV, etc), would it be possible to extend a property by 
> passing an extra range separately, something similar to how I described 
> above?

I think it should be fine to re-use the same region so long as the size of 
the first bank is at least the size of the original region.

I also think we should be able to add extra regions as OSes are unlikely 
to enforce that the "reg" contains a single region.

That said, we need to be careful about new guests as the region may be 
quite small on older Xen. So we would need some heuristic to decide 
whether to steal some RAM or use the safe space.

Another possibility would be to add a new compatible in the DT that 
indicates the region is "big" enough.

Cheers,
Jan Beulich Aug. 5, 2021, 5:58 a.m. UTC | #15
On 04.08.2021 21:18, Oleksandr wrote:
> On 03.08.21 15:53, Jan Beulich wrote:
>> On 30.07.2021 18:13, Oleksandr wrote:
>>> - Where that range should be located in guest address space, should that
>>> range be the same for all domains (how GUEST_GNTTAB_BASE(SIZE) for example)
>>> or it should be calculated based on actual guest_ram_base(size) for a
>>> particular domain?
>> The default size may well be fixed or amount-of-memory-dependent, but
>> I think there will need to be a way to enlarge the region for guests
>> with particular needs.
> Well, but why couldn't we just make a large chunk by default which would
> satisfy all guests, as it was mentioned earlier in this thread "as it 
> doesn't consume resource when not being used"
> to avoid an extra configuration option, etc?

Because experience tells me that no static upper limit will ever please
everyone.

Jan
Oleksandr Tyshchenko Aug. 5, 2021, 2:52 p.m. UTC | #16
On 05.08.21 01:00, Julien Grall wrote:
>
>
> On 04/08/2021 21:56, Oleksandr wrote:
>>
>> Hi Julien, Stefano.
>
> Hi Oleksandr,


Hi, Julien


Thank you for the prompt reply and explanations.


>
>>
>> On 02.08.21 22:12, Oleksandr wrote:
>> I have done some experiments with Xen and toolstack according to the 
>> discussion above. So, I re-used DTB to pass a safe range to the 
>> domain. For the range I borrowed some space from the second RAM bank.
>>
>> -#define GUEST_RAM1_BASE   xen_mk_ullong(0x0200000000) /* 1016GB of 
>> RAM @ 8GB */
>> -#define GUEST_RAM1_SIZE   xen_mk_ullong(0xfe00000000)
>> +#define GUEST_RAM1_BASE   xen_mk_ullong(0x0200000000) /* 888GB of 
>> RAM @ 8GB */
>> +#define GUEST_RAM1_SIZE   xen_mk_ullong(0xDE00000000)
>> +
>
> I am a bit split with reducing the amount of RAM. On one hand large 
> guest is not unheard on the server side (at least in the x86 world). 
> On the other hand, I am not aware of anyone using Xen on Arm in such 
> setup.
>
> So technically this will be a regression, but it may be OK.

I got it.


>
>
> Regarding the range, this will be a problem as Xen configures the 
> number of the IPA bits based on the PA bits. The lowest possible 
> address space size on 64-bit is 4GB.
>
> From my understanding, this is because the number of IPA bits 
> supported is constrained by the PA bits. So the position and the size 
> of the region
> would need to depend on the P2M configuration.

Indeed, I missed these bits that IPA bits on Arm64 might be < 40 bit, I 
remember, we select p2m_ipa_bits in setup_virt_paging() depending on 
pabits, moreover the p2m_ipa_bits might be even restricted by some 
external entity (IOMMU, if P2M is shared).


>
> For simplicity, this could be the last few X bytes of the supported 
> address space.
ok, agree. To summarize, so it sounds like we can't use the fixed safe 
range as in my example, it must be variable. Well, I hope, we will be 
able to achieve this without reducing the total amount of domain RAM in 
front (GUEST_RAM1_SIZE). After all, we know the IPA size and the domain 
RAM in advance, so we certainly can choose the start and size of the 
range. In case, we won't be able to find a suitable large chunk (for 
example, when IPA bits = 32, and domain has a lot of RAM assigned and as 
the result - almost all address space below 4GB is in use), we won't 
expose a safe range to a domain at all, and domain will just fall back 
to use real pages instead (actually, how it currently behaves on Arm).

A side note: we would likely need the toolstack (that generates 
device-tree for guests) to query IPA size, or similar.


>
>
> For 32-bit domain, we also need to make sure the address is usable for 
> domain short page tables (not too long ago Debian was shipping the 
> kernel with them rather than LPAE). I haven't yet checked what's the 
> limit here.

Hmm, I didn't take this use-case into account. So, I assume we need 
the safe range to be located below 4GB if is_32bit_domain() returns true.


>
>
>> +#define GUEST_SAFE_RANGE_BASE xen_mk_ullong(0xDE00000000) /* 128GB */
>> +#define GUEST_SAFE_RANGE_SIZE   xen_mk_ullong(0x0200000000)
>>
>> While the possible new DT bindings have not been agreed yet, I re-used 
>> existing "reg" property under the hypervisor node to pass safe range 
>> as a second region,
>> https://elixir.bootlin.com/linux/v5.14-rc4/source/Documentation/devicetree/bindings/arm/xen.txt#L10: 
>
>
> So a single region works for a guest today, but for dom0 we will need 
>> multiple regions because it may be difficult to find enough 
> contiguous space for a single region.
>
> That said, as dom0 is mapped 1:1 (including some guest mapping), there 
> is also the question where to allocate the safe region. For grant 
>> table, we so far re-use the Xen address space because it is assumed 
>> that space will always be bigger than the grant table.

Oh, I see. Indeed, it is unclear at the moment. Agree, the possibility 
to provide multiple ranges should be envisaged.


>
>
> I am not sure yet where we could allocate the safe regions. Stefano, 
> do you have any ideas?
>
>>
>>
>>
>> --- a/tools/libs/light/libxl_arm.c
>> +++ b/tools/libs/light/libxl_arm.c
>> @@ -735,9 +735,11 @@ static int make_hypervisor_node(libxl__gc *gc, 
>> void *fdt,
>>                                 "xen,xen");
>>       if (res) return res;
>>
>> -    /* reg 0 is grant table space */
>> +    /* reg 0 is grant table space, reg 1 is safe range */
>>       res = fdt_property_regs(gc, fdt, GUEST_ROOT_ADDRESS_CELLS, 
>> GUEST_ROOT_SIZE_CELLS,
>> -                            1,GUEST_GNTTAB_BASE, GUEST_GNTTAB_SIZE);
>> +                            2,
>> +                            GUEST_GNTTAB_BASE, GUEST_GNTTAB_SIZE,
>> +                            GUEST_SAFE_RANGE_BASE, 
>> GUEST_SAFE_RANGE_SIZE);
>>       if (res) return res;
>>
>>       /*
>>
>>
>> /* Resulting hypervisor node */
>>
>>   hypervisor {
>>                  interrupts = <0x01 0x0f 0xf08>;
>>                  interrupt-parent = <0xfde8>;
>>                  compatible = "xen,xen-4.16\0xen,xen";
>>                  reg = <0x00 0x38000000 0x00 0x1000000 0xde 0x00 0x02 
>> 0x00>;
>>   };
>>
>>
>> I did much the same for Xen itself to insert a range for Dom0. 
>> The Linux-side change is just to retrieve the range from the DTB instead 
>> of issuing a hypercall.
>>
>> Sorry, I might miss some important bits here, but from what I wrote 
>> about the "reg" purpose, it seems it could be suitable for us, why 
>> actually not? Why do we need yet another binding?
>> I noticed, Linux on Arm doesn't use it at all, probably it is used by 
>> other OSes, I don't know.
>
> Linux used the range until 4.7. This was dropped by commit 
> 3cf4095d7446efde28b48c26050b9db6f0bcb004 so the same code can be used 
> by ACPI and DT. However, looking at this now, I think this was a bad 
> decision because it means we are shattering superpages.

Yes, another benefit of this enabling work (besides not wasting 
real domain pages) would be avoiding shattering superpages, and as a 
result the I/O performance for DMA devices behind the IOMMU would be 
slightly better, I hope.


>
>
> So ideally we should switch back the region to use the safe address 
> space once this is in place.

Yes.


>
>
>>
>> Now, I am wondering, would it be possible to update/clarify the 
>> current "reg" purpose and use it to pass a safe unallocated space for 
>> any Xen specific mappings (grant, foreign, whatever) instead of just 
>> for the grant table region. In case, it is not allowed for any reason 
>> (compatibility PoV, etc), would it be possible to extend a property 
>> by passing an extra range separately, something similar to how I 
>> described above?
>
> I think it should be fine to re-use the same region so long as the size 
> of the first bank is at least the size of the original region.

ok


>
> I also think we should be able to add extra regions as OSes are 
> unlikely to enforce that the "reg" contains a single region.
>
> That said, we need to be careful about new guests as the region may be 
> quite small on older Xen. 

Agree.


> So we would need some heuristic to decide whether to steal some RAM or 
> use the safe space.
> Another possibility would be to add a new compatible in the DT that 
> indicates the region is "big" enough.
I like the last idea, did you perhaps mean new property (optional) 
rather than new compatible? Let's say "xen, safe-range" or "xen, 
extended-regions"  ...
That hint would tell us how to operate: either pick up the safe range(s) 
or behave as usual (steal real pages), as the existing region is indeed small 
- 16MB (if I am not mistaken); this would work with both old and new Xen.
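
In other words, the guest-side heuristic could be as simple as the sketch
below (the threshold is picked arbitrarily for illustration):

#include <linux/ioport.h>
#include <linux/sizes.h>
#include <linux/types.h>

/* Heuristic sketch: an old Xen only exposes the small (~16MB) grant table
 * region, so only treat the advertised range as a safe space for
 * unpopulated allocations when it is clearly bigger than that. */
static bool sketch_use_extended_region(const struct resource *r)
{
	return resource_size(r) >= SZ_64M;
}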


>
>
> Cheers,
>
Daniel P. Smith Aug. 5, 2021, 3:03 p.m. UTC | #17
On 7/28/21 12:18 PM, Oleksandr Tyshchenko wrote:
> From: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>

...

> diff --git a/xen/common/memory.c b/xen/common/memory.c
> index e07bd9a..3f9b816 100644
> --- a/xen/common/memory.c
> +++ b/xen/common/memory.c
> @@ -1811,6 +1811,62 @@ long do_memory_op(unsigned long cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
>              start_extent);
>          break;
>  
> +    case XENMEM_get_unallocated_space:
> +    {
> +        struct xen_get_unallocated_space xgus;
> +        struct xen_unallocated_region *regions;
> +
> +        if ( unlikely(start_extent) )
> +            return -EINVAL;
> +
> +        if ( copy_from_guest(&xgus, arg, 1) ||
> +             !guest_handle_okay(xgus.buffer, xgus.nr_regions) )
> +            return -EFAULT;
> +
> +        d = rcu_lock_domain_by_any_id(xgus.domid);
> +        if ( d == NULL )
> +            return -ESRCH;
> +
> +        rc = xsm_get_unallocated_space(XSM_HOOK, d);

Not sure if you are aware but XSM_HOOK is a no-op check, meaning that
you are allowing any domain to do this operation on any other domain. In
most cases there is an XSM check at the beginning of the hypercall
processing to do an initial clamp down but I am pretty sure there is no
prior XSM check on this path. Based on my understanding of how this is
intended, which may be incorrect, but I think you would actually want
XSM_TARGET.

> +        if ( rc )
> +        {
> +            rcu_unlock_domain(d);
> +            return rc;
> +        }
> +
> +        if ( !xgus.nr_regions || xgus.nr_regions > XEN_MAX_UNALLOCATED_REGIONS )
> +        {
> +            rcu_unlock_domain(d);
> +            return -EINVAL;
> +        }
> +
> +        regions = xzalloc_array(struct xen_unallocated_region, xgus.nr_regions);
> +        if ( !regions )
> +        {
> +            rcu_unlock_domain(d);
> +            return -ENOMEM;
> +        }
> +
> +        rc = arch_get_unallocated_space(d, regions, &xgus.nr_regions);
> +        if ( rc )
> +            goto unallocated_out;
> +
> +        if ( __copy_to_guest(xgus.buffer, regions, xgus.nr_regions) )
> +        {
> +            rc = -EFAULT;
> +            goto unallocated_out;
> +        }
> +
> +        if ( __copy_to_guest(arg, &xgus, 1) )
> +            rc = -EFAULT;
> +
> +unallocated_out:
> +        rcu_unlock_domain(d);
> +        xfree(regions);
> +
> +        break;
> +    }
> +
>      default:
>          rc = arch_memory_op(cmd, arg);
>          break;

...

> diff --git a/xen/include/xsm/dummy.h b/xen/include/xsm/dummy.h
> index 363c6d7..2761fbd 100644
> --- a/xen/include/xsm/dummy.h
> +++ b/xen/include/xsm/dummy.h
> @@ -772,3 +772,9 @@ static XSM_INLINE int xsm_domain_resource_map(XSM_DEFAULT_ARG struct domain *d)
>      XSM_ASSERT_ACTION(XSM_DM_PRIV);
>      return xsm_default_action(action, current->domain, d);
>  }
> +
> +static XSM_INLINE int xsm_get_unallocated_space(XSM_DEFAULT_ARG struct domain *d)
> +{
> +    XSM_ASSERT_ACTION(XSM_HOOK);

For completeness, if you switch to XSM_TARGET at the call site, you will
want to change it here as well.

> +    return xsm_default_action(action, current->domain, d);
> +}
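
Concretely, both spots would then read roughly as follows (a sketch
against this RFC only):

        rc = xsm_get_unallocated_space(XSM_TARGET, d);

and

static XSM_INLINE int xsm_get_unallocated_space(XSM_DEFAULT_ARG struct domain *d)
{
    XSM_ASSERT_ACTION(XSM_TARGET);
    return xsm_default_action(action, current->domain, d);
}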

V/r,
Daniel P. Smith
Oleksandr Tyshchenko Aug. 5, 2021, 3:10 p.m. UTC | #18
On 05.08.21 08:58, Jan Beulich wrote:

Hi Jan.

> On 04.08.2021 21:18, Oleksandr wrote:
>> On 03.08.21 15:53, Jan Beulich wrote:
>>> On 30.07.2021 18:13, Oleksandr wrote:
>>>> - Where that range should be located in guest address space, should that
>>>> range be the same for all domains (how GUEST_GNTTAB_BASE(SIZE) for example)
>>>> or it should be calculated based on actual guest_ram_base(size) for a
>>>> particular domain?
>>> The default size may well be fixed or amount-of-memory-dependent, but
>>> I think there will need to be a way to enlarge the region for guests
>>> with particular needs.
>> Well, but why couldn't we just make a large chunk by default which would
>> satisfy all guests, as it was mentioned earlier in this thread "as it
>> doesn't consume resource when not being used"
>> to avoid an extra configuration option, etc?
> Because experience tells me that no static upper limit will ever please
> everyone.

ok, thank you for the explanation.


>
> Jan
>
Oleksandr Tyshchenko Aug. 5, 2021, 3:59 p.m. UTC | #19
On 05.08.21 18:03, Daniel P. Smith wrote:

Hi Daniel.

> On 7/28/21 12:18 PM, Oleksandr Tyshchenko wrote:
>> From: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
> ...
>
>> diff --git a/xen/common/memory.c b/xen/common/memory.c
>> index e07bd9a..3f9b816 100644
>> --- a/xen/common/memory.c
>> +++ b/xen/common/memory.c
>> @@ -1811,6 +1811,62 @@ long do_memory_op(unsigned long cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
>>               start_extent);
>>           break;
>>   
>> +    case XENMEM_get_unallocated_space:
>> +    {
>> +        struct xen_get_unallocated_space xgus;
>> +        struct xen_unallocated_region *regions;
>> +
>> +        if ( unlikely(start_extent) )
>> +            return -EINVAL;
>> +
>> +        if ( copy_from_guest(&xgus, arg, 1) ||
>> +             !guest_handle_okay(xgus.buffer, xgus.nr_regions) )
>> +            return -EFAULT;
>> +
>> +        d = rcu_lock_domain_by_any_id(xgus.domid);
>> +        if ( d == NULL )
>> +            return -ESRCH;
>> +
>> +        rc = xsm_get_unallocated_space(XSM_HOOK, d);
> Not sure if you are aware but XSM_HOOK is a no-op check, meaning that
> you are allowing any domain to do this operation on any other domain. In
> most cases there is an XSM check at the beginning of the hypercall
> processing to do an initial clamp down but I am pretty sure there is no
> prior XSM check on this path. Based on my understanding of how this is
> intended, which may be incorrect, but I think you would actually want
> XSM_TARGET.
Thank you for pointing this out.
I am aware what the XSM_HOOK is, but I was thinking what the default 
action would be better suited for that hypercall, and failed to think of 
a better alternative.
I was going to choose XSM_TARGET, but the description "/* Can perform on 
self or your target domain */" confused me a bit, as there was no target 
domain involved as I thought, XSM_PRIV
sounded too strictly to me, etc. So, I decided to leave a "hook" for the 
RFC version. But, now I see that XSM_TARGET might be indeed better 
choice across all possible variants.


>
>> +        if ( rc )
>> +        {
>> +            rcu_unlock_domain(d);
>> +            return rc;
>> +        }
>> +
>> +        if ( !xgus.nr_regions || xgus.nr_regions > XEN_MAX_UNALLOCATED_REGIONS )
>> +        {
>> +            rcu_unlock_domain(d);
>> +            return -EINVAL;
>> +        }
>> +
>> +        regions = xzalloc_array(struct xen_unallocated_region, xgus.nr_regions);
>> +        if ( !regions )
>> +        {
>> +            rcu_unlock_domain(d);
>> +            return -ENOMEM;
>> +        }
>> +
>> +        rc = arch_get_unallocated_space(d, regions, &xgus.nr_regions);
>> +        if ( rc )
>> +            goto unallocated_out;
>> +
>> +        if ( __copy_to_guest(xgus.buffer, regions, xgus.nr_regions) )
>> +        {
>> +            rc = -EFAULT;
>> +            goto unallocated_out;
>> +        }
>> +
>> +        if ( __copy_to_guest(arg, &xgus, 1) )
>> +            rc = -EFAULT;
>> +
>> +unallocated_out:
>> +        rcu_unlock_domain(d);
>> +        xfree(regions);
>> +
>> +        break;
>> +    }
>> +
>>       default:
>>           rc = arch_memory_op(cmd, arg);
>>           break;
> ...
>
>> diff --git a/xen/include/xsm/dummy.h b/xen/include/xsm/dummy.h
>> index 363c6d7..2761fbd 100644
>> --- a/xen/include/xsm/dummy.h
>> +++ b/xen/include/xsm/dummy.h
>> @@ -772,3 +772,9 @@ static XSM_INLINE int xsm_domain_resource_map(XSM_DEFAULT_ARG struct domain *d)
>>       XSM_ASSERT_ACTION(XSM_DM_PRIV);
>>       return xsm_default_action(action, current->domain, d);
>>   }
>> +
>> +static XSM_INLINE int xsm_get_unallocated_space(XSM_DEFAULT_ARG struct domain *d)
>> +{
>> +    XSM_ASSERT_ACTION(XSM_HOOK);
> For completeness, if you switch to XSM_TARGET at the call site, you will
> want to change it here as well.

Yes.


>
>> +    return xsm_default_action(action, current->domain, d);
>> +}
> V/r,
> Daniel P. Smith
Daniel P. Smith Aug. 5, 2021, 4:37 p.m. UTC | #20
Hey Oleksandr!

On 8/5/21 11:59 AM, Oleksandr wrote:
> 
> On 05.08.21 18:03, Daniel P. Smith wrote:
> 
> Hi Daniel.
> 
>> On 7/28/21 12:18 PM, Oleksandr Tyshchenko wrote:
>>> From: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
>> ...
>>
>>> diff --git a/xen/common/memory.c b/xen/common/memory.c
>>> index e07bd9a..3f9b816 100644
>>> --- a/xen/common/memory.c
>>> +++ b/xen/common/memory.c
>>> @@ -1811,6 +1811,62 @@ long do_memory_op(unsigned long cmd,
>>> XEN_GUEST_HANDLE_PARAM(void) arg)
>>>               start_extent);
>>>           break;
>>>   +    case XENMEM_get_unallocated_space:
>>> +    {
>>> +        struct xen_get_unallocated_space xgus;
>>> +        struct xen_unallocated_region *regions;
>>> +
>>> +        if ( unlikely(start_extent) )
>>> +            return -EINVAL;
>>> +
>>> +        if ( copy_from_guest(&xgus, arg, 1) ||
>>> +             !guest_handle_okay(xgus.buffer, xgus.nr_regions) )
>>> +            return -EFAULT;
>>> +
>>> +        d = rcu_lock_domain_by_any_id(xgus.domid);
>>> +        if ( d == NULL )
>>> +            return -ESRCH;
>>> +
>>> +        rc = xsm_get_unallocated_space(XSM_HOOK, d);
>> Not sure if you are aware but XSM_HOOK is a no-op check, meaning that
>> you are allowing any domain to do this operation on any other domain. In
>> most cases there is an XSM check at the beginning of the hypercall
>> processing to do an initial clamp down but I am pretty sure there is no
>> prior XSM check on this path. Based on my understanding of how this is
>> intended, which may be incorrect, but I think you would actually want
>> XSM_TARGET.
> Thank you for pointing this out.
> I am aware what the XSM_HOOK is, but I was thinking what the default
> action would be better suited for that hypercall, and failed to think of
> a better alternative.
> I was going to choose XSM_TARGET, but the description "/* Can perform on
> self or your target domain */" confused me a bit, as there was no target
> domain involved as I thought, XSM_PRIV
> sounded too strictly to me, etc. So, I decided to leave a "hook" for the
> RFC version. But, now I see that XSM_TARGET might be indeed better
> choice across all possible variants.

If you unravel the craftiness that is xsm_default_action, there is
actually a bit of hierarchy there. If you set the default_action to
XSM_TARGET, it will first check if calling domain(src) is the target,
then falls into the XSM_DM_PRIV check which is if src->target == target,
and then finally checks if is_control_domain(src). That will constrict
the operation so that a domain can call it on itself, a device model
domain (stubdom) can call it on the domain it is backing, and the
control domain can make the call. I am not a 100% sure on this but I do
not believe a hardware domain would be able to make the call with it set
to XSM_TARGET and not employing Flask. Hope this helps.
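
To illustrate the fall-through, a condensed paraphrase (not the exact Xen
source; the XSM_XS_PRIV case and speculation hardening are omitted):

/* Condensed sketch of the xsm_default_action() hierarchy described above. */
static int sketch_default_action(xsm_default_t action,
                                 struct domain *src, struct domain *target)
{
    switch ( action )
    {
    case XSM_HOOK:
        return 0;                      /* no-op: any domain passes */
    case XSM_TARGET:
        if ( src == target )
            return 0;                  /* a domain acting on itself */
        /* fall through */
    case XSM_DM_PRIV:
        if ( target && src->target == target )
            return 0;                  /* device model domain on its target */
        /* fall through */
    case XSM_PRIV:
        if ( is_control_domain(src) )
            return 0;                  /* the control domain */
        return -EPERM;
    default:
        return -EPERM;
    }
}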
Julien Grall Aug. 5, 2021, 5:25 p.m. UTC | #21
Hi Oleksandr,

On 05/08/2021 15:52, Oleksandr wrote:
> 
> On 05.08.21 01:00, Julien Grall wrote:
>>
>>
>> On 04/08/2021 21:56, Oleksandr wrote:
>>>
>>> Hi Julien, Stefano.
>>
>> Hi Oleksandr,
> 
> 
> Hi, Julien
> 
> 
> Thank you for the prompt reply and explanations.
> 
> 
>>
>>>
>>> On 02.08.21 22:12, Oleksandr wrote:
>>> I have done some experiments with Xen and toolstack according to the 
>>> discussion above. So, I re-used DTB to pass a safe range to the 
>>> domain. For the range I borrowed some space from the second RAM bank.
>>>
>>> -#define GUEST_RAM1_BASE   xen_mk_ullong(0x0200000000) /* 1016GB of 
>>> RAM @ 8GB */
>>> -#define GUEST_RAM1_SIZE   xen_mk_ullong(0xfe00000000)
>>> +#define GUEST_RAM1_BASE   xen_mk_ullong(0x0200000000) /* 888GB of 
>>> RAM @ 8GB */
>>> +#define GUEST_RAM1_SIZE   xen_mk_ullong(0xDE00000000)
>>> +
>>
>> I am a bit split with reducing the amount of RAM. On one hand large 
>> guest is not unheard on the server side (at least in the x86 world). 
>> On the other hand, I am not aware of anyone using Xen on Arm in such 
>> setup.
>>
>> So technically this will be a regression, but it may be OK.
> 
> I got it.
> 
> 
>>
>>
>> Regarding the range, this will be a problem as Xen configures the 
>> number of the IPA bits based on the PA bits. The lowest possible 
>> address space size on 64-bit is 4GB.
>>
>> From my understanding, this is because the number of IPA bits 
>> supported is constrained by the PA bits. So the position and the size 
>> of the region
>> would need to depend on the P2M configuration.
> 
> Indeed, I missed these bits that IPA bits on Arm64 might be < 40 bit, I 
> remember, we select p2m_ipa_bits in setup_virt_paging() depending on 
> pabits, moreover the p2m_ipa_bits might be even restricted by some 
> external entity (IOMMU, if P2M is shared).
> 
> 
>>
>> For simplicity, this could be the last few X bytes of the supported 
>> address space.
> ok, agree. To summarize, so it sounds like we can't use the fixed safe 
> range as in my example, it must be variable. Well, I hope, we will be 
> able to achieve this without reducing the total amount of domain RAM in 
> front (GUEST_RAM1_SIZE). After all, we know the IPA size and the domain 
> RAM in advance, so we certainly can choose the start and size of the 
> range. In case, we won't be able to find a suitable large chunk (for 
> example, when IPA bits = 32, and domain has a lot of RAM assigned and as 
> the result - almost all address space below 4GB is in use), we won't 
> expose a safe range to a domain at all, and domain will just fall back 
> to use real pages instead (actually, how it currently behaves on Arm).

I think it would be fine for a first approach. We can refine it in the 
future. What matters is that we correctly define the binding/hypercall.

> 
> A side note: we would likely need the toolstack (that generates 
> device-tree for guests) to query IPA size, or similar.

I think we can use XEN_SYSCTL_* (or possibly hypfs) for that.

>>
>>
>> For 32-bit domain, we also need to make sure the address is usable for 
>> domain short page tables (not too long ago Debian was shipping the 
>> kernel with them rather than LPAE). I haven't yet checked what's the 
>> limit here.
> 
> Hmm, I didn't take this use-case into account. So, I assume we need 
> the safe range to be located below 4GB if is_32bit_domain() returns true.

Yes. Or we can say that if you are using a 32-bit domain then we don't 
(yet) support a safe range for it.
>> So we would need some heuristic to decide whether to steal some RAM or 
>> use the safe space.
>> Another possibility would be to add a new compatible in the DT that 
>> indicates the region is "big" enough.
> I like the last idea, did you perhaps mean new property (optional) 
> rather than new compatible? Let's say "xen, safe-range" or "xen, 
> extended-regions"  ...

I actually meant adding an extra compatible because this is technically 
a change of the binding (even though it is backward compatible).

Although, I would be OK with the property. You may first want to ask the 
Device-Tree folks how they expect a binding to be extended in a backward 
compatible way.

Cheers,
Oleksandr Tyshchenko Aug. 5, 2021, 8:48 p.m. UTC | #22
On 05.08.21 20:25, Julien Grall wrote:
> Hi Oleksandr,

Hi Julien


>
> On 05/08/2021 15:52, Oleksandr wrote:
>>
>> On 05.08.21 01:00, Julien Grall wrote:
>>>
>>>
>>> On 04/08/2021 21:56, Oleksandr wrote:
>>>>
>>>> Hi Julien, Stefano.
>>>
>>> Hi Oleksandr,
>>
>>
>> Hi, Julien
>>
>>
>> Thank you for the prompt reply and explanations.
>>
>>
>>>
>>>>
>>>> On 02.08.21 22:12, Oleksandr wrote:
>>>> I have done some experiments with Xen and toolstack according to 
>>>> the discussion above. So, I re-used DTB to pass a safe range to the 
>>>> domain. For the range I borrowed some space from the second RAM bank.
>>>>
>>>> -#define GUEST_RAM1_BASE   xen_mk_ullong(0x0200000000) /* 1016GB of 
>>>> RAM @ 8GB */
>>>> -#define GUEST_RAM1_SIZE   xen_mk_ullong(0xfe00000000)
>>>> +#define GUEST_RAM1_BASE   xen_mk_ullong(0x0200000000) /* 888GB of 
>>>> RAM @ 8GB */
>>>> +#define GUEST_RAM1_SIZE   xen_mk_ullong(0xDE00000000)
>>>> +
>>>
>>> I am a bit split with reducing the amount of RAM. On one hand large 
>>> guest is not unheard on the server side (at least in the x86 world). 
>>> On the other hand, I am not aware of anyone using Xen on Arm in such 
>>> setup.
>>>
>>> So technically this will be a regression, but it may be OK.
>>
>> I got it.
>>
>>
>>>
>>>
>>> Regarding the range, this will be a problem as Xen configures the 
>>> number of the IPA bits based on the PA bits. The lowest possible 
>>> address space size on 64-bit is 4GB.
>>>
>>> From my understanding, this is because the number of IPA bits 
>>> supported is constrained by the PA bits. So the position and the size 
>>> of the region
>>> would need to depend on the P2M configuration.
>>
>> Indeed, I missed these bits that IPA bits on Arm64 might be < 40 bit, 
>> I remember, we select p2m_ipa_bits in setup_virt_paging() depending 
>> on pabits, moreover the p2m_ipa_bits might be even restricted by some 
>> external entity (IOMMU, if P2M is shared).
>>
>>
>>>
>>> For simplicity, this could be the last few X bytes of the supported 
>>> address space.
>> ok, agree. To summarize, so it sounds like we can't use the fixed 
>> safe range as in my example, it must be variable. Well, I hope, we 
>> will be able to achieve this without reducing the total amount of 
>> domain RAM in front (GUEST_RAM1_SIZE). After all, we know the IPA 
>> size and the domain RAM in advance, so we certainly can choose the 
>> start and size of the range. In case, we won't be able to find a 
>> suitable large chunk (for example, when IPA bits = 32, and domain has 
>> a lot of RAM assigned and as the result - almost all address space 
>> below 4GB is in use), we won't expose a safe range to a domain at 
>> all, and domain will just fall back to use real pages instead 
>> (actually, how it currently behaves on Arm).
>
> I think it would be fine for a first approach. We can refine it in the 
> future. What matters is that we correctly define the binding/hypercall.

ok


>
>>
>> A side note: we would likely need the toolstack (that generates 
>> device-tree for guests) to query IPA size, or similar.
>
> I think we can use XEN_SYSCTL_* (or possibly hypfs) for that.

probably yes, need to investigate


>
>
>>>
>>>
>>> For 32-bit domain, we also need to make sure the address is usable 
>>> for domain short page tables (not too long ago Debian was shipping 
>>> the kernel with them rather than LPAE). I haven't yet checked what's 
>>> the limit here.
>>
>> Hmm, I didn't take this use-case into account. So, I assume we 
>> need the safe range to be located below 4GB if is_32bit_domain() 
>> returns true.
>
> Yes. Or we can say that if you are using a 32-bit domain then we don't 
>>> (yet) support a safe range for it.

yes, as a variant


>
>>> So we would need some heuristic to decide whether to steal some RAM 
>>> or use the safe space.
>>> Another possibility would be to add a new compatible in the DT that 
>>> indicates the region is "big" enough.
>> I like the last idea, did you perhaps mean new property (optional) 
>> rather than new compatible? Let's say "xen, safe-range" or "xen, 
>> extended-regions"  ...
>
> I actually meant adding an extra compatible because this is 
> technically a change of the binding (even though it is backward 
> compatible).
>
> Although, I would be OK with the property. You may first want to ask 
> the Device-Tree folks how they expect a binding to be extended in a 
> backward compatible way.

makes sense, will clarify these bits


>
> Cheers,
>
Oleksandr Tyshchenko Aug. 5, 2021, 9:56 p.m. UTC | #23
On 05.08.21 19:37, Daniel P. Smith wrote:
> Hey Oleksandr!

Hi Daniel.


>
> On 8/5/21 11:59 AM, Oleksandr wrote:
>> On 05.08.21 18:03, Daniel P. Smith wrote:
>>
>> Hi Daniel.
>>
>>> On 7/28/21 12:18 PM, Oleksandr Tyshchenko wrote:
>>>> From: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
>>> ...
>>>
>>>> diff --git a/xen/common/memory.c b/xen/common/memory.c
>>>> index e07bd9a..3f9b816 100644
>>>> --- a/xen/common/memory.c
>>>> +++ b/xen/common/memory.c
>>>> @@ -1811,6 +1811,62 @@ long do_memory_op(unsigned long cmd,
>>>> XEN_GUEST_HANDLE_PARAM(void) arg)
>>>>                start_extent);
>>>>            break;
>>>>    +    case XENMEM_get_unallocated_space:
>>>> +    {
>>>> +        struct xen_get_unallocated_space xgus;
>>>> +        struct xen_unallocated_region *regions;
>>>> +
>>>> +        if ( unlikely(start_extent) )
>>>> +            return -EINVAL;
>>>> +
>>>> +        if ( copy_from_guest(&xgus, arg, 1) ||
>>>> +             !guest_handle_okay(xgus.buffer, xgus.nr_regions) )
>>>> +            return -EFAULT;
>>>> +
>>>> +        d = rcu_lock_domain_by_any_id(xgus.domid);
>>>> +        if ( d == NULL )
>>>> +            return -ESRCH;
>>>> +
>>>> +        rc = xsm_get_unallocated_space(XSM_HOOK, d);
>>> Not sure if you are aware but XSM_HOOK is a no-op check, meaning that
>>> you are allowing any domain to do this operation on any other domain. In
>>> most cases there is an XSM check at the beginning of the hypercall
>>> processing to do an initial clamp down but I am pretty sure there is no
>>> prior XSM check on this path. Based on my understanding of how this is
>>> intended, which may be incorrect, but I think you would actually want
>>> XSM_TARGET.the
>> Thank you for pointing this out.
>> I am aware what the XSM_HOOK is, but I was thinking what the default
>> action would be better suited for that hypercall, and failed to think of
>> a better alternative.
>> I was going to choose XSM_TARGET, but the description "/* Can perform on
>> self or your target domain */" confused me a bit, as there was no target
>> domain involved as I thought, XSM_PRIV
>> sounded too strictly to me, etc. So, I decided to leave a "hook" for the
>> RFC version. But, now I see that XSM_TARGET might be indeed better
>> choice across all possible variants.
> If you unravel the craftiness that is xsm_default_action, there is
> actually a bit of hierarchy there. If you set the default_action to
> XSM_TARGET, it will first check if calling domain(src) is the target,
> then falls into the XSM_DM_PRIV check which is if src->target == target,
> and then finally checks if is_control_domain(src). That will constrict
> the operation so that a domain can call it on itself, a device model
> domain (stubdom) can call it on the domain it is backing, and the
> control domain can make the call. I am not a 100% sure on this but I do
> not believe a hardware domain would be able to make the call with it set
> to XSM_TARGET and not employing Flask. Hope this helps.

This definitely helps. Thank you for the explanation.
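
Just to double-check my understanding, below is a minimal sketch of what
the change could look like if XSM_TARGET is adopted (the dummy hook
simply mirrors the shape of the existing helpers in
xen/include/xsm/dummy.h, so please treat it as an illustration only):

/* xen/include/xsm/dummy.h -- sketch only */
static XSM_INLINE int xsm_get_unallocated_space(XSM_DEFAULT_ARG
                                                struct domain *d)
{
    XSM_ASSERT_ACTION(XSM_TARGET);
    return xsm_default_action(action, current->domain, d);
}

/* xen/common/memory.c -- the call site would then pass XSM_TARGET */
rc = xsm_get_unallocated_space(XSM_TARGET, d);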


>
>
>
Stefano Stabellini Aug. 6, 2021, 12:20 a.m. UTC | #24
On Thu, 5 Aug 2021, Julien Grall wrote:
> Hi Oleksandr,
> 
> On 05/08/2021 15:52, Oleksandr wrote:
> > 
> > On 05.08.21 01:00, Julien Grall wrote:
> > > 
> > > 
> > > On 04/08/2021 21:56, Oleksandr wrote:
> > > > 
> > > > Hi Julien, Stefano.
> > > 
> > > Hi Oleksandr,
> > 
> > 
> > Hi, Julien
> > 
> > 
> > Thank you for the prompt reply and explanations.
> > 
> > 
> > > 
> > > > 
> > > > On 02.08.21 22:12, Oleksandr wrote:
> > > > I have done some experiments with Xen and toolstack according to the
> > > > discussion above. So, I re-used DTB to pass a safe range to the domain.
> > > > For the range I borrowed some space from the second RAM bank.
> > > > 
> > > > -#define GUEST_RAM1_BASE   xen_mk_ullong(0x0200000000) /* 1016GB of RAM
> > > > @ 8GB */
> > > > -#define GUEST_RAM1_SIZE   xen_mk_ullong(0xfe00000000)
> > > > +#define GUEST_RAM1_BASE   xen_mk_ullong(0x0200000000) /* 888GB of RAM @
> > > > 8GB */
> > > > +#define GUEST_RAM1_SIZE   xen_mk_ullong(0xDE00000000)
> > > > +
> > > 
> > > I am a bit split with reducing the amount of RAM. On one hand large guest
> > > is not unheard on the server side (at least in the x86 world). On the
> > > other hand, I am not aware of anyone using Xen on Arm in such setup.
> > > 
> > > So technically this will be a regression, but it may be OK.
> > 
> > I got it.
> > 
> > 
> > > 
> > > 
> > > Regarding the range, this will be a problem as Xen configure the number of
> > > the IPA bits based on the PA bits. The lowest possible address space ize
> > > on 64-bit is 4GB.
> > > 
> > > From my understanding, this is because the number of IPA bits supported is
> > > contrained by the PA bits. So the position and the size of the region
> > > would need to depend on the P2M configuration.
> > 
> > Indeed, I missed these bits that IPA bits on Arm64 might be < 40 bit, I
> > remember, we select p2m_ipa_bits in setup_virt_paging() depending on pabits,
> > moreover the p2m_ipa_bits might be even restricted by some external entity
> > (IOMMU, if P2M is shared).
> > 
> > 
> > > 
> > > For simplicity, this could be the last few X bytes of the supported
> > > address space.
> > ok, agree. To summarize, so it sounds like we can't use the fixed safe range
> > as in my example, it must be variable. Well, I hope, we will be able to
> > achieve this without reducing the total amount of domain RAM in front
> > (GUEST_RAM1_SIZE). After all, we know the IPA size and the domain RAM in
> > advance, so we certainly can choose the start and size of the range. In
> > case, we won't be able to find a suitable large chunk (for example, when IPA
> > bits = 32, and domain has a lot of RAM assigned and as the result - almost
> > all address space below 4GB is in use), we won't expose a safe range to a
> > domain at all, and domain will just fall back to use real pages instead
> > (actually, how it currently behaves on Arm).
> 
> I think it would be fine for a first approach. We can refine it in the future.
> What matters is that we correctly define the binding/hypercall.

I agree with Julien on both points. Looking at the existing device tree
binding, it is almost exactly what we need, so I think it would be OK to
use it.


> > > For 32-bit domain, we also need to make sure the address is usable for
> > > domain short page tables (not too long ago Debian was shipping the kernel
> > > with them rather than LPAE). I haven't yet checked what's the limit here.
> > 
> > Hmm, I didn't take this use-case into the account. So, I assume we need the
> > safe range to be located below 4GB if is_32bit_domain() returns true.
> 
> Yes. Or we can say that if you are using a 32-bit domain then we don't (yet)
> support a safe range for range.
> > > So we would need some heuristic to decide whether to stole some RAM or use
> > > the safe space.
> > > Another possibility would be to add a new compatible in the DT that
> > > indicates the region is "big" enough.
> > I like the last idea, did you perhaps mean new property (optional) rather
> > than new compatible? Let's say "xen, safe-range" or "xen, extended-regions" 
> > ...
> 
> I actually meant adding an extra compatible because this is technically a
> change of the binding (even though it is backward compatible).
> 
> Although, I would be OK with the property. You may first want to ask the
> Device-Tree folks how they expect a binding to be extended in a backward
> compatible way.

I think we should expand the description in
Documentation/devicetree/bindings/arm/xen.txt to cover things other than
the grant table.

I don't think we necessarily need a new compatible string because, as you
said, the change is fully compatible. At the same time it is trivial to
add compatible strings, so that could be done as well.
Stefano Stabellini Aug. 6, 2021, 12:30 a.m. UTC | #25
On Wed, 4 Aug 2021, Julien Grall wrote:
> > +#define GUEST_SAFE_RANGE_BASE   xen_mk_ullong(0xDE00000000) /* 128GB */
> > +#define GUEST_SAFE_RANGE_SIZE   xen_mk_ullong(0x0200000000)
> > 
> > While the possible new DT bindings has not been agreed yet, I re-used
> > existing "reg" property under the hypervisor node to pass safe range as a
> > second region,
> > https://elixir.bootlin.com/linux/v5.14-rc4/source/Documentation/devicetree/bindings/arm/xen.txt#L10: 
> 
> So a single region works for a guest today, but for dom0 we will need multiple
> regions because it is may be difficult to find enough contiguous space for a
> single region.
> 
> That said, as dom0 is mapped 1:1 (including some guest mapping), there is also
> the question where to allocate the safe region. For grant table, we so far
> re-use the Xen address space because it is assumed it will space will always
> be bigger than the grant table.
> 
> I am not sure yet where we could allocate the safe regions. Stefano, do you
> have any ideas?

The safest choice would be the address range corresponding to memory
(/memory) not already allocated to Dom0.

For instance from my last boot logs:
(XEN) Allocating 1:1 mappings totalling 1600MB for dom0:
(XEN) BANK[0] 0x00000010000000-0x00000070000000 (1536MB)
(XEN) BANK[1] 0x00000078000000-0x0000007c000000 (64MB)

All the other ranges could be given as unallocated space:

- 0x0 - 0x10000000
- 0x70000000 - 0x78000000
- 0x8_0000_0000 - 0x8_8000_0000



The second best choice would be a hole: an address range not used by
anybody else (no reg property) and also not even mappable by a bus (not
covered by a ranges property). This is not the best choice because there
can be cases where physical resources appear afterwards.
Jan Beulich Aug. 6, 2021, 6:09 a.m. UTC | #26
On 05.08.2021 18:37, Daniel P. Smith wrote:
> On 8/5/21 11:59 AM, Oleksandr wrote:
>> On 05.08.21 18:03, Daniel P. Smith wrote:
>>> On 7/28/21 12:18 PM, Oleksandr Tyshchenko wrote:
>>>> --- a/xen/common/memory.c
>>>> +++ b/xen/common/memory.c
>>>> @@ -1811,6 +1811,62 @@ long do_memory_op(unsigned long cmd,
>>>> XEN_GUEST_HANDLE_PARAM(void) arg)
>>>>               start_extent);
>>>>           break;
>>>>   +    case XENMEM_get_unallocated_space:
>>>> +    {
>>>> +        struct xen_get_unallocated_space xgus;
>>>> +        struct xen_unallocated_region *regions;
>>>> +
>>>> +        if ( unlikely(start_extent) )
>>>> +            return -EINVAL;
>>>> +
>>>> +        if ( copy_from_guest(&xgus, arg, 1) ||
>>>> +             !guest_handle_okay(xgus.buffer, xgus.nr_regions) )
>>>> +            return -EFAULT;
>>>> +
>>>> +        d = rcu_lock_domain_by_any_id(xgus.domid);
>>>> +        if ( d == NULL )
>>>> +            return -ESRCH;
>>>> +
>>>> +        rc = xsm_get_unallocated_space(XSM_HOOK, d);
>>> Not sure if you are aware but XSM_HOOK is a no-op check, meaning that
>>> you are allowing any domain to do this operation on any other domain. In
>>> most cases there is an XSM check at the beginning of the hypercall
>>> processing to do an initial clamp down but I am pretty sure there is no
>>> prior XSM check on this path. Based on my understanding of how this is
>>> intended, which may be incorrect, but I think you would actually want
>>> XSM_TARGET.the
>> Thank you for pointing this out.
>> I am aware what the XSM_HOOK is, but I was thinking what the default
>> action would be better suited for that hypercall, and failed to think of
>> a better alternative.
>> I was going to choose XSM_TARGET, but the description "/* Can perform on
>> self or your target domain */" confused me a bit, as there was no target
>> domain involved as I thought, XSM_PRIV
>> sounded too strictly to me, etc. So, I decided to leave a "hook" for the
>> RFC version. But, now I see that XSM_TARGET might be indeed better
>> choice across all possible variants.
> 
> If you unravel the craftiness that is xsm_default_action, there is
> actually a bit of hierarchy there. If you set the default_action to
> XSM_TARGET, it will first check if calling domain(src) is the target,
> then falls into the XSM_DM_PRIV check which is if src->target == target,
> and then finally checks if is_control_domain(src). That will constrict
> the operation so that a domain can call it on itself, a device model
> domain (stubdom) can call it on the domain it is backing, and the
> control domain can make the call. I am not a 100% sure on this but I do
> not believe a hardware domain would be able to make the call with it set
> to XSM_TARGET and not employing Flask.

Afaict (perhaps leaving aside late-hwdom, which I have little knowledge
of) right now we have is_control_domain(d) == is_hardware_domain(d).

Jan
Daniel P. Smith Aug. 6, 2021, 3:08 p.m. UTC | #27
On 8/6/21 2:09 AM, Jan Beulich wrote:
> On 05.08.2021 18:37, Daniel P. Smith wrote:
>> On 8/5/21 11:59 AM, Oleksandr wrote:
>>> On 05.08.21 18:03, Daniel P. Smith wrote:
>>>> On 7/28/21 12:18 PM, Oleksandr Tyshchenko wrote:
>>>>> --- a/xen/common/memory.c
>>>>> +++ b/xen/common/memory.c
>>>>> @@ -1811,6 +1811,62 @@ long do_memory_op(unsigned long cmd,
>>>>> XEN_GUEST_HANDLE_PARAM(void) arg)
>>>>>               start_extent);
>>>>>           break;
>>>>>   +    case XENMEM_get_unallocated_space:
>>>>> +    {
>>>>> +        struct xen_get_unallocated_space xgus;
>>>>> +        struct xen_unallocated_region *regions;
>>>>> +
>>>>> +        if ( unlikely(start_extent) )
>>>>> +            return -EINVAL;
>>>>> +
>>>>> +        if ( copy_from_guest(&xgus, arg, 1) ||
>>>>> +             !guest_handle_okay(xgus.buffer, xgus.nr_regions) )
>>>>> +            return -EFAULT;
>>>>> +
>>>>> +        d = rcu_lock_domain_by_any_id(xgus.domid);
>>>>> +        if ( d == NULL )
>>>>> +            return -ESRCH;
>>>>> +
>>>>> +        rc = xsm_get_unallocated_space(XSM_HOOK, d);
>>>> Not sure if you are aware but XSM_HOOK is a no-op check, meaning that
>>>> you are allowing any domain to do this operation on any other domain. In
>>>> most cases there is an XSM check at the beginning of the hypercall
>>>> processing to do an initial clamp down but I am pretty sure there is no
>>>> prior XSM check on this path. Based on my understanding of how this is
>>>> intended, which may be incorrect, but I think you would actually want
>>>> XSM_TARGET.the
>>> Thank you for pointing this out.
>>> I am aware what the XSM_HOOK is, but I was thinking what the default
>>> action would be better suited for that hypercall, and failed to think of
>>> a better alternative.
>>> I was going to choose XSM_TARGET, but the description "/* Can perform on
>>> self or your target domain */" confused me a bit, as there was no target
>>> domain involved as I thought, XSM_PRIV
>>> sounded too strictly to me, etc. So, I decided to leave a "hook" for the
>>> RFC version. But, now I see that XSM_TARGET might be indeed better
>>> choice across all possible variants.
>>
>> If you unravel the craftiness that is xsm_default_action, there is
>> actually a bit of hierarchy there. If you set the default_action to
>> XSM_TARGET, it will first check if calling domain(src) is the target,
>> then falls into the XSM_DM_PRIV check which is if src->target == target,
>> and then finally checks if is_control_domain(src). That will constrict
>> the operation so that a domain can call it on itself, a device model
>> domain (stubdom) can call it on the domain it is backing, and the
>> control domain can make the call. I am not a 100% sure on this but I do
>> not believe a hardware domain would be able to make the call with it set
>> to XSM_TARGET and not employing Flask.
> 
> Afaict (perhaps leaving aside late-hwdom, which I have little knowledge
> of) right now we have is_control_domain(d) == is_hardware_domain(d).

That is my fault for not being more explicit. When I refer to a
"hardware domain" I am referring to what you call "late-hwdom". When a
hardware domain that is not dom0 is constructed, it does not get the
`is_privileged` flag set to true, so is_control_domain(d) returns false
for it. Additionally, there is currently no `enum xsm_default` value for
hardware domain access/privilege, and thus no rule/access check defined
in `default_action()` that allows any of the XSM hooks to be restricted
to the hardware domain. That is what I was referring to regarding the
use of the hardware domain, aka late-hwdom, without Flask. With Flask it
becomes possible for the hardware domain to be granted access to calls
that are reserved to the control domain under the dummy/default access
policy, thus allowing it to function fully.

dps
Oleksandr Tyshchenko Aug. 6, 2021, 6:03 p.m. UTC | #28
Hi Stefano


On 06.08.21 03:20, Stefano Stabellini wrote:
> On Thu, 5 Aug 2021, Julien Grall wrote:
>> Hi Oleksandr,
>>
>> On 05/08/2021 15:52, Oleksandr wrote:
>>> On 05.08.21 01:00, Julien Grall wrote:
>>>>
>>>> On 04/08/2021 21:56, Oleksandr wrote:
>>>>> Hi Julien, Stefano.
>>>> Hi Oleksandr,
>>>
>>> Hi, Julien
>>>
>>>
>>> Thank you for the prompt reply and explanations.
>>>
>>>
>>>>> On 02.08.21 22:12, Oleksandr wrote:
>>>>> I have done some experiments with Xen and toolstack according to the
>>>>> discussion above. So, I re-used DTB to pass a safe range to the domain.
>>>>> For the range I borrowed some space from the second RAM bank.
>>>>>
>>>>> -#define GUEST_RAM1_BASE   xen_mk_ullong(0x0200000000) /* 1016GB of RAM
>>>>> @ 8GB */
>>>>> -#define GUEST_RAM1_SIZE   xen_mk_ullong(0xfe00000000)
>>>>> +#define GUEST_RAM1_BASE   xen_mk_ullong(0x0200000000) /* 888GB of RAM @
>>>>> 8GB */
>>>>> +#define GUEST_RAM1_SIZE   xen_mk_ullong(0xDE00000000)
>>>>> +
>>>> I am a bit split with reducing the amount of RAM. On one hand large guest
>>>> is not unheard on the server side (at least in the x86 world). On the
>>>> other hand, I am not aware of anyone using Xen on Arm in such setup.
>>>>
>>>> So technically this will be a regression, but it may be OK.
>>> I got it.
>>>
>>>
>>>>
>>>> Regarding the range, this will be a problem as Xen configure the number of
>>>> the IPA bits based on the PA bits. The lowest possible address space ize
>>>> on 64-bit is 4GB.
>>>>
>>>>  From my understanding, this is because the number of IPA bits supported is
>>>> contrained by the PA bits. So the position and the size of the region
>>>> would need to depend on the P2M configuration.
>>> Indeed, I missed these bits that IPA bits on Arm64 might be < 40 bit, I
>>> remember, we select p2m_ipa_bits in setup_virt_paging() depending on pabits,
>>> moreover the p2m_ipa_bits might be even restricted by some external entity
>>> (IOMMU, if P2M is shared).
>>>
>>>
>>>> For simplicity, this could be the last few X bytes of the supported
>>>> address space.
>>> ok, agree. To summarize, so it sounds like we can't use the fixed safe range
>>> as in my example, it must be variable. Well, I hope, we will be able to
>>> achieve this without reducing the total amount of domain RAM in front
>>> (GUEST_RAM1_SIZE). After all, we know the IPA size and the domain RAM in
>>> advance, so we certainly can choose the start and size of the range. In
>>> case, we won't be able to find a suitable large chunk (for example, when IPA
>>> bits = 32, and domain has a lot of RAM assigned and as the result - almost
>>> all address space below 4GB is in use), we won't expose a safe range to a
>>> domain at all, and domain will just fall back to use real pages instead
>>> (actually, how it currently behaves on Arm).
>> I think it would be fine for a first approach. We can refine it in the future.
>> What matters is that we correctly define the binding/hypercall.
> I agree with Julien on both points. Looking at the existing device tree
> binding, it is almost exactly what we need, so I think it would be OK to
> use it.

ok, great.


>
>
>>>> For 32-bit domain, we also need to make sure the address is usable for
>>>> domain short page tables (not too long ago Debian was shipping the kernel
>>>> with them rather than LPAE). I haven't yet checked what's the limit here.
>>> Hmm, I didn't take this use-case into the account. So, I assume we need the
>>> safe range to be located below 4GB if is_32bit_domain() returns true.
>> Yes. Or we can say that if you are using a 32-bit domain then we don't (yet)
>> support a safe range for range.
>>>> So we would need some heuristic to decide whether to stole some RAM or use
>>>> the safe space.
>>>> Another possibility would be to add a new compatible in the DT that
>>>> indicates the region is "big" enough.
>>> I like the last idea, did you perhaps mean new property (optional) rather
>>> than new compatible? Let's say "xen, safe-range" or "xen, extended-regions"
>>> ...
>> I actually meant adding an extra compatible because this is technically a
>> change of the binding (even though it is backward compatible).
>>
>> Although, I would be OK with the property. You may first want to ask the
>> Device-Tree folks how they expect a binding to be extended in a backward
>> compatible way.
> I think we should expand the description in
> Documentation/devicetree/bindings/arm/xen.txt to cover things other than
> the grant table.

yes


>
> I don't think we necessarely need a new compatible string because as you
> said the change is fully compatible. At the same time it is trivial to
> add compatible strings, so that could be done as well.

ok, please note that I have already asked the DT folks for clarification 
(on how to do this properly), so let's wait for their input.
Oleksandr Tyshchenko Aug. 7, 2021, 5:03 p.m. UTC | #29
On 06.08.21 03:30, Stefano Stabellini wrote:

Hi Stefano

> On Wed, 4 Aug 2021, Julien Grall wrote:
>>> +#define GUEST_SAFE_RANGE_BASE   xen_mk_ullong(0xDE00000000) /* 128GB */
>>> +#define GUEST_SAFE_RANGE_SIZE   xen_mk_ullong(0x0200000000)
>>>
>>> While the possible new DT bindings has not been agreed yet, I re-used
>>> existing "reg" property under the hypervisor node to pass safe range as a
>>> second region,
>>> https://elixir.bootlin.com/linux/v5.14-rc4/source/Documentation/devicetree/bindings/arm/xen.txt#L10:
>> So a single region works for a guest today, but for dom0 we will need multiple
>> regions because it is may be difficult to find enough contiguous space for a
>> single region.
>>
>> That said, as dom0 is mapped 1:1 (including some guest mapping), there is also
>> the question where to allocate the safe region. For grant table, we so far
>> re-use the Xen address space because it is assumed it will space will always
>> be bigger than the grant table.
>>
>> I am not sure yet where we could allocate the safe regions. Stefano, do you
>> have any ideas?
> The safest choice would be the address range corresponding to memory
> (/memory) not already allocated to Dom0.
>
> For instance from my last boot logs:
> (XEN) Allocating 1:1 mappings totalling 1600MB for dom0:
> (XEN) BANK[0] 0x00000010000000-0x00000070000000 (1536MB)
> (XEN) BANK[1] 0x00000078000000-0x0000007c000000 (64MB)
>
> All the other ranges could be given as unallocated space:
>
> - 0x0 - 0x10000000
> - 0x70000000 - 0x78000000
> - 0x8_0000_0000 - 0x8_8000_0000

Thank you for the ideas.

If I got the idea correctly: yes, these ranges represent real RAM, so no 
I/O would be in conflict with them and, as a result, no overlaps would 
be expected.
But I wonder, would this work if we have the IOMMU enabled for Dom0 and 
need to establish the 1:1 mapping for the DMA devices to work with grant 
mappings...
In arm_iommu_map_page() we call guest_physmap_add_entry() with gfn = 
mfn, so the question is whether we could end up with this new gfn 
replacing a valid mapping
(one whose gfn was allocated from the safe region)?


>
>
>
> The second best choice would be an hole: an address range not used by
> anybody else (no reg property) and also not even mappable by a bus (not
> covered by a ranges property). This is not the best choice because there
> can cases where physical resources appear afterwards.

Unfortunately, yes.
Julien Grall Aug. 9, 2021, 2:51 p.m. UTC | #30
Hi,

I am writing down here what we discussed on another thread and on IRC. 
This will be easier to track in a single thread.

On 04/08/2021 23:00, Julien Grall wrote:
> On 04/08/2021 21:56, Oleksandr wrote:
>> Now, I am wondering, would it be possible to update/clarify the 
>> current "reg" purpose and use it to pass a safe unallocated space for 
>> any Xen specific mappings (grant, foreign, whatever) instead of just 
>> for the grant table region. In case, it is not allowed for any reason 
>> (compatibility PoV, etc), would it be possible to extend a property by 
>> passing an extra range separately, something similar to how I 
>> described above?
> 
> I think it should be fine to re-use the same region so long the size of 
> the first bank is at least the size of the original region.

While answering the DT binding question on the DT ML, I realized that 
this is probably not going to be fine because there is a bug in Xen when 
mapping a grant-table frame.

The function gnttab_map_frame() is used to map the grant table frame. If 
there is an old mapping, it will first remove it.

The function uses the helper gnttab_get_frame_gfn() to find the 
corresponding GFN, or INVALID_GFN if the frame is not mapped.

On Arm, gnttab_get_frame_gfn() is implemented using an array indexed by 
the grant-table frame number. The trouble is we don't update the array 
when the page is unmapped. So if the GFN is re-used before the 
grant-table frame is remapped, then we will end up removing whatever was 
mapped there (this could be a foreign page...).

This behavior already happens today as the toolstack will use the first 
GFN of the region if Linux doesn't support the acquire resource 
interface. We are getting away with it in Linux because the toolstack 
only maps the first grant table frame and:
  - Newer Linux will not use the region provided by the DT and nothing 
will be mapped there.
  - Older Linux will use the region but still map grant table frame 0 
to the same GFN.

I am not sure about U-boot and other OSes here.

This is not new, but it is going to become a bigger source of problems 
(read: more chances to hit it) as we try to re-use the first region.

This means the first region should be used exclusively for the 
grant-table (in a specific order) until the issue is properly fixed.

A potential fix is to update the array in p2m_put_l3_page(). The default 
max size of the array is 1024, so it might be fine to just walk it (it 
would be simply a comparison).
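
Very roughly, I am thinking of something along these lines (the field
names -- gt->arch.shared_gfn / gt->arch.status_gfn -- and the frame
count helpers are from memory, so please treat this as a sketch of the
idea rather than a proposed patch):

/* Somewhere in the grant-table code, callable from the Arm P2M code. */
static void gnttab_drop_frame_gfn(struct domain *d, gfn_t gfn)
{
    struct grant_table *gt = d->grant_table;
    unsigned int i;

    /* Invalidate any stale bookkeeping entry still pointing at gfn. */
    for ( i = 0; i < nr_grant_frames(gt); i++ )
        if ( gfn_eq(gt->arch.shared_gfn[i], gfn) )
            gt->arch.shared_gfn[i] = INVALID_GFN;

    for ( i = 0; i < nr_status_frames(gt); i++ )
        if ( gfn_eq(gt->arch.status_gfn[i], gfn) )
            gt->arch.status_gfn[i] = INVALID_GFN;
}

The P2M removal path would then call it with the GFN being dropped.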

Note that this is not a problem on x86 because it is using the M2P. So 
when a mapping is removed, the MFN -> GFN mapping will also be removed.

Cheers,
Julien Grall Aug. 9, 2021, 3:42 p.m. UTC | #31
Hi Oleksandr,

On 07/08/2021 18:03, Oleksandr wrote:
> 
> On 06.08.21 03:30, Stefano Stabellini wrote:
> 
> Hi Stefano
> 
>> On Wed, 4 Aug 2021, Julien Grall wrote:
>>>> +#define GUEST_SAFE_RANGE_BASE   xen_mk_ullong(0xDE00000000) /* 
>>>> 128GB */
>>>> +#define GUEST_SAFE_RANGE_SIZE   xen_mk_ullong(0x0200000000)
>>>>
>>>> While the possible new DT bindings has not been agreed yet, I re-used
>>>> existing "reg" property under the hypervisor node to pass safe range 
>>>> as a
>>>> second region,
>>>> https://elixir.bootlin.com/linux/v5.14-rc4/source/Documentation/devicetree/bindings/arm/xen.txt#L10: 
>>>>
>>> So a single region works for a guest today, but for dom0 we will need 
>>> multiple
>>> regions because it is may be difficult to find enough contiguous 
>>> space for a
>>> single region.
>>>
>>> That said, as dom0 is mapped 1:1 (including some guest mapping), 
>>> there is also
>>> the question where to allocate the safe region. For grant table, we 
>>> so far
>>> re-use the Xen address space because it is assumed it will space will 
>>> always
>>> be bigger than the grant table.
>>>
>>> I am not sure yet where we could allocate the safe regions. Stefano, 
>>> do you
>>> have any ideas?
>> The safest choice would be the address range corresponding to memory
>> (/memory) not already allocated to Dom0.
>>
>> For instance from my last boot logs:
>> (XEN) Allocating 1:1 mappings totalling 1600MB for dom0:
>> (XEN) BANK[0] 0x00000010000000-0x00000070000000 (1536MB)
>> (XEN) BANK[1] 0x00000078000000-0x0000007c000000 (64MB)
>>
>> All the other ranges could be given as unallocated space:
>>
>> - 0x0 - 0x10000000
>> - 0x70000000 - 0x78000000
>> - 0x8_0000_0000 - 0x8_8000_0000
> 
> Thank you for the ideas.
> 
> If I got the idea correctly, yes, as these ranges represent the real 
> RAM, so no I/O would be in conflict with them and as the result - no 
> overlaps would be expected.
> But, I wonder, would this work if we have IOMMU enabled for Dom0 and 
> need to establish 1:1 mapping for the DMA devices to work with grant 
> mappings...
> In arm_iommu_map_page() we call guest_physmap_add_entry() with gfn = 
> mfn, so the question is could we end up with this new gfn replacing the 
> valid mapping
> (with gfn allocated from the safe region)?

Right, when we enable the IOMMU for dom0, Xen will add an extra mapping 
with GFN == MFN for foreign and grant pages. This is because Linux is 
not aware whether a device is protected by an IOMMU. Therefore it 
assumes it is not and will use the MFN to configure DMA transactions.

We can't remove the mapping without significant changes in Linux and 
Xen. I would not mandate them for this work.

That said, I think it would be acceptable to have a different way to 
find the region depending on the dom0 configuration. So we could use 
the RAM not used by dom0 when the IOMMU is turned off.

>> The second best choice would be an hole: an address range not used by
>> anybody else (no reg property) and also not even mappable by a bus (not
>> covered by a ranges property). This is not the best choice because there
>> can cases where physical resources appear afterwards.

Are you saying that the original device-tree doesn't even describe them 
in any way (i.e. reserved...)?

> 
> Unfortunately, yes.

So the decision on where the safe region is located will be made by 
Xen. There is no involvement of the domain (it will discover the region 
from the DT). Therefore, I don't think we need to think about everything 
right now, as we can adapt this: the exact region is not part of the 
stable ABI.

Hotplug is one case I would defer, because it is not supported (and 
quite likely not working) in Xen upstream today.

Now regarding the case where dom0 is using the IOMMU: the assumption is 
that Xen will be able to figure out all the regions used from the 
firmware tables (ACPI or DT).

AFAIK, this assumption would be correct for DT. However, for ACPI, I 
remember we were not able to find all the MMIO regions in Xen (see [1] 
and [2]). So even this solution would not work for ACPI.

If I am not mistaken, we don't support IOMMU with ACPI yet. So we could 
defer the problem to when this is going to be supported.

Cheers,

[1] https://marc.info/?l=linux-arm-kernel&m=148469169210500&w=2
[2] Xen commit 80f9c316708400cea4417e36337267d3b26591db
Oleksandr Tyshchenko Aug. 9, 2021, 5:14 p.m. UTC | #32
On 09.08.21 17:51, Julien Grall wrote:
> Hi,


Hi Julien.


>
> I am writing down here what we discussed on another thread and on IRC. 
> This will be easier to track in a single thread.
>
> On 04/08/2021 23:00, Julien Grall wrote:
>> On 04/08/2021 21:56, Oleksandr wrote:
>>> Now, I am wondering, would it be possible to update/clarify the 
>>> current "reg" purpose and use it to pass a safe unallocated space 
>>> for any Xen specific mappings (grant, foreign, whatever) instead of 
>>> just for the grant table region. In case, it is not allowed for any 
>>> reason (compatibility PoV, etc), would it be possible to extend a 
>>> property by passing an extra range separately, something similar to 
>>> how I described above?
>>
>> I think it should be fine to re-use the same region so long the size 
>> of the first bank is at least the size of the original region.
>
> While answering to the DT binding question on the DT ML, I realized 
> that this is probably not going to be fine because there is a bug in 
> Xen when mapping grant-table frame.
>
> The function gnttab_map_frame() is used to map the grant table frame. 
> If there is an old mapping, it will first remove it.
>
> The function is using the helper gnttab_map_frame() to find the 
> corresponding GFN or return INVALID_GFN if not mapped.
>
> On Arm, gnttab_map_frame() is implementing using an array index by the 
> grant table frame number. The trouble is we don't update the array 
> when the page is unmapped. So if the GFN is re-used before the 
> grant-table is remapped, then we will end up to remove whatever was 
> mapped there (this could be a foreign page...).
>
> This behavior already happens today as the toolstack will use the 
> first GFN of the region if Linux doesn't support the acquire resource 
> interface. We are getting away in the Linux because the toolstack only 
> map the first grant table frame and:
>  - Newer Linux will not used the region provided by the DT and nothing 
> will be mapped there.
>  - Older Linux will use the region but still map the grant table frame 
> 0 to the same GFN.
>
> I am not sure about U-boot and other OSes here.
>
> This is not new but it is going to be become a bigger source of 
> problem (read more chance to hit it) as we try to re-use the first 
> region.
>
> This means the first region should exclusively used for the 
> grant-table (in a specific order) until the issue is properly fixed.

Thank you for the explanation, it is clear now.


>
>
> A potential fix is to update the array in p2m_put_l3_page(). The 
> default max size of the array is 1024, so it might be fine to just 
> walk it (it would be simply a comparison).

I think this would work. It looks like we don't need to do the walk for 
each gfn which is being freed; we could first filter by p2m_is_ram() ...


>
>
> Note that this is not a problem on x86 because the is using the M2P. 
> So when a mapping is removed, the mapping MFN -> GFN will also be 
> removed.
>
> Cheers,
>
Julien Grall Aug. 9, 2021, 5:18 p.m. UTC | #33
On 09/08/2021 18:14, Oleksandr wrote:
> 
> On 09.08.21 17:51, Julien Grall wrote:
> Hi Julien.

Hi Oleksandr,

>> I am writing down here what we discussed on another thread and on IRC. 
>> This will be easier to track in a single thread.
>>
>> On 04/08/2021 23:00, Julien Grall wrote:
>>> On 04/08/2021 21:56, Oleksandr wrote:
>>>> Now, I am wondering, would it be possible to update/clarify the 
>>>> current "reg" purpose and use it to pass a safe unallocated space 
>>>> for any Xen specific mappings (grant, foreign, whatever) instead of 
>>>> just for the grant table region. In case, it is not allowed for any 
>>>> reason (compatibility PoV, etc), would it be possible to extend a 
>>>> property by passing an extra range separately, something similar to 
>>>> how I described above?
>>>
>>> I think it should be fine to re-use the same region so long the size 
>>> of the first bank is at least the size of the original region.
>>
>> While answering to the DT binding question on the DT ML, I realized 
>> that this is probably not going to be fine because there is a bug in 
>> Xen when mapping grant-table frame.
>>
>> The function gnttab_map_frame() is used to map the grant table frame. 
>> If there is an old mapping, it will first remove it.
>>
>> The function is using the helper gnttab_map_frame() to find the 
>> corresponding GFN or return INVALID_GFN if not mapped.
>>
>> On Arm, gnttab_map_frame() is implementing using an array index by the 
>> grant table frame number. The trouble is we don't update the array 
>> when the page is unmapped. So if the GFN is re-used before the 
>> grant-table is remapped, then we will end up to remove whatever was 
>> mapped there (this could be a foreign page...).
>>
>> This behavior already happens today as the toolstack will use the 
>> first GFN of the region if Linux doesn't support the acquire resource 
>> interface. We are getting away in the Linux because the toolstack only 
>> map the first grant table frame and:
>>  - Newer Linux will not used the region provided by the DT and nothing 
>> will be mapped there.
>>  - Older Linux will use the region but still map the grant table frame 
>> 0 to the same GFN.
>>
>> I am not sure about U-boot and other OSes here.
>>
>> This is not new but it is going to be become a bigger source of 
>> problem (read more chance to hit it) as we try to re-use the first 
>> region.
>>
>> This means the first region should exclusively used for the 
>> grant-table (in a specific order) until the issue is properly fixed.
> 
> Thank you for the explanation, it is clear now.
> 
> 
>>
>>
>> A potential fix is to update the array in p2m_put_l3_page(). The 
>> default max size of the array is 1024, so it might be fine to just 
>> walk it (it would be simply a comparison).
> 
> I think, this would work. Looks like we don't need to walk for each gfn 
> which is being freed, we could just filter it by p2m_is_ram() ...

Well, this would still potentially result in a few unnecessary walks. I 
would consider introducing a new P2M type, or possibly adding a check 
whether the page is in the xenheap (grant-table frames are xenheap 
pages so far).

Cheers,
Oleksandr Tyshchenko Aug. 9, 2021, 5:49 p.m. UTC | #34
On 09.08.21 20:18, Julien Grall wrote:


Hi Julien

>
>
> On 09/08/2021 18:14, Oleksandr wrote:
>>
>> On 09.08.21 17:51, Julien Grall wrote:
>> Hi Julien.
>
> Hi Oleksandr,
>
>>> I am writing down here what we discussed on another thread and on 
>>> IRC. This will be easier to track in a single thread.
>>>
>>> On 04/08/2021 23:00, Julien Grall wrote:
>>>> On 04/08/2021 21:56, Oleksandr wrote:
>>>>> Now, I am wondering, would it be possible to update/clarify the 
>>>>> current "reg" purpose and use it to pass a safe unallocated space 
>>>>> for any Xen specific mappings (grant, foreign, whatever) instead 
>>>>> of just for the grant table region. In case, it is not allowed for 
>>>>> any reason (compatibility PoV, etc), would it be possible to 
>>>>> extend a property by passing an extra range separately, something 
>>>>> similar to how I described above?
>>>>
>>>> I think it should be fine to re-use the same region so long the 
>>>> size of the first bank is at least the size of the original region.
>>>
>>> While answering to the DT binding question on the DT ML, I realized 
>>> that this is probably not going to be fine because there is a bug in 
>>> Xen when mapping grant-table frame.
>>>
>>> The function gnttab_map_frame() is used to map the grant table 
>>> frame. If there is an old mapping, it will first remove it.
>>>
>>> The function is using the helper gnttab_map_frame() to find the 
>>> corresponding GFN or return INVALID_GFN if not mapped.
>>>
>>> On Arm, gnttab_map_frame() is implementing using an array index by 
>>> the grant table frame number. The trouble is we don't update the 
>>> array when the page is unmapped. So if the GFN is re-used before the 
>>> grant-table is remapped, then we will end up to remove whatever was 
>>> mapped there (this could be a foreign page...).
>>>
>>> This behavior already happens today as the toolstack will use the 
>>> first GFN of the region if Linux doesn't support the acquire 
>>> resource interface. We are getting away in the Linux because the 
>>> toolstack only map the first grant table frame and:
>>>  - Newer Linux will not used the region provided by the DT and 
>>> nothing will be mapped there.
>>>  - Older Linux will use the region but still map the grant table 
>>> frame 0 to the same GFN.
>>>
>>> I am not sure about U-boot and other OSes here.
>>>
>>> This is not new but it is going to be become a bigger source of 
>>> problem (read more chance to hit it) as we try to re-use the first 
>>> region.
>>>
>>> This means the first region should exclusively used for the 
>>> grant-table (in a specific order) until the issue is properly fixed.
>>
>> Thank you for the explanation, it is clear now.
>>
>>
>>>
>>>
>>> A potential fix is to update the array in p2m_put_l3_page(). The 
>>> default max size of the array is 1024, so it might be fine to just 
>>> walk it (it would be simply a comparison).
>>
>> I think, this would work. Looks like we don't need to walk for each 
>> gfn which is being freed, we could just filter it by p2m_is_ram() ...
>
> Well. This would still potentially result to a few unnecessary walk. I 
> would consider to introduce a new P2M type or possibly add a check if 
> the page is in xenheap (grant-table are xenheap pages so far).

Indeed, this would be better; personally I would prefer to check if the 
page is in the xenheap.
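
i.e. something like the following guard before doing the walk (just an
illustration; the names of the locals and whether both the GFN and the
underlying page are available at that point in p2m_put_l3_page() still
need to be checked):

/* Sketch: only walk the grant-frame array for xenheap RAM pages. */
if ( p2m_is_ram(type) && is_xen_heap_page(mfn_to_page(mfn)) )
    gnttab_drop_frame_gfn(d, gfn);   /* helper from the sketch above */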


>
>
> Cheers,
>
Oleksandr Tyshchenko Aug. 9, 2021, 6:24 p.m. UTC | #35
On 09.08.21 18:42, Julien Grall wrote:
> Hi Oleksandr,


Hi Julien.


Thank you for the input.


>
> On 07/08/2021 18:03, Oleksandr wrote:
>>
>> On 06.08.21 03:30, Stefano Stabellini wrote:
>>
>> Hi Stefano
>>
>>> On Wed, 4 Aug 2021, Julien Grall wrote:
>>>>> +#define GUEST_SAFE_RANGE_BASE xen_mk_ullong(0xDE00000000) /* 
>>>>> 128GB */
>>>>> +#define GUEST_SAFE_RANGE_SIZE xen_mk_ullong(0x0200000000)
>>>>>
>>>>> While the possible new DT bindings has not been agreed yet, I re-used
>>>>> existing "reg" property under the hypervisor node to pass safe 
>>>>> range as a
>>>>> second region,
>>>>> https://elixir.bootlin.com/linux/v5.14-rc4/source/Documentation/devicetree/bindings/arm/xen.txt#L10: 
>>>>>
>>>> So a single region works for a guest today, but for dom0 we will 
>>>> need multiple
>>>> regions because it is may be difficult to find enough contiguous 
>>>> space for a
>>>> single region.
>>>>
>>>> That said, as dom0 is mapped 1:1 (including some guest mapping), 
>>>> there is also
>>>> the question where to allocate the safe region. For grant table, we 
>>>> so far
>>>> re-use the Xen address space because it is assumed it will space 
>>>> will always
>>>> be bigger than the grant table.
>>>>
>>>> I am not sure yet where we could allocate the safe regions. 
>>>> Stefano, do you
>>>> have any ideas?
>>> The safest choice would be the address range corresponding to memory
>>> (/memory) not already allocated to Dom0.
>>>
>>> For instance from my last boot logs:
>>> (XEN) Allocating 1:1 mappings totalling 1600MB for dom0:
>>> (XEN) BANK[0] 0x00000010000000-0x00000070000000 (1536MB)
>>> (XEN) BANK[1] 0x00000078000000-0x0000007c000000 (64MB)
>>>
>>> All the other ranges could be given as unallocated space:
>>>
>>> - 0x0 - 0x10000000
>>> - 0x70000000 - 0x78000000
>>> - 0x8_0000_0000 - 0x8_8000_0000
>>
>> Thank you for the ideas.
>>
>> If I got the idea correctly, yes, as these ranges represent the real 
>> RAM, so no I/O would be in conflict with them and as the result - no 
>> overlaps would be expected.
>> But, I wonder, would this work if we have IOMMU enabled for Dom0 and 
>> need to establish 1:1 mapping for the DMA devices to work with grant 
>> mappings...
>> In arm_iommu_map_page() we call guest_physmap_add_entry() with gfn = 
>> mfn, so the question is could we end up with this new gfn replacing 
>> the valid mapping
>> (with gfn allocated from the safe region)?
>
> Right, when we enable the IOMMU for dom0, Xen will add an extra 
> mapping with GFN == MFN for foreign and grant pages. This is because 
> Linux is not aware that whether a device is protected by an IOMMU. 
> Therefore it is assuming it is not and will use the MFN to configure 
> for DMA transaction.
>
> We can't remove the mapping without significant changes in Linux and 
> Xen. I would not mandate them for this work.
>
> That said, I think it would be acceptable to have different way to 
> find the region depending on the dom0 configuration. So we could use 
> the RAM not used by dom0 when the IOMMU is turned off.

OK


>
>>> The second best choice would be an hole: an address range not used by
>>> anybody else (no reg property) and also not even mappable by a bus (not
>>> covered by a ranges property). This is not the best choice because 
>>> there
>>> can cases where physical resources appear afterwards.
>
> Are you saying that the original device-tree doesn't even describe 
> them in any way (i.e. reserved...)?
>
>>
>> Unfortunately, yes.
>
> So the decision where the safe region is located will be done by Xen. 
> There is no involvement of the domain (it will discover the region 
> from the DT). Therefore, I don't think we need to think about 
> everything right now as we could adapt this is exact region is not 
> part of the stable ABI.
>
> The hotplug is one I would defer because this is not supported (and 
> quite likely not working) in Xen upstream today.

Sounds reasonable.


>
>
> Now regarding the case where dom0 is using the IOMMU. The assumption 
> is Xen will be able to figure out all the regions used from the 
> firmware table (ACPI or DT).
>
> AFAIK, this assumption would be correct for DT. However, for ACPI, I 
> remember we were not able to find all the MMIOs region in Xen (see [1] 
> and [2]). So even this solution would not work for ACPI.
>
> If I am not mistaken, we don't support IOMMU with ACPI yet. So we 
> could defer the problem to when this is going to be supported.

Sounds reasonable.


To summarize:

0. Skip the ACPI case for now, implement the DT case

1. If the IOMMU is enabled for Dom0 -> provide holes found in the host 
DT as safe ranges

I would take into account holes >= 1MB. I am wondering, do we need any 
special alignment here other than PAGE_SIZE?

2. If the IOMMU is disabled for Dom0 -> provide RAM which is not 
assigned to Dom0 as safe ranges

We could even provide holes here as well.
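
For 2, a rough sketch of what I have in mind, re-using the existing
rangeset helpers (bootinfo.mem is the host RAM list and kinfo->mem is
what was allocated to Dom0; the exact hook point and how the result gets
turned into "reg" entries are still open, so this is an assumption, not
a final implementation):

static int __init find_unallocated_ram(const struct kernel_info *kinfo)
{
    struct rangeset *unalloc = rangeset_new(NULL, "safe ranges", 0);
    unsigned int i;
    int rc = 0;

    if ( !unalloc )
        return -ENOMEM;

    /* Start from all host RAM banks ... */
    for ( i = 0; i < bootinfo.mem.nr_banks; i++ )
    {
        const struct membank *bank = &bootinfo.mem.bank[i];

        rc = rangeset_add_range(unalloc, PFN_DOWN(bank->start),
                                PFN_DOWN(bank->start + bank->size - 1));
        if ( rc )
            goto out;
    }

    /* ... and subtract everything that has been given to Dom0. */
    for ( i = 0; i < kinfo->mem.nr_banks; i++ )
    {
        const struct membank *bank = &kinfo->mem.bank[i];

        rc = rangeset_remove_range(unalloc, PFN_DOWN(bank->start),
                                   PFN_DOWN(bank->start + bank->size - 1));
        if ( rc )
            goto out;
    }

    /* TODO: 64KB-align the leftovers and expose them via the DT node. */

 out:
    rangeset_destroy(unalloc);
    return rc;
}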


>
>
> Cheers,
>
> [1] https://marc.info/?l=linux-arm-kernel&m=148469169210500&w=2
> [2] Xen commit 80f9c316708400cea4417e36337267d3b26591db
>
Julien Grall Aug. 9, 2021, 8:45 p.m. UTC | #36
On 09/08/2021 19:24, Oleksandr wrote:
> 
> On 09.08.21 18:42, Julien Grall wrote:
>> Hi Oleksandr,
> 
> 
> Hi Julien.
> 
> 
> Thank you for the input.
> 
> 
>>
>> On 07/08/2021 18:03, Oleksandr wrote:
>>>
>>> On 06.08.21 03:30, Stefano Stabellini wrote:
>>>
>>> Hi Stefano
>>>
>>>> On Wed, 4 Aug 2021, Julien Grall wrote:
>>>>>> +#define GUEST_SAFE_RANGE_BASE xen_mk_ullong(0xDE00000000) /* 
>>>>>> 128GB */
>>>>>> +#define GUEST_SAFE_RANGE_SIZE xen_mk_ullong(0x0200000000)
>>>>>>
>>>>>> While the possible new DT bindings has not been agreed yet, I re-used
>>>>>> existing "reg" property under the hypervisor node to pass safe 
>>>>>> range as a
>>>>>> second region,
>>>>>> https://elixir.bootlin.com/linux/v5.14-rc4/source/Documentation/devicetree/bindings/arm/xen.txt#L10: 
>>>>>>
>>>>> So a single region works for a guest today, but for dom0 we will 
>>>>> need multiple
>>>>> regions because it is may be difficult to find enough contiguous 
>>>>> space for a
>>>>> single region.
>>>>>
>>>>> That said, as dom0 is mapped 1:1 (including some guest mapping), 
>>>>> there is also
>>>>> the question where to allocate the safe region. For grant table, we 
>>>>> so far
>>>>> re-use the Xen address space because it is assumed it will space 
>>>>> will always
>>>>> be bigger than the grant table.
>>>>>
>>>>> I am not sure yet where we could allocate the safe regions. 
>>>>> Stefano, do you
>>>>> have any ideas?
>>>> The safest choice would be the address range corresponding to memory
>>>> (/memory) not already allocated to Dom0.
>>>>
>>>> For instance from my last boot logs:
>>>> (XEN) Allocating 1:1 mappings totalling 1600MB for dom0:
>>>> (XEN) BANK[0] 0x00000010000000-0x00000070000000 (1536MB)
>>>> (XEN) BANK[1] 0x00000078000000-0x0000007c000000 (64MB)
>>>>
>>>> All the other ranges could be given as unallocated space:
>>>>
>>>> - 0x0 - 0x10000000
>>>> - 0x70000000 - 0x78000000
>>>> - 0x8_0000_0000 - 0x8_8000_0000
>>>
>>> Thank you for the ideas.
>>>
>>> If I got the idea correctly, yes, as these ranges represent the real 
>>> RAM, so no I/O would be in conflict with them and as the result - no 
>>> overlaps would be expected.
>>> But, I wonder, would this work if we have IOMMU enabled for Dom0 and 
>>> need to establish 1:1 mapping for the DMA devices to work with grant 
>>> mappings...
>>> In arm_iommu_map_page() we call guest_physmap_add_entry() with gfn = 
>>> mfn, so the question is could we end up with this new gfn replacing 
>>> the valid mapping
>>> (with gfn allocated from the safe region)?
>>
>> Right, when we enable the IOMMU for dom0, Xen will add an extra 
>> mapping with GFN == MFN for foreign and grant pages. This is because 
>> Linux is not aware that whether a device is protected by an IOMMU. 
>> Therefore it is assuming it is not and will use the MFN to configure 
>> for DMA transaction.
>>
>> We can't remove the mapping without significant changes in Linux and 
>> Xen. I would not mandate them for this work.
>>
>> That said, I think it would be acceptable to have different way to 
>> find the region depending on the dom0 configuration. So we could use 
>> the RAM not used by dom0 when the IOMMU is turned off.
> 
> OK
> 
> 
>>
>>>> The second best choice would be an hole: an address range not used by
>>>> anybody else (no reg property) and also not even mappable by a bus (not
>>>> covered by a ranges property). This is not the best choice because 
>>>> there
>>>> can cases where physical resources appear afterwards.
>>
>> Are you saying that the original device-tree doesn't even describe 
>> them in any way (i.e. reserved...)?
>>
>>>
>>> Unfortunately, yes.
>>
>> So the decision where the safe region is located will be done by Xen. 
>> There is no involvement of the domain (it will discover the region 
>> from the DT). Therefore, I don't think we need to think about 
>> everything right now as we could adapt this is exact region is not 
>> part of the stable ABI.
>>
>> The hotplug is one I would defer because this is not supported (and 
>> quite likely not working) in Xen upstream today.
> 
> Sounds reasonable.
> 
> 
>>
>>
>> Now regarding the case where dom0 is using the IOMMU. The assumption 
>> is Xen will be able to figure out all the regions used from the 
>> firmware table (ACPI or DT).
>>
>> AFAIK, this assumption would be correct for DT. However, for ACPI, I 
>> remember we were not able to find all the MMIOs region in Xen (see [1] 
>> and [2]). So even this solution would not work for ACPI.
>>
>> If I am not mistaken, we don't support IOMMU with ACPI yet. So we 
>> could defer the problem to when this is going to be supported.
> 
> Sounds reasonable.
> 
> 
> To summarize:
> 
> 0. Skip ACPI case for now, implement for DT case

Just to be clear, I suggested to skip it when the IOMMU is enabled with 
ACPI. We should still support the case without IOMMU. The implementation 
would be the same as 2.

> 
> 1. If IOMMU is enabled for Dom0 -> provide holes found in Host DT as 
> safe ranges
> 
> I would take into the account holes >= 1MB.

May I ask why 1MB?

> I am wondering, do we need a 
> special alignment here other than a PAGE_SIZE?

It needs to be 64KB aligned so a guest using 64KB pages can use it.

> 
> 2. If IOMMU is disabled for Dom0 -> provide RAM which not assigned to 
> Dom0 as safe ranges
> 
> We could even provide holes here as well.

I would rather not. We likely need a hack for the hotplug case, so I 
want to keep them contained to the IOMMU case unless there is a strong 
reason to do otherwise.

Cheers,
Oleksandr Tyshchenko Aug. 9, 2021, 9:18 p.m. UTC | #37
On 09.08.21 23:45, Julien Grall wrote:


Hi Julien

>
>
> On 09/08/2021 19:24, Oleksandr wrote:
>>
>> On 09.08.21 18:42, Julien Grall wrote:
>>> Hi Oleksandr,
>>
>>
>> Hi Julien.
>>
>>
>> Thank you for the input.
>>
>>
>>>
>>> On 07/08/2021 18:03, Oleksandr wrote:
>>>>
>>>> On 06.08.21 03:30, Stefano Stabellini wrote:
>>>>
>>>> Hi Stefano
>>>>
>>>>> On Wed, 4 Aug 2021, Julien Grall wrote:
>>>>>>> +#define GUEST_SAFE_RANGE_BASE xen_mk_ullong(0xDE00000000) /* 
>>>>>>> 128GB */
>>>>>>> +#define GUEST_SAFE_RANGE_SIZE xen_mk_ullong(0x0200000000)
>>>>>>>
>>>>>>> While the possible new DT bindings has not been agreed yet, I 
>>>>>>> re-used
>>>>>>> existing "reg" property under the hypervisor node to pass safe 
>>>>>>> range as a
>>>>>>> second region,
>>>>>>> https://elixir.bootlin.com/linux/v5.14-rc4/source/Documentation/devicetree/bindings/arm/xen.txt#L10: 
>>>>>>>
>>>>>> So a single region works for a guest today, but for dom0 we will 
>>>>>> need multiple
>>>>>> regions because it is may be difficult to find enough contiguous 
>>>>>> space for a
>>>>>> single region.
>>>>>>
>>>>>> That said, as dom0 is mapped 1:1 (including some guest mapping), 
>>>>>> there is also
>>>>>> the question where to allocate the safe region. For grant table, 
>>>>>> we so far
>>>>>> re-use the Xen address space because it is assumed it will space 
>>>>>> will always
>>>>>> be bigger than the grant table.
>>>>>>
>>>>>> I am not sure yet where we could allocate the safe regions. 
>>>>>> Stefano, do you
>>>>>> have any ideas?
>>>>> The safest choice would be the address range corresponding to memory
>>>>> (/memory) not already allocated to Dom0.
>>>>>
>>>>> For instance from my last boot logs:
>>>>> (XEN) Allocating 1:1 mappings totalling 1600MB for dom0:
>>>>> (XEN) BANK[0] 0x00000010000000-0x00000070000000 (1536MB)
>>>>> (XEN) BANK[1] 0x00000078000000-0x0000007c000000 (64MB)
>>>>>
>>>>> All the other ranges could be given as unallocated space:
>>>>>
>>>>> - 0x0 - 0x10000000
>>>>> - 0x70000000 - 0x78000000
>>>>> - 0x8_0000_0000 - 0x8_8000_0000
>>>>
>>>> Thank you for the ideas.
>>>>
>>>> If I got the idea correctly, yes, as these ranges represent the 
>>>> real RAM, so no I/O would be in conflict with them and as the 
>>>> result - no overlaps would be expected.
>>>> But, I wonder, would this work if we have IOMMU enabled for Dom0 
>>>> and need to establish 1:1 mapping for the DMA devices to work with 
>>>> grant mappings...
>>>> In arm_iommu_map_page() we call guest_physmap_add_entry() with gfn 
>>>> = mfn, so the question is could we end up with this new gfn 
>>>> replacing the valid mapping
>>>> (with gfn allocated from the safe region)?
>>>
>>> Right, when we enable the IOMMU for dom0, Xen will add an extra 
>>> mapping with GFN == MFN for foreign and grant pages. This is because 
>>> Linux is not aware that whether a device is protected by an IOMMU. 
>>> Therefore it is assuming it is not and will use the MFN to configure 
>>> for DMA transaction.
>>>
>>> We can't remove the mapping without significant changes in Linux and 
>>> Xen. I would not mandate them for this work.
>>>
>>> That said, I think it would be acceptable to have different way to 
>>> find the region depending on the dom0 configuration. So we could use 
>>> the RAM not used by dom0 when the IOMMU is turned off.
>>
>> OK
>>
>>
>>>
>>>>> The second best choice would be an hole: an address range not used by
>>>>> anybody else (no reg property) and also not even mappable by a bus 
>>>>> (not
>>>>> covered by a ranges property). This is not the best choice because 
>>>>> there
>>>>> can cases where physical resources appear afterwards.
>>>
>>> Are you saying that the original device-tree doesn't even describe 
>>> them in any way (i.e. reserved...)?
>>>
>>>>
>>>> Unfortunately, yes.
>>>
>>> So the decision where the safe region is located will be done by 
>>> Xen. There is no involvement of the domain (it will discover the 
>>> region from the DT). Therefore, I don't think we need to think about 
>>> everything right now as we could adapt this is exact region is not 
>>> part of the stable ABI.
>>>
>>> The hotplug is one I would defer because this is not supported (and 
>>> quite likely not working) in Xen upstream today.
>>
>> Sounds reasonable.
>>
>>
>>>
>>>
>>> Now regarding the case where dom0 is using the IOMMU. The assumption 
>>> is Xen will be able to figure out all the regions used from the 
>>> firmware table (ACPI or DT).
>>>
>>> AFAIK, this assumption would be correct for DT. However, for ACPI, I 
>>> remember we were not able to find all the MMIOs region in Xen (see 
>>> [1] and [2]). So even this solution would not work for ACPI.
>>>
>>> If I am not mistaken, we don't support IOMMU with ACPI yet. So we 
>>> could defer the problem to when this is going to be supported.
>>
>> Sounds reasonable.
>>
>>
>> To summarize:
>>
>> 0. Skip ACPI case for now, implement for DT case
>
> Just to be clear, I suggested to skip it when the IOMMU is enabled 
> with ACPI. We should still support the case without IOMMU. The 
> implementation would be the same as 2.

yes, sorry for not being precise


>
>
>>
>> 1. If IOMMU is enabled for Dom0 -> provide holes found in Host DT as 
>> safe ranges
>>
>> I would take into the account holes >= 1MB.
>
> May I ask why 1MB?

Nothing special, I was just thinking of not bothering with small 
regions which would not be very useful overall but could bloat the 
resulting reg property.

Anyway, I would be OK with any size.


>
>> I am wondering, do we need a special alignment here other than a 
>> PAGE_SIZE?
>
> It needs to be 64KB aligned so a guest using 64KB pages can use it.

ok, sounds reasonable
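
i.e. something like this simple trimming when collecting the ranges (a
sketch; SAFE_RANGE_ALIGN and the helper name are just my working
assumptions):

/* Trim a candidate range so both ends are 64KB aligned. */
#define SAFE_RANGE_ALIGN  SZ_64K

static bool trim_safe_range(paddr_t *start, paddr_t *end)
{
    *start = ROUNDUP(*start, SAFE_RANGE_ALIGN);
    *end  &= ~(paddr_t)(SAFE_RANGE_ALIGN - 1);

    /* Drop ranges that became empty or too small to be worth it. */
    return *end > *start;
}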


>
>>
>> 2. If IOMMU is disabled for Dom0 -> provide RAM which not assigned to 
>> Dom0 as safe ranges
>>
>> We could even provide holes here as well.
>
> I would rather not. We likely need hack for the hotplug case. So I 
> want to keep them contained to IOMMU unless there is a strong reason 
> to do it.

ok, I got it


>
> Cheers,
>
Wei Chen Aug. 10, 2021, 6:34 a.m. UTC | #38
Hi Oleksandr,

> -----Original Message-----
> From: Oleksandr <olekstysh@gmail.com>
> Sent: 10 August 2021 2:25
> To: Julien Grall <julien@xen.org>
> Cc: Stefano Stabellini <sstabellini@kernel.org>; Andrew Cooper
> <andrew.cooper3@citrix.com>; xen-devel@lists.xenproject.org; Oleksandr
> Tyshchenko <oleksandr_tyshchenko@epam.com>; Daniel De Graaf
> <dgdegra@tycho.nsa.gov>; Daniel P. Smith <dpsmith@apertussolutions.com>;
> Ian Jackson <iwj@xenproject.org>; Wei Liu <wl@xen.org>; George Dunlap
> <george.dunlap@citrix.com>; Jan Beulich <jbeulich@suse.com>; Volodymyr
> Babchuk <Volodymyr_Babchuk@epam.com>; Roger Pau Monné
> <roger.pau@citrix.com>; Bertrand Marquis <Bertrand.Marquis@arm.com>; Wei
> Chen <Wei.Chen@arm.com>
> Subject: Re: [RFC PATCH] xen/memory: Introduce a hypercall to provide
> unallocated space
> 
> 
> On 09.08.21 18:42, Julien Grall wrote:
> > Hi Oleksandr,
> 
> 
> Hi Julien.
> 
> 
> Thank you for the input.
> 
> 
> >
> > On 07/08/2021 18:03, Oleksandr wrote:
> >>
> >> On 06.08.21 03:30, Stefano Stabellini wrote:
> >>
> >> Hi Stefano
> >>
> >>> On Wed, 4 Aug 2021, Julien Grall wrote:
> >>>>> +#define GUEST_SAFE_RANGE_BASE xen_mk_ullong(0xDE00000000) /*
> >>>>> 128GB */
> >>>>> +#define GUEST_SAFE_RANGE_SIZE xen_mk_ullong(0x0200000000)
> >>>>>
> >>>>> While the possible new DT bindings has not been agreed yet, I re-
> used
> >>>>> existing "reg" property under the hypervisor node to pass safe
> >>>>> range as a
> >>>>> second region,
> >>>>> https://elixir.bootlin.com/linux/v5.14-
> rc4/source/Documentation/devicetree/bindings/arm/xen.txt#L10:
> >>>>>
> >>>> So a single region works for a guest today, but for dom0 we will
> >>>> need multiple
> >>>> regions because it is may be difficult to find enough contiguous
> >>>> space for a
> >>>> single region.
> >>>>
> >>>> That said, as dom0 is mapped 1:1 (including some guest mapping),
> >>>> there is also
> >>>> the question where to allocate the safe region. For grant table, we
> >>>> so far
> >>>> re-use the Xen address space because it is assumed it will space
> >>>> will always
> >>>> be bigger than the grant table.
> >>>>
> >>>> I am not sure yet where we could allocate the safe regions.
> >>>> Stefano, do you
> >>>> have any ideas?
> >>> The safest choice would be the address range corresponding to memory
> >>> (/memory) not already allocated to Dom0.
> >>>
> >>> For instance from my last boot logs:
> >>> (XEN) Allocating 1:1 mappings totalling 1600MB for dom0:
> >>> (XEN) BANK[0] 0x00000010000000-0x00000070000000 (1536MB)
> >>> (XEN) BANK[1] 0x00000078000000-0x0000007c000000 (64MB)
> >>>
> >>> All the other ranges could be given as unallocated space:
> >>>
> >>> - 0x0 - 0x10000000
> >>> - 0x70000000 - 0x78000000
> >>> - 0x8_0000_0000 - 0x8_8000_0000
> >>
> >> Thank you for the ideas.
> >>
> >> If I got the idea correctly, yes, as these ranges represent the real
> >> RAM, so no I/O would be in conflict with them and as the result - no
> >> overlaps would be expected.
> >> But, I wonder, would this work if we have IOMMU enabled for Dom0 and
> >> need to establish 1:1 mapping for the DMA devices to work with grant
> >> mappings...
> >> In arm_iommu_map_page() we call guest_physmap_add_entry() with gfn =
> >> mfn, so the question is could we end up with this new gfn replacing
> >> the valid mapping
> >> (with gfn allocated from the safe region)?
> >
> > Right, when we enable the IOMMU for dom0, Xen will add an extra
> > mapping with GFN == MFN for foreign and grant pages. This is because
> > Linux is not aware that whether a device is protected by an IOMMU.
> > Therefore it is assuming it is not and will use the MFN to configure
> > for DMA transaction.
> >
> > We can't remove the mapping without significant changes in Linux and
> > Xen. I would not mandate them for this work.
> >
> > That said, I think it would be acceptable to have different way to
> > find the region depending on the dom0 configuration. So we could use
> > the RAM not used by dom0 when the IOMMU is turned off.
> 
> OK
> 
> 
> >
> >>> The second best choice would be an hole: an address range not used by
> >>> anybody else (no reg property) and also not even mappable by a bus
> (not
> >>> covered by a ranges property). This is not the best choice because
> >>> there
> >>> can cases where physical resources appear afterwards.
> >
> > Are you saying that the original device-tree doesn't even describe
> > them in any way (i.e. reserved...)?
> >
> >>
> >> Unfortunately, yes.
> >
> > So the decision where the safe region is located will be done by Xen.
> > There is no involvement of the domain (it will discover the region
> > from the DT). Therefore, I don't think we need to think about
> > everything right now as we could adapt this is exact region is not
> > part of the stable ABI.
> >
> > The hotplug is one I would defer because this is not supported (and
> > quite likely not working) in Xen upstream today.
> 
> Sounds reasonable.
> 
> 
> >
> >
> > Now regarding the case where dom0 is using the IOMMU. The assumption
> > is Xen will be able to figure out all the regions used from the
> > firmware table (ACPI or DT).
> >
> > AFAIK, this assumption would be correct for DT. However, for ACPI, I
> > remember we were not able to find all the MMIOs region in Xen (see [1]
> > and [2]). So even this solution would not work for ACPI.
> >
> > If I am not mistaken, we don't support IOMMU with ACPI yet. So we
> > could defer the problem to when this is going to be supported.
> 
> Sounds reasonable.
> 
> 
> To summarize:
> 
> 0. Skip ACPI case for now, implement for DT case
> 
> 1. If IOMMU is enabled for Dom0 -> provide holes found in Host DT as
> safe ranges
> 

Can static allocation and direct-mapped driver domains be treated
as this case?

> I would take into the account holes >= 1MB. I am wondering, do we need a
> special alignment here other than a PAGE_SIZE?
> 
> 2. If IOMMU is disabled for Dom0 -> provide RAM which not assigned to
> Dom0 as safe ranges
> 
> We could even provide holes here as well.
> 
> 
> >
> >
> > Cheers,
> >
> > [1] https://marc.info/?l=linux-arm-kernel&m=148469169210500&w=2
> > [2] Xen commit 80f9c316708400cea4417e36337267d3b26591db
> >
> --
> Regards,
> 
> Oleksandr Tyshchenko
Oleksandr Tyshchenko Aug. 10, 2021, 11:58 a.m. UTC | #39
On 10.08.21 09:34, Wei Chen wrote:
> Hi Oleksandr,

Hi Wei, Julien.


>
>> -----Original Message-----
>> From: Oleksandr <olekstysh@gmail.com>
>> Sent: 10 August 2021 2:25
>> To: Julien Grall <julien@xen.org>
>> Cc: Stefano Stabellini <sstabellini@kernel.org>; Andrew Cooper
>> <andrew.cooper3@citrix.com>; xen-devel@lists.xenproject.org; Oleksandr
>> Tyshchenko <oleksandr_tyshchenko@epam.com>; Daniel De Graaf
>> <dgdegra@tycho.nsa.gov>; Daniel P. Smith <dpsmith@apertussolutions.com>;
>> Ian Jackson <iwj@xenproject.org>; Wei Liu <wl@xen.org>; George Dunlap
>> <george.dunlap@citrix.com>; Jan Beulich <jbeulich@suse.com>; Volodymyr
>> Babchuk <Volodymyr_Babchuk@epam.com>; Roger Pau Monné
>> <roger.pau@citrix.com>; Bertrand Marquis <Bertrand.Marquis@arm.com>; Wei
>> Chen <Wei.Chen@arm.com>
>> Subject: Re: [RFC PATCH] xen/memory: Introduce a hypercall to provide
>> unallocated space
>>
>>
>> On 09.08.21 18:42, Julien Grall wrote:
>>> Hi Oleksandr,
>>
>> Hi Julien.
>>
>>
>> Thank you for the input.
>>
>>
>>> On 07/08/2021 18:03, Oleksandr wrote:
>>>> On 06.08.21 03:30, Stefano Stabellini wrote:
>>>>
>>>> Hi Stefano
>>>>
>>>>> On Wed, 4 Aug 2021, Julien Grall wrote:
>>>>>>> +#define GUEST_SAFE_RANGE_BASE xen_mk_ullong(0xDE00000000) /*
>>>>>>> 128GB */
>>>>>>> +#define GUEST_SAFE_RANGE_SIZE xen_mk_ullong(0x0200000000)
>>>>>>>
>>>>>>> While the possible new DT bindings has not been agreed yet, I re-
>> used
>>>>>>> existing "reg" property under the hypervisor node to pass safe
>>>>>>> range as a
>>>>>>> second region,
>>>>>>> https://elixir.bootlin.com/linux/v5.14-
>> rc4/source/Documentation/devicetree/bindings/arm/xen.txt#L10:
>>>>>> So a single region works for a guest today, but for dom0 we will
>>>>>> need multiple
>>>>>> regions because it is may be difficult to find enough contiguous
>>>>>> space for a
>>>>>> single region.
>>>>>>
>>>>>> That said, as dom0 is mapped 1:1 (including some guest mapping),
>>>>>> there is also
>>>>>> the question where to allocate the safe region. For grant table, we
>>>>>> so far
>>>>>> re-use the Xen address space because it is assumed it will space
>>>>>> will always
>>>>>> be bigger than the grant table.
>>>>>>
>>>>>> I am not sure yet where we could allocate the safe regions.
>>>>>> Stefano, do you
>>>>>> have any ideas?
>>>>> The safest choice would be the address range corresponding to memory
>>>>> (/memory) not already allocated to Dom0.
>>>>>
>>>>> For instance from my last boot logs:
>>>>> (XEN) Allocating 1:1 mappings totalling 1600MB for dom0:
>>>>> (XEN) BANK[0] 0x00000010000000-0x00000070000000 (1536MB)
>>>>> (XEN) BANK[1] 0x00000078000000-0x0000007c000000 (64MB)
>>>>>
>>>>> All the other ranges could be given as unallocated space:
>>>>>
>>>>> - 0x0 - 0x10000000
>>>>> - 0x70000000 - 0x78000000
>>>>> - 0x8_0000_0000 - 0x8_8000_0000
>>>> Thank you for the ideas.
>>>>
>>>> If I got the idea correctly, yes, as these ranges represent the real
>>>> RAM, so no I/O would be in conflict with them and as the result - no
>>>> overlaps would be expected.
>>>> But, I wonder, would this work if we have IOMMU enabled for Dom0 and
>>>> need to establish 1:1 mapping for the DMA devices to work with grant
>>>> mappings...
>>>> In arm_iommu_map_page() we call guest_physmap_add_entry() with gfn =
>>>> mfn, so the question is could we end up with this new gfn replacing
>>>> the valid mapping
>>>> (with gfn allocated from the safe region)?
>>> Right, when we enable the IOMMU for dom0, Xen will add an extra
>>> mapping with GFN == MFN for foreign and grant pages. This is because
>>> Linux is not aware that whether a device is protected by an IOMMU.
>>> Therefore it is assuming it is not and will use the MFN to configure
>>> for DMA transaction.
>>>
>>> We can't remove the mapping without significant changes in Linux and
>>> Xen. I would not mandate them for this work.
>>>
>>> That said, I think it would be acceptable to have different way to
>>> find the region depending on the dom0 configuration. So we could use
>>> the RAM not used by dom0 when the IOMMU is turned off.
>> OK
>>
>>
>>>>> The second best choice would be an hole: an address range not used by
>>>>> anybody else (no reg property) and also not even mappable by a bus
>> (not
>>>>> covered by a ranges property). This is not the best choice because
>>>>> there
>>>>> can cases where physical resources appear afterwards.
>>> Are you saying that the original device-tree doesn't even describe
>>> them in any way (i.e. reserved...)?
>>>
>>>> Unfortunately, yes.
>>> So the decision where the safe region is located will be done by Xen.
>>> There is no involvement of the domain (it will discover the region
>>> from the DT). Therefore, I don't think we need to think about
>>> everything right now as we could adapt this is exact region is not
>>> part of the stable ABI.
>>>
>>> The hotplug is one I would defer because this is not supported (and
>>> quite likely not working) in Xen upstream today.
>> Sounds reasonable.
>>
>>
>>>
>>> Now regarding the case where dom0 is using the IOMMU. The assumption
>>> is Xen will be able to figure out all the regions used from the
>>> firmware table (ACPI or DT).
>>>
>>> AFAIK, this assumption would be correct for DT. However, for ACPI, I
>>> remember we were not able to find all the MMIOs region in Xen (see [1]
>>> and [2]). So even this solution would not work for ACPI.
>>>
>>> If I am not mistaken, we don't support IOMMU with ACPI yet. So we
>>> could defer the problem to when this is going to be supported.
>> Sounds reasonable.
>>
>>
>> To summarize:
>>
>> 0. Skip ACPI case for now, implement for DT case
>>
>> 1. If IOMMU is enabled for Dom0 -> provide holes found in Host DT as
>> safe ranges
>>
> Does static allocation and direct mapping driver domain can be treated
> as this case?
I am not sure I can answer this question correctly due to my limited 
knowledge of these features.

But it feels to me that the holes solution would work; at least I don't see 
why not.


I wonder why these can't be treated as case #2 (where we provide 
unassigned RAM); I also don't see why not, although there might be 
pitfalls with a direct-mapped driver domain. Julien, do you have any 
opinion on this?



>
>> I would take into the account holes >= 1MB. I am wondering, do we need a
>> special alignment here other than a PAGE_SIZE?
>>
>> 2. If IOMMU is disabled for Dom0 -> provide RAM which not assigned to
>> Dom0 as safe ranges
>>
>> We could even provide holes here as well.
>>
>>
>>>
>>> Cheers,
>>>
>>> [1] https://marc.info/?l=linux-arm-kernel&m=148469169210500&w=2
>>> [2] Xen commit 80f9c316708400cea4417e36337267d3b26591db
>>>
>> --
>> Regards,
>>
>> Oleksandr Tyshchenko
Julien Grall Aug. 10, 2021, 4:21 p.m. UTC | #40
On 10/08/2021 12:58, Oleksandr wrote:
> 
> On 10.08.21 09:34, Wei Chen wrote:
>> Hi Oleksandr,
> 
> Hi Wei, Julien.
Hi,

>>> -----Original Message-----
>>> From: Oleksandr <olekstysh@gmail.com>
>>> Sent: 10 August 2021 2:25
>>> To: Julien Grall <julien@xen.org>
>>> Cc: Stefano Stabellini <sstabellini@kernel.org>; Andrew Cooper
>>> <andrew.cooper3@citrix.com>; xen-devel@lists.xenproject.org; Oleksandr
>>> Tyshchenko <oleksandr_tyshchenko@epam.com>; Daniel De Graaf
>>> <dgdegra@tycho.nsa.gov>; Daniel P. Smith <dpsmith@apertussolutions.com>;
>>> Ian Jackson <iwj@xenproject.org>; Wei Liu <wl@xen.org>; George Dunlap
>>> <george.dunlap@citrix.com>; Jan Beulich <jbeulich@suse.com>; Volodymyr
>>> Babchuk <Volodymyr_Babchuk@epam.com>; Roger Pau Monné
>>> <roger.pau@citrix.com>; Bertrand Marquis <Bertrand.Marquis@arm.com>; Wei
>>> Chen <Wei.Chen@arm.com>
>>> Subject: Re: [RFC PATCH] xen/memory: Introduce a hypercall to provide
>>> unallocated space
>>>
>>>
>>> On 09.08.21 18:42, Julien Grall wrote:
>>>> Hi Oleksandr,
>>>
>>> Hi Julien.
>>>
>>>
>>> Thank you for the input.
>>>
>>>
>>>> On 07/08/2021 18:03, Oleksandr wrote:
>>>>> On 06.08.21 03:30, Stefano Stabellini wrote:
>>>>>
>>>>> Hi Stefano
>>>>>
>>>>>> On Wed, 4 Aug 2021, Julien Grall wrote:
>>>>>>>> +#define GUEST_SAFE_RANGE_BASE xen_mk_ullong(0xDE00000000) /*
>>>>>>>> 128GB */
>>>>>>>> +#define GUEST_SAFE_RANGE_SIZE xen_mk_ullong(0x0200000000)
>>>>>>>>
>>>>>>>> While the possible new DT bindings has not been agreed yet, I re-
>>> used
>>>>>>>> existing "reg" property under the hypervisor node to pass safe
>>>>>>>> range as a
>>>>>>>> second region,
>>>>>>>> https://elixir.bootlin.com/linux/v5.14-
>>> rc4/source/Documentation/devicetree/bindings/arm/xen.txt#L10:
>>>>>>> So a single region works for a guest today, but for dom0 we will
>>>>>>> need multiple
>>>>>>> regions because it is may be difficult to find enough contiguous
>>>>>>> space for a
>>>>>>> single region.
>>>>>>>
>>>>>>> That said, as dom0 is mapped 1:1 (including some guest mapping),
>>>>>>> there is also
>>>>>>> the question where to allocate the safe region. For grant table, we
>>>>>>> so far
>>>>>>> re-use the Xen address space because it is assumed it will space
>>>>>>> will always
>>>>>>> be bigger than the grant table.
>>>>>>>
>>>>>>> I am not sure yet where we could allocate the safe regions.
>>>>>>> Stefano, do you
>>>>>>> have any ideas?
>>>>>> The safest choice would be the address range corresponding to memory
>>>>>> (/memory) not already allocated to Dom0.
>>>>>>
>>>>>> For instance from my last boot logs:
>>>>>> (XEN) Allocating 1:1 mappings totalling 1600MB for dom0:
>>>>>> (XEN) BANK[0] 0x00000010000000-0x00000070000000 (1536MB)
>>>>>> (XEN) BANK[1] 0x00000078000000-0x0000007c000000 (64MB)
>>>>>>
>>>>>> All the other ranges could be given as unallocated space:
>>>>>>
>>>>>> - 0x0 - 0x10000000
>>>>>> - 0x70000000 - 0x78000000
>>>>>> - 0x8_0000_0000 - 0x8_8000_0000
>>>>> Thank you for the ideas.
>>>>>
>>>>> If I got the idea correctly, yes, as these ranges represent the real
>>>>> RAM, so no I/O would be in conflict with them and as the result - no
>>>>> overlaps would be expected.
>>>>> But, I wonder, would this work if we have IOMMU enabled for Dom0 and
>>>>> need to establish 1:1 mapping for the DMA devices to work with grant
>>>>> mappings...
>>>>> In arm_iommu_map_page() we call guest_physmap_add_entry() with gfn =
>>>>> mfn, so the question is could we end up with this new gfn replacing
>>>>> the valid mapping
>>>>> (with gfn allocated from the safe region)?
>>>> Right, when we enable the IOMMU for dom0, Xen will add an extra
>>>> mapping with GFN == MFN for foreign and grant pages. This is because
>>>> Linux is not aware that whether a device is protected by an IOMMU.
>>>> Therefore it is assuming it is not and will use the MFN to configure
>>>> for DMA transaction.
>>>>
>>>> We can't remove the mapping without significant changes in Linux and
>>>> Xen. I would not mandate them for this work.
>>>>
>>>> That said, I think it would be acceptable to have different way to
>>>> find the region depending on the dom0 configuration. So we could use
>>>> the RAM not used by dom0 when the IOMMU is turned off.
>>> OK
>>>
>>>
>>>>>> The second best choice would be an hole: an address range not used by
>>>>>> anybody else (no reg property) and also not even mappable by a bus
>>> (not
>>>>>> covered by a ranges property). This is not the best choice because
>>>>>> there
>>>>>> can cases where physical resources appear afterwards.
>>>> Are you saying that the original device-tree doesn't even describe
>>>> them in any way (i.e. reserved...)?
>>>>
>>>>> Unfortunately, yes.
>>>> So the decision where the safe region is located will be done by Xen.
>>>> There is no involvement of the domain (it will discover the region
>>>> from the DT). Therefore, I don't think we need to think about
>>>> everything right now as we could adapt this is exact region is not
>>>> part of the stable ABI.
>>>>
>>>> The hotplug is one I would defer because this is not supported (and
>>>> quite likely not working) in Xen upstream today.
>>> Sounds reasonable.
>>>
>>>
>>>>
>>>> Now regarding the case where dom0 is using the IOMMU. The assumption
>>>> is Xen will be able to figure out all the regions used from the
>>>> firmware table (ACPI or DT).
>>>>
>>>> AFAIK, this assumption would be correct for DT. However, for ACPI, I
>>>> remember we were not able to find all the MMIOs region in Xen (see [1]
>>>> and [2]). So even this solution would not work for ACPI.
>>>>
>>>> If I am not mistaken, we don't support IOMMU with ACPI yet. So we
>>>> could defer the problem to when this is going to be supported.
>>> Sounds reasonable.
>>>
>>>
>>> To summarize:
>>>
>>> 0. Skip ACPI case for now, implement for DT case
>>>
>>> 1. If IOMMU is enabled for Dom0 -> provide holes found in Host DT as
>>> safe ranges
>>>
>> Does static allocation and direct mapping driver domain can be treated
>> as this case?
> I am not sure I can answer this question correctly due to the limited 
> knowledge of these features.
> 
> But, it feels to me that holes solution would work, at least I don't see 
> why not.
> 
> 
> I wonder, why these can't be treated as the case #2 (where we provide 
> not assigned RAM), I also don't see why not, however there might be 
> pitfalls with direct mapped driver domain. Julien, do you have any 
> opinion on this?

So whether the memory is statically allocated or dynamically allocated 
should not matter here. Direct-mapped domains should be treated the same 
way as dom0.

By that I mean if the IOMMU is not enabled for the domain, then we can 
use the unallocated RAM. Otherwise, we would need to find some holes.

Cheers,
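
To make the strategy above concrete, the per-domain selection could look
roughly like the sketch below. This is not part of the posted series; the
find_*() helpers are hypothetical stand-ins for case #1 (DT holes) and
case #2 (unused host RAM).

/* Sketch only: pick the safe-range strategy depending on whether the
 * domain is direct-mapped and whether it is protected by an IOMMU. */
static int find_safe_ranges(struct domain *d,
                            struct xen_unallocated_region *regions,
                            unsigned int *nr_regions)
{
    if ( !is_domain_direct_mapped(d) )
        /* Normal domU: a range above guest RAM can be reserved statically. */
        return find_guest_safe_range(d, regions, nr_regions);   /* hypothetical */

    if ( !is_iommu_enabled(d) )
        /* Case #2: host RAM not allocated to the direct-mapped domain. */
        return find_unused_host_ram(d, regions, nr_regions);    /* hypothetical */

    /* Case #1: holes in the host DT (no reg, not covered by any ranges). */
    return find_host_dt_holes(d, regions, nr_regions);          /* hypothetical */
}
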
Julien Grall Aug. 10, 2021, 4:28 p.m. UTC | #41
On 09/08/2021 22:18, Oleksandr wrote:
> 
> On 09.08.21 23:45, Julien Grall wrote:
> 
> 
> Hi Julien


Hi Oleksandr,

>> On 09/08/2021 19:24, Oleksandr wrote:
>>>
>>> On 09.08.21 18:42, Julien Grall wrote:
>>> 1. If IOMMU is enabled for Dom0 -> provide holes found in Host DT as 
>>> safe ranges
>>>
>>> I would take into the account holes >= 1MB.
>>
>> May I ask why 1MB?
> 
> Nothing special, just thinking to not bother with small regions which 
> would not be too useful overall, but could bloat resulting reg property.
> 
> Anyway, I would be ok with any sizes.

I was more interested in the reasoning behind your choice rather than 
saying we want more.

I agree that we may want to skip smaller regions. I guess that a high 
number of small regions will also increase the bookkeeping in Linux.

But I would probably suggest aligning the safe region to 2MB 
(a superpage for 4KB page granularity), just because the backend (or even 
Xen) may be able to do some optimization there.

Cheers,
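
As an illustration of the 2MB suggestion (a sketch only, not from the series;
ROUNDUP() and MB() are the usual Xen macros, the rest is made up for the
example), each candidate hole could be clamped before being reported:

#define SAFE_RANGE_ALIGN  MB(2)   /* superpage for 4KB page granularity */

/* Shrink [start, start + size) to a 2MB-aligned, 2MB-multiple sub-range.
 * Returns false if nothing usable is left once aligned. */
static bool clamp_safe_range(paddr_t *start, paddr_t *size)
{
    paddr_t s = ROUNDUP(*start, SAFE_RANGE_ALIGN);
    paddr_t e = (*start + *size) & ~(SAFE_RANGE_ALIGN - 1);

    if ( s >= e )
        return false;   /* hole too small: skip it */

    *start = s;
    *size  = e - s;

    return true;
}
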
Oleksandr Tyshchenko Aug. 10, 2021, 4:49 p.m. UTC | #42
On 10.08.21 19:21, Julien Grall wrote:

Hi Julien

>
>
> On 10/08/2021 12:58, Oleksandr wrote:
>>
>> On 10.08.21 09:34, Wei Chen wrote:
>>> Hi Oleksandr,
>>
>> Hi Wei, Julien.
> Hi,
>
>>>> -----Original Message-----
>>>> From: Oleksandr <olekstysh@gmail.com>
>>>> Sent: 10 August 2021 2:25
>>>> To: Julien Grall <julien@xen.org>
>>>> Cc: Stefano Stabellini <sstabellini@kernel.org>; Andrew Cooper
>>>> <andrew.cooper3@citrix.com>; xen-devel@lists.xenproject.org; Oleksandr
>>>> Tyshchenko <oleksandr_tyshchenko@epam.com>; Daniel De Graaf
>>>> <dgdegra@tycho.nsa.gov>; Daniel P. Smith 
>>>> <dpsmith@apertussolutions.com>;
>>>> Ian Jackson <iwj@xenproject.org>; Wei Liu <wl@xen.org>; George Dunlap
>>>> <george.dunlap@citrix.com>; Jan Beulich <jbeulich@suse.com>; Volodymyr
>>>> Babchuk <Volodymyr_Babchuk@epam.com>; Roger Pau Monné
>>>> <roger.pau@citrix.com>; Bertrand Marquis 
>>>> <Bertrand.Marquis@arm.com>; Wei
>>>> Chen <Wei.Chen@arm.com>
>>>> Subject: Re: [RFC PATCH] xen/memory: Introduce a hypercall to provide
>>>> unallocated space
>>>>
>>>>
>>>> On 09.08.21 18:42, Julien Grall wrote:
>>>>> Hi Oleksandr,
>>>>
>>>> Hi Julien.
>>>>
>>>>
>>>> Thank you for the input.
>>>>
>>>>
>>>>> On 07/08/2021 18:03, Oleksandr wrote:
>>>>>> On 06.08.21 03:30, Stefano Stabellini wrote:
>>>>>>
>>>>>> Hi Stefano
>>>>>>
>>>>>>> On Wed, 4 Aug 2021, Julien Grall wrote:
>>>>>>>>> +#define GUEST_SAFE_RANGE_BASE xen_mk_ullong(0xDE00000000) /*
>>>>>>>>> 128GB */
>>>>>>>>> +#define GUEST_SAFE_RANGE_SIZE xen_mk_ullong(0x0200000000)
>>>>>>>>>
>>>>>>>>> While the possible new DT bindings has not been agreed yet, I re-
>>>> used
>>>>>>>>> existing "reg" property under the hypervisor node to pass safe
>>>>>>>>> range as a
>>>>>>>>> second region,
>>>>>>>>> https://elixir.bootlin.com/linux/v5.14-
>>>> rc4/source/Documentation/devicetree/bindings/arm/xen.txt#L10:
>>>>>>>> So a single region works for a guest today, but for dom0 we will
>>>>>>>> need multiple
>>>>>>>> regions because it is may be difficult to find enough contiguous
>>>>>>>> space for a
>>>>>>>> single region.
>>>>>>>>
>>>>>>>> That said, as dom0 is mapped 1:1 (including some guest mapping),
>>>>>>>> there is also
>>>>>>>> the question where to allocate the safe region. For grant 
>>>>>>>> table, we
>>>>>>>> so far
>>>>>>>> re-use the Xen address space because it is assumed it will space
>>>>>>>> will always
>>>>>>>> be bigger than the grant table.
>>>>>>>>
>>>>>>>> I am not sure yet where we could allocate the safe regions.
>>>>>>>> Stefano, do you
>>>>>>>> have any ideas?
>>>>>>> The safest choice would be the address range corresponding to 
>>>>>>> memory
>>>>>>> (/memory) not already allocated to Dom0.
>>>>>>>
>>>>>>> For instance from my last boot logs:
>>>>>>> (XEN) Allocating 1:1 mappings totalling 1600MB for dom0:
>>>>>>> (XEN) BANK[0] 0x00000010000000-0x00000070000000 (1536MB)
>>>>>>> (XEN) BANK[1] 0x00000078000000-0x0000007c000000 (64MB)
>>>>>>>
>>>>>>> All the other ranges could be given as unallocated space:
>>>>>>>
>>>>>>> - 0x0 - 0x10000000
>>>>>>> - 0x70000000 - 0x78000000
>>>>>>> - 0x8_0000_0000 - 0x8_8000_0000
>>>>>> Thank you for the ideas.
>>>>>>
>>>>>> If I got the idea correctly, yes, as these ranges represent the real
>>>>>> RAM, so no I/O would be in conflict with them and as the result - no
>>>>>> overlaps would be expected.
>>>>>> But, I wonder, would this work if we have IOMMU enabled for Dom0 and
>>>>>> need to establish 1:1 mapping for the DMA devices to work with grant
>>>>>> mappings...
>>>>>> In arm_iommu_map_page() we call guest_physmap_add_entry() with gfn =
>>>>>> mfn, so the question is could we end up with this new gfn replacing
>>>>>> the valid mapping
>>>>>> (with gfn allocated from the safe region)?
>>>>> Right, when we enable the IOMMU for dom0, Xen will add an extra
>>>>> mapping with GFN == MFN for foreign and grant pages. This is because
>>>>> Linux is not aware that whether a device is protected by an IOMMU.
>>>>> Therefore it is assuming it is not and will use the MFN to configure
>>>>> for DMA transaction.
>>>>>
>>>>> We can't remove the mapping without significant changes in Linux and
>>>>> Xen. I would not mandate them for this work.
>>>>>
>>>>> That said, I think it would be acceptable to have different way to
>>>>> find the region depending on the dom0 configuration. So we could use
>>>>> the RAM not used by dom0 when the IOMMU is turned off.
>>>> OK
>>>>
>>>>
>>>>>>> The second best choice would be an hole: an address range not 
>>>>>>> used by
>>>>>>> anybody else (no reg property) and also not even mappable by a bus
>>>> (not
>>>>>>> covered by a ranges property). This is not the best choice because
>>>>>>> there
>>>>>>> can cases where physical resources appear afterwards.
>>>>> Are you saying that the original device-tree doesn't even describe
>>>>> them in any way (i.e. reserved...)?
>>>>>
>>>>>> Unfortunately, yes.
>>>>> So the decision where the safe region is located will be done by Xen.
>>>>> There is no involvement of the domain (it will discover the region
>>>>> from the DT). Therefore, I don't think we need to think about
>>>>> everything right now as we could adapt this is exact region is not
>>>>> part of the stable ABI.
>>>>>
>>>>> The hotplug is one I would defer because this is not supported (and
>>>>> quite likely not working) in Xen upstream today.
>>>> Sounds reasonable.
>>>>
>>>>
>>>>>
>>>>> Now regarding the case where dom0 is using the IOMMU. The assumption
>>>>> is Xen will be able to figure out all the regions used from the
>>>>> firmware table (ACPI or DT).
>>>>>
>>>>> AFAIK, this assumption would be correct for DT. However, for ACPI, I
>>>>> remember we were not able to find all the MMIOs region in Xen (see 
>>>>> [1]
>>>>> and [2]). So even this solution would not work for ACPI.
>>>>>
>>>>> If I am not mistaken, we don't support IOMMU with ACPI yet. So we
>>>>> could defer the problem to when this is going to be supported.
>>>> Sounds reasonable.
>>>>
>>>>
>>>> To summarize:
>>>>
>>>> 0. Skip ACPI case for now, implement for DT case
>>>>
>>>> 1. If IOMMU is enabled for Dom0 -> provide holes found in Host DT as
>>>> safe ranges
>>>>
>>> Does static allocation and direct mapping driver domain can be treated
>>> as this case?
>> I am not sure I can answer this question correctly due to the limited 
>> knowledge of these features.
>>
>> But, it feels to me that holes solution would work, at least I don't 
>> see why not.
>>
>>
>> I wonder, why these can't be treated as the case #2 (where we provide 
>> not assigned RAM), I also don't see why not, however there might be 
>> pitfalls with direct mapped driver domain. Julien, do you have any 
>> opinion on this?
>
> So whether the memory is statically allocated or dynamically allocated 
> should not matter here. For direct mapped domain, then they should be 
> treated the same way as dom0.
>
> By that I mean if the IOMMU is not enabled for the domain, then we can 
> use thew unallocated RAM. Otherwise, we would need to find some holes.

OK, thank you for the clarification.


>
>
> Cheers,
>
Oleksandr Tyshchenko Aug. 10, 2021, 5:03 p.m. UTC | #43
On 10.08.21 19:28, Julien Grall wrote:

Hi Julien.

>
>
> On 09/08/2021 22:18, Oleksandr wrote:
>>
>> On 09.08.21 23:45, Julien Grall wrote:
>>
>>
>> Hi Julien
>
>
> Hi Oleksandr,
>
>>> On 09/08/2021 19:24, Oleksandr wrote:
>>>>
>>>> On 09.08.21 18:42, Julien Grall wrote:
>>>> 1. If IOMMU is enabled for Dom0 -> provide holes found in Host DT 
>>>> as safe ranges
>>>>
>>>> I would take into the account holes >= 1MB.
>>>
>>> May I ask why 1MB?
>>
>> Nothing special, just thinking to not bother with small regions which 
>> would not be too useful overall, but could bloat resulting reg property.
>>
>> Anyway, I would be ok with any sizes.
>
> I was more interesting with the reasoning behind your choice rather 
> than saying we want more.
>
> I agree that we may want to skip smaller region. I am guess that an 
> high number of small regions will also increase the bookeeping in Linux.
>
> But I would probably suggest to align the safe region to a 2MB 
> (superpage for 4KB page granularity) just because the backend (or even 
> Xen) may be able to do some optimization there.

OK, agree regarding the alignment. But what about the size? I assume 
it should be a multiple of 2MB.


>
>
> Cheers,
>
Oleksandr Tyshchenko Aug. 13, 2021, 9:45 p.m. UTC | #44
On 09.08.21 17:51, Julien Grall wrote:
> Hi,

Hi Julien, all


>
> I am writing down here what we discussed on another thread and on IRC. 
> This will be easier to track in a single thread.
>
> On 04/08/2021 23:00, Julien Grall wrote:
>> On 04/08/2021 21:56, Oleksandr wrote:
>>> Now, I am wondering, would it be possible to update/clarify the 
>>> current "reg" purpose and use it to pass a safe unallocated space 
>>> for any Xen specific mappings (grant, foreign, whatever) instead of 
>>> just for the grant table region. In case, it is not allowed for any 
>>> reason (compatibility PoV, etc), would it be possible to extend a 
>>> property by passing an extra range separately, something similar to 
>>> how I described above?
>>
>> I think it should be fine to re-use the same region so long the size 
>> of the first bank is at least the size of the original region.
>
> While answering to the DT binding question on the DT ML, I realized 
> that this is probably not going to be fine because there is a bug in 
> Xen when mapping grant-table frame.
>
> The function gnttab_map_frame() is used to map the grant table frame. 
> If there is an old mapping, it will first remove it.
>
> The function is using the helper gnttab_get_frame_gfn() to find the 
> corresponding GFN or return INVALID_GFN if not mapped.
>
> On Arm, gnttab_get_frame_gfn() is implemented using an array indexed by 
> the grant table frame number. The trouble is we don't update the array 
> when the page is unmapped. So if the GFN is re-used before the 
> grant table is remapped, then we will end up removing whatever was 
> mapped there (this could be a foreign page...).
>
> This behavior already happens today as the toolstack will use the 
> first GFN of the region if Linux doesn't support the acquire resource 
> interface. We are getting away with it in Linux because the toolstack only 
> maps the first grant table frame and:
>  - Newer Linux will not use the region provided by the DT and nothing 
> will be mapped there.
>  - Older Linux will use the region but still map grant table frame 
> 0 to the same GFN.
>
> I am not sure about U-boot and other OSes here.
>
> This is not new but it is going to become a bigger source of 
> problems (read: more chances to hit it) as we try to re-use the first 
> region.
>
> This means the first region should be exclusively used for the 
> grant table (in a specific order) until the issue is properly fixed.
>
> A potential fix is to update the array in p2m_put_l3_page(). The 
> default max size of the array is 1024, so it might be fine to just 
> walk it (it would be simply a comparison).
>
> Note that this is not a problem on x86 because it is using the M2P. 
> So when a mapping is removed, the MFN -> GFN mapping will also be 
> removed.
>
> Cheers,


The fix is already pushed:

https://lore.kernel.org/xen-devel/1628890077-12545-1-git-send-email-olekstysh@gmail.com/


Thanks!
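
For reference, the array-walk idea described above boils down to something
like the sketch below (illustration only, not the pushed patch; the iteration
helpers are assumptions made for the example):

/* Called when a GFN is dropped from the P2M (e.g. from p2m_put_l3_page()):
 * clear any stale grant-table frame entry still recording that GFN, so a
 * later gnttab_map_frame() does not tear down an unrelated mapping. */
static void invalidate_gnttab_gfn(struct domain *d, gfn_t gfn)
{
    unsigned int i;

    /* The tracking array holds at most ~1024 entries by default, so a
     * simple linear comparison walk is acceptable. */
    for ( i = 0; i < nr_grant_frames_tracked(d); i++ )       /* hypothetical */
        if ( gfn_eq(tracked_frame_gfn(d, i), gfn) )          /* hypothetical */
            set_tracked_frame_gfn(d, i, INVALID_GFN);        /* hypothetical */
}
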
Oleksandr Tyshchenko Aug. 13, 2021, 11:49 p.m. UTC | #45
On 05.08.21 20:25, Julien Grall wrote:
> Hi Oleksandr,


Hi Julien, all


>
> On 05/08/2021 15:52, Oleksandr wrote:
>>
>> On 05.08.21 01:00, Julien Grall wrote:
>>>
>>>
>>> On 04/08/2021 21:56, Oleksandr wrote:
>>>>
>>>> Hi Julien, Stefano.
>>>
>>> Hi Oleksandr,
>>
>>
>> Hi, Julien
>>
>>
>> Thank you for the prompt reply and explanations.
>>
>>
>>>
>>>>
>>>> On 02.08.21 22:12, Oleksandr wrote:
>>>> I have done some experiments with Xen and toolstack according to 
>>>> the discussion above. So, I re-used DTB to pass a safe range to the 
>>>> domain. For the range I borrowed some space from the second RAM bank.
>>>>
>>>> -#define GUEST_RAM1_BASE   xen_mk_ullong(0x0200000000) /* 1016GB of 
>>>> RAM @ 8GB */
>>>> -#define GUEST_RAM1_SIZE   xen_mk_ullong(0xfe00000000)
>>>> +#define GUEST_RAM1_BASE   xen_mk_ullong(0x0200000000) /* 888GB of 
>>>> RAM @ 8GB */
>>>> +#define GUEST_RAM1_SIZE   xen_mk_ullong(0xDE00000000)
>>>> +
>>>
>>> I am a bit split on reducing the amount of RAM. On one hand, large 
>>> guests are not unheard of on the server side (at least in the x86 world). 
>>> On the other hand, I am not aware of anyone using Xen on Arm in such 
>>> a setup.
>>>
>>> So technically this will be a regression, but it may be OK.
>>
>> I got it.
>>
>>
>>>
>>>
>>> Regarding the range, this will be a problem as Xen configures the 
>>> number of IPA bits based on the PA bits. The lowest possible 
>>> address space size on 64-bit is 4GB.
>>>
>>> From my understanding, this is because the number of IPA bits 
>>> supported is constrained by the PA bits. So the position and the size 
>>> of the region would need to depend on the P2M configuration.
>>
>> Indeed, I missed the fact that the IPA bits on Arm64 might be < 40 bits. 
>> I remember we select p2m_ipa_bits in setup_virt_paging() depending 
>> on pabits; moreover, p2m_ipa_bits might even be restricted by some 
>> external entity (the IOMMU, if the P2M is shared).
>>
>>
>>>
>>> For simplicity, this could be the last few X bytes of the supported 
>>> address space.
>> ok, agree. To summarize, it sounds like we can't use a fixed 
>> safe range as in my example; it must be variable. Well, I hope we 
>> will be able to achieve this without reducing the total amount of 
>> domain RAM up front (GUEST_RAM1_SIZE). After all, we know the IPA 
>> size and the domain RAM in advance, so we certainly can choose the 
>> start and size of the range. In case we can't find a 
>> suitably large chunk (for example, when IPA bits = 32 and the domain has 
>> a lot of RAM assigned, so that almost all address space 
>> below 4GB is in use), we won't expose a safe range to the domain at 
>> all, and the domain will just fall back to using real pages instead 
>> (which is actually how it currently behaves on Arm).
>
> I think it would be fine for a first approach. We can refine it in the 
> future. What matters is that we correctly define the binding/hypercall.
>
>>
>> A side note: we would likely need the toolstack (that generates 
>> device-tree for guests) to query IPA size, or similar.
>
> I think we can use XEN_SYSCTL_* (or possibly hypfs) for that.

The RFC patch (which is one of the prereq patches for safe range) is 
already pushed:
https://lore.kernel.org/xen-devel/1628897304-20793-1-git-send-email-olekstysh@gmail.com/
What we need is to reach an agreement regarding the interface. We can 
discuss it there.


Thanks!


>
>>>
>>>
>>> For 32-bit domain, we also need to make sure the address is usable 
>>> for domain short page tables (not too long ago Debian was shipping 
>>> the kernel with them rather than LPAE). I haven't yet checked what's 
>>> the limit here.
>>
>> Hmm, I didn't take this use-case into the account. So, I assume we 
>> need the safe range to be located below 4GB if is_32bit_domain() 
>> returns true.
>
> Yes. Or we can say that if you are using a 32-bit domain then we don't 
> (yet) support a safe range.
>>> So we would need some heuristic to decide whether to steal some RAM 
>>> or use the safe space.
>>> Another possibility would be to add a new compatible in the DT that 
>>> indicates the region is "big" enough.
>> I like the last idea; did you perhaps mean a new (optional) property 
>> rather than a new compatible? Let's say "xen,safe-range" or 
>> "xen,extended-regions" ...
>
> I actually meant adding an extra compatible because this is 
> technically a change of the binding (even though it is backward 
> compatible).
>
> Although, I would be OK with the property. You may first want to ask 
> the Device-Tree folks how they expect a binding to be extended in a 
> backward compatible way.
>
> Cheers,
>
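
A rough sketch of the "last few bytes of the supported address space" idea
discussed above (not part of the posted series; the getters and the size are
assumptions for the example):

/* Place the safe range at the top of the guest physical address space,
 * whose width depends on the per-domain IPA bits, and expose nothing if
 * the address space is too small (e.g. 32-bit IPA with lots of RAM). */
static int pick_variable_safe_range(struct domain *d,
                                    paddr_t *base, paddr_t *size)
{
    paddr_t ipa_top = 1ULL << domain_ipa_bits(d);        /* hypothetical */
    paddr_t sz = GB(8);                                  /* example size */

    if ( ipa_top < sz || ipa_top - sz < domain_ram_end(d) )  /* hypothetical */
        return -ENOSPC;  /* no safe range: the guest falls back to real pages */

    *base = ipa_top - sz;
    *size = sz;

    return 0;
}
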
Julien Grall Aug. 17, 2021, 5:53 p.m. UTC | #46
Hi Oleksandr,

On 10/08/2021 18:03, Oleksandr wrote:
> 
> On 10.08.21 19:28, Julien Grall wrote:
> 
> Hi Julien.
> 
>>
>>
>> On 09/08/2021 22:18, Oleksandr wrote:
>>>
>>> On 09.08.21 23:45, Julien Grall wrote:
>>>
>>>
>>> Hi Julien
>>
>>
>> Hi Oleksandr,
>>
>>>> On 09/08/2021 19:24, Oleksandr wrote:
>>>>>
>>>>> On 09.08.21 18:42, Julien Grall wrote:
>>>>> 1. If IOMMU is enabled for Dom0 -> provide holes found in Host DT 
>>>>> as safe ranges
>>>>>
>>>>> I would take into the account holes >= 1MB.
>>>>
>>>> May I ask why 1MB?
>>>
>>> Nothing special, just thinking to not bother with small regions which 
>>> would not be too useful overall, but could bloat resulting reg property.
>>>
>>> Anyway, I would be ok with any sizes.
>>
>> I was more interesting with the reasoning behind your choice rather 
>> than saying we want more.
>>
>> I agree that we may want to skip smaller region. I am guess that an 
>> high number of small regions will also increase the bookeeping in Linux.
>>
>> But I would probably suggest to align the safe region to a 2MB 
>> (superpage for 4KB page granularity) just because the backend (or even 
>> Xen) may be able to do some optimization there.
> 
> OK, agree regarding the alignment. But, what about the size? I assume, 
> it should be a multiple of 2MB.

I would say yes. In the documentation for the guest, I would just write 
"64KB" to give us some flexibility to reduce the alignment/size if we 
encounter a platform where we can meet those limits (hopefully there is 
none...).

Cheers,
Julien Grall Aug. 17, 2021, 5:54 p.m. UTC | #47
On 17/08/2021 18:53, Julien Grall wrote:
> Hi Oleksandr,
> 
> On 10/08/2021 18:03, Oleksandr wrote:
>>
>> On 10.08.21 19:28, Julien Grall wrote:
>>
>> Hi Julien.
>>
>>>
>>>
>>> On 09/08/2021 22:18, Oleksandr wrote:
>>>>
>>>> On 09.08.21 23:45, Julien Grall wrote:
>>>>
>>>>
>>>> Hi Julien
>>>
>>>
>>> Hi Oleksandr,
>>>
>>>>> On 09/08/2021 19:24, Oleksandr wrote:
>>>>>>
>>>>>> On 09.08.21 18:42, Julien Grall wrote:
>>>>>> 1. If IOMMU is enabled for Dom0 -> provide holes found in Host DT 
>>>>>> as safe ranges
>>>>>>
>>>>>> I would take into the account holes >= 1MB.
>>>>>
>>>>> May I ask why 1MB?
>>>>
>>>> Nothing special, just thinking to not bother with small regions 
>>>> which would not be too useful overall, but could bloat resulting reg 
>>>> property.
>>>>
>>>> Anyway, I would be ok with any sizes.
>>>
>>> I was more interesting with the reasoning behind your choice rather 
>>> than saying we want more.
>>>
>>> I agree that we may want to skip smaller region. I am guess that an 
>>> high number of small regions will also increase the bookeeping in Linux.
>>>
>>> But I would probably suggest to align the safe region to a 2MB 
>>> (superpage for 4KB page granularity) just because the backend (or 
>>> even Xen) may be able to do some optimization there.
>>
>> OK, agree regarding the alignment. But, what about the size? I assume, 
>> it should be a multiple of 2MB.
> 
> I would say yes. In the documentation for the guest, I would just write 
> "64KB" to give us some flexibility to reduce the alignment/size if we 
> encounter platform where we can meet those limits (hopefully there is 
> none...).

Whoops, I meant cannot rather than can.

Cheers,
Julien Grall Aug. 17, 2021, 5:59 p.m. UTC | #48
Hi Jan,

On 03/08/2021 13:49, Jan Beulich wrote:
>>>>> Once a safe range (or ranges) has been chosen, any subsequent action
>>>>> which overlaps with the ranges must be rejected, as it will violate the
>>>>> guarantees provided.
>>>>>
>>>>> Furthermore, the ranges should be made available to the guest via normal
>>>>> memory map means.  On x86, this is via the E820 table, and on ARM I
>>>>> presume the DTB.  There is no need for a new hypercall.
>>>>
>>>> Device-Tree only works if you have a guest using it. How about ACPI?
>>>
>>> ACPI inherits E820 from x86 (its a trivial format), and UEFI was also
>>> based on it.
>>>
>>> But whichever...  All firmware interfaces have a memory map.
>>
>> This will be UEFI memory map. However, I am a bit confused how we can
>> tell the OS the region will be used for grant/foreign mapping. Is it
>> possible to reserved a new type?
> 
> As with about any non-abandoned specification it is in principle
> possible to define/reserve new types. Question how practical it is,
> i.e. in particular how long it may take to get to the point where
> we have a firmly reserved type. Short of this I wonder whether you,
> Andrew, were thinking to re-use an existing type (in which case the
> question of disambiguation arises).

Copying/pasting the IRC discussion related to this:

[11:32:19] <Diziet> julieng: I have skimread the thread "[RFC PATCH] 
xen/memory: Introduce a hypercall to provide unallocated space"
[11:32:56] <Diziet> My impression is that it is converging on a workable 
solution but I am not sure.  Does it need any help ?
[12:20:32] <julieng> Diziet: I think we have a solution for Arm and DT. 
We are waiting on andyhhp for the ACPI part. He suggested to use the 
memory map but it is unclear how we could describe the safe region.
[13:01:49] <andyhhp> that's easy, seeing as we already have a hypercall to 
convey that information, but feel free to skip the x86 side for v1 if 
that helps
[13:02:15] <andyhhp> it wants an extension to the PVH spec to define a 
new memory type
[13:04:09] <julieng> andyhhp: This doesn't really address the question 
of how we can define the memory type because this is not a spec we own. 
See 5176e91c-1971-9004-af65-7a4aefc7eb78@suse.com for more details.
[13:04:27] <andyhhp> after which it wants to appear in XENMEM_memory_map
[13:04:32] <andyhhp> this is a spec we own
[13:04:43] <julieng> We don't own the E820.
[13:05:19] <julieng> Nor the UEFI memory map.
[13:05:24] <andyhhp> no, we don't, but that's not the spec in question
[13:06:06] <andyhhp> the spec in question is the PVH start info and/or 
XENMEM_memory_map, both of which are "in the format of the E820" table, 
not "is an E820 table"
[13:06:27] <andyhhp> with almost 4 billion type identifiers available
[13:07:03] <julieng> So what you are saying is let's pick up a random 
number and hope no-one will use it?
[13:07:34] <julieng> Because neither XENMEM_memory_map nor PVH start 
info exist for ACPI on Arm.
[13:08:17] <andyhhp> we (xen) are the source of this information, via a 
Xen specified API/ABI
[13:08:41] <andyhhp> we are entirely within our rights to document an 
extension, which has defined meaning under Xen
[13:09:03] <andyhhp> and yeah - you choose some Xen specific constant to 
go in the high bits of the type id or something
[13:09:04] <julieng> I agree for a domU. But for dom0, the memory map is 
the same as the host. So we have to make sure the number doesn't clash.
[13:09:49] <andyhhp> the chance of actually having a collision is 0, 
because in 30 years or so, about 10 types have been declared and handed out
[13:10:15] <andyhhp> and if a collision does occur, we add a second 
hypercall saying "this is a raw non-xenified E820 table, and here is the 
xenified one"

> 
> As a result I wonder whether a "middle" approach wouldn't be better:
> Have the range be determined up front (by tool stack or Xen), but
> communicate it to the guest by PV means (hypercall, shared info,
> start info, or yet some other table).
> 
> Jan
>
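
For illustration, the "Xen-specific constant in the high bits of the type id"
idea from the IRC log could look like the snippet below; the values are made
up and nothing here has been reserved or agreed:

/* Hypothetical Xen-defined type for the safe range, as it might appear in
 * the map returned by XENMEM_memory_map or in the PVH start info (both are
 * "in the format of" an E820 table, not an E820 table proper). */
#define E820_XEN_TYPE_BASE   0x58454E00U  /* "XEN" in the high bytes, made up */
#define E820_XEN_SAFE_RANGE  (E820_XEN_TYPE_BASE | 0x1)
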
Oleksandr Tyshchenko Aug. 27, 2021, 8:34 p.m. UTC | #49
On 17.08.21 20:54, Julien Grall wrote:

Hi Julien


> On 17/08/2021 18:53, Julien Grall wrote:
>> Hi Oleksandr,
>>
>> On 10/08/2021 18:03, Oleksandr wrote:
>>>
>>> On 10.08.21 19:28, Julien Grall wrote:
>>>
>>> Hi Julien.
>>>
>>>>
>>>>
>>>> On 09/08/2021 22:18, Oleksandr wrote:
>>>>>
>>>>> On 09.08.21 23:45, Julien Grall wrote:
>>>>>
>>>>>
>>>>> Hi Julien
>>>>
>>>>
>>>> Hi Oleksandr,
>>>>
>>>>>> On 09/08/2021 19:24, Oleksandr wrote:
>>>>>>>
>>>>>>> On 09.08.21 18:42, Julien Grall wrote:
>>>>>>> 1. If IOMMU is enabled for Dom0 -> provide holes found in Host 
>>>>>>> DT as safe ranges
>>>>>>>
>>>>>>> I would take into the account holes >= 1MB.
>>>>>>
>>>>>> May I ask why 1MB?
>>>>>
>>>>> Nothing special, just thinking to not bother with small regions 
>>>>> which would not be too useful overall, but could bloat resulting 
>>>>> reg property.
>>>>>
>>>>> Anyway, I would be ok with any sizes.
>>>>
>>>> I was more interesting with the reasoning behind your choice rather 
>>>> than saying we want more.
>>>>
>>>> I agree that we may want to skip smaller region. I am guess that an 
>>>> high number of small regions will also increase the bookeeping in 
>>>> Linux.
>>>>
>>>> But I would probably suggest to align the safe region to a 2MB 
>>>> (superpage for 4KB page granularity) just because the backend (or 
>>>> even Xen) may be able to do some optimization there.
>>>
>>> OK, agree regarding the alignment. But, what about the size? I 
>>> assume, it should be a multiple of 2MB.
>>
>> I would say yes. In the documentation for the guest, I would just 
>> write "64KB" to give us some flexibility to reduce the alignment/size 
>> if we encounter platform where we can meet those limits (hopefully 
>> there is none...).
>
> Whoops, I meant cannot rather than can.

ok, makes sense, thank you for the clarification.



>
> Cheers,
>
Henry Wang Sept. 7, 2021, 8:53 a.m. UTC | #50
Hi Oleksandr,

> -----Original Message-----
> From: Xen-devel <xen-devel-bounces@lists.xenproject.org> On Behalf Of
> Oleksandr Tyshchenko
> Sent: Thursday, July 29, 2021 12:19 AM
> To: xen-devel@lists.xenproject.org
> Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>; Daniel De
> Graaf <dgdegra@tycho.nsa.gov>; Daniel P. Smith
> <dpsmith@apertussolutions.com>; Ian Jackson <iwj@xenproject.org>; Wei
> Liu <wl@xen.org>; Andrew Cooper <andrew.cooper3@citrix.com>; George
> Dunlap <george.dunlap@citrix.com>; Jan Beulich <jbeulich@suse.com>;
> Julien Grall <julien@xen.org>; Stefano Stabellini <sstabellini@kernel.org>;
> Volodymyr Babchuk <Volodymyr_Babchuk@epam.com>; Roger Pau Monné
> <roger.pau@citrix.com>; Bertrand Marquis <Bertrand.Marquis@arm.com>;
> Wei Chen <Wei.Chen@arm.com>
> Subject: [RFC PATCH] xen/memory: Introduce a hypercall to provide
> unallocated space
> 
> From: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
> 
> Add XENMEM_get_unallocated_space hypercall which purpose is to
> query hypervisor to find regions of guest physical address space
> which are unused and can be used to create grant/foreign mappings
> instead of wasting real pages from the domain memory for
> establishing these mappings. The problem with the current Linux
> on Xen on Arm behaviour is if we want to map some guest memory
> regions in advance or to perform cache mappings in the backend
> we might run out of memory in the host (see XSA-300).
> This of course, depends on the both host and guest memory sizes.
> 
> The "unallocated space" can't be figured out precisely by
> the domain on Arm without hypervisor involvement:
> - not all device I/O regions are known by the time domain starts
>   creating grant/foreign mappings
> - the Dom0 is not aware of memory regions used for the identity
>   mappings needed for the PV drivers to work
> In both cases we might end up re-using these regions by
> a mistake. So, the hypervisor which maintains the P2M for the domain
> is in the best position to provide "unallocated space".
> 
> The arch code is in charge of finding these regions and filling
> in corresponding array in new helper arch_get_unallocated_space().
> 
> This patch implements common and Arm parts, the x86 specific bits
> are left uniplemented for now.
> 
> Signed-off-by: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>

Tested-by: Henry Wang <Henry.Wang@arm.com>

Kind regards,

Henry

> ---
> Current patch is based on the latest staging branch:
> 73c932d tools/libxc: use uint32_t for pirq in xc_domain_irq_permission
> and also available at:
> https://github.com/otyshchenko1/xen/commits/map_opt_ml1
> 
> The corresponding Linux changes you can find at:
> https://github.com/otyshchenko1/linux/commits/map_opt_ml1
> I am going to publish them soon.
> ---
>  tools/flask/policy/modules/dom0.te  |  2 +-
>  xen/arch/arm/mm.c                   | 18 ++++++++++++
>  xen/common/memory.c                 | 56
> +++++++++++++++++++++++++++++++++++++
>  xen/include/asm-arm/mm.h            |  4 +++
>  xen/include/asm-x86/mm.h            |  8 ++++++
>  xen/include/public/memory.h         | 37 +++++++++++++++++++++++-
>  xen/include/xsm/dummy.h             |  6 ++++
>  xen/include/xsm/xsm.h               |  6 ++++
>  xen/xsm/dummy.c                     |  1 +
>  xen/xsm/flask/hooks.c               |  6 ++++
>  xen/xsm/flask/policy/access_vectors |  2 ++
>  11 files changed, 144 insertions(+), 2 deletions(-)
> 
> diff --git a/tools/flask/policy/modules/dom0.te
> b/tools/flask/policy/modules/dom0.te
> index 0a63ce1..b40091f 100644
> --- a/tools/flask/policy/modules/dom0.te
> +++ b/tools/flask/policy/modules/dom0.te
> @@ -39,7 +39,7 @@ allow dom0_t dom0_t:domain {
>  };
>  allow dom0_t dom0_t:domain2 {
>  	set_cpu_policy gettsc settsc setscheduler set_vnumainfo
> -	get_vnumainfo psr_cmt_op psr_alloc get_cpu_policy
> +	get_vnumainfo psr_cmt_op psr_alloc get_cpu_policy
> get_unallocated_space
>  };
>  allow dom0_t dom0_t:resource { add remove };
> 
> diff --git a/xen/arch/arm/mm.c b/xen/arch/arm/mm.c
> index 0e07335..8a70fe0 100644
> --- a/xen/arch/arm/mm.c
> +++ b/xen/arch/arm/mm.c
> @@ -1635,6 +1635,24 @@ unsigned long get_upper_mfn_bound(void)
>      return max_page - 1;
>  }
> 
> +#define GPFN_ALIGN_TO_GB(x) (((x) + (1UL << 18) - 1) & (~((1UL << 18) - 1)))
> +
> +int arch_get_unallocated_space(struct domain *d,
> +                               struct xen_unallocated_region *regions,
> +                               unsigned int *nr_regions)
> +{
> +    /*
> +     * Everything not mapped into P2M could be treated as unused. Taking
> into
> +     * the account that we have a lot of unallocated space above
> max_mapped_gfn
> +     * and in order to simplify things, just provide a single 8GB memory
> +     * region aligned to 1GB boundary for now.
> +     */
> +    regions->start_gpfn =
> GPFN_ALIGN_TO_GB(domain_get_maximum_gpfn(d) + 1);
> +    regions->nr_gpfns = (1UL << 18) * 8;
> +    *nr_regions = 1;
> +
> +    return 0;
> +}
>  /*
>   * Local variables:
>   * mode: C
> diff --git a/xen/common/memory.c b/xen/common/memory.c
> index e07bd9a..3f9b816 100644
> --- a/xen/common/memory.c
> +++ b/xen/common/memory.c
> @@ -1811,6 +1811,62 @@ long do_memory_op(unsigned long cmd,
> XEN_GUEST_HANDLE_PARAM(void) arg)
>              start_extent);
>          break;
> 
> +    case XENMEM_get_unallocated_space:
> +    {
> +        struct xen_get_unallocated_space xgus;
> +        struct xen_unallocated_region *regions;
> +
> +        if ( unlikely(start_extent) )
> +            return -EINVAL;
> +
> +        if ( copy_from_guest(&xgus, arg, 1) ||
> +             !guest_handle_okay(xgus.buffer, xgus.nr_regions) )
> +            return -EFAULT;
> +
> +        d = rcu_lock_domain_by_any_id(xgus.domid);
> +        if ( d == NULL )
> +            return -ESRCH;
> +
> +        rc = xsm_get_unallocated_space(XSM_HOOK, d);
> +        if ( rc )
> +        {
> +            rcu_unlock_domain(d);
> +            return rc;
> +        }
> +
> +        if ( !xgus.nr_regions || xgus.nr_regions >
> XEN_MAX_UNALLOCATED_REGIONS )
> +        {
> +            rcu_unlock_domain(d);
> +            return -EINVAL;
> +        }
> +
> +        regions = xzalloc_array(struct xen_unallocated_region,
> xgus.nr_regions);
> +        if ( !regions )
> +        {
> +            rcu_unlock_domain(d);
> +            return -ENOMEM;
> +        }
> +
> +        rc = arch_get_unallocated_space(d, regions, &xgus.nr_regions);
> +        if ( rc )
> +            goto unallocated_out;
> +
> +        if ( __copy_to_guest(xgus.buffer, regions, xgus.nr_regions) )
> +        {
> +            rc = -EFAULT;
> +            goto unallocated_out;
> +        }
> +
> +        if ( __copy_to_guest(arg, &xgus, 1) )
> +            rc = -EFAULT;
> +
> +unallocated_out:
> +        rcu_unlock_domain(d);
> +        xfree(regions);
> +
> +        break;
> +    }
> +
>      default:
>          rc = arch_memory_op(cmd, arg);
>          break;
> diff --git a/xen/include/asm-arm/mm.h b/xen/include/asm-arm/mm.h
> index ded74d2..99a2824 100644
> --- a/xen/include/asm-arm/mm.h
> +++ b/xen/include/asm-arm/mm.h
> @@ -359,6 +359,10 @@ void clear_and_clean_page(struct page_info *page);
> 
>  unsigned int arch_get_dma_bitsize(void);
> 
> +int arch_get_unallocated_space(struct domain *d,
> +                               struct xen_unallocated_region *regions,
> +                               unsigned int *nr_regions);
> +
>  #endif /*  __ARCH_ARM_MM__ */
>  /*
>   * Local variables:
> diff --git a/xen/include/asm-x86/mm.h b/xen/include/asm-x86/mm.h
> index cb90527..6244bc4 100644
> --- a/xen/include/asm-x86/mm.h
> +++ b/xen/include/asm-x86/mm.h
> @@ -652,4 +652,12 @@ static inline bool arch_mfn_in_directmap(unsigned
> long mfn)
>      return mfn <= (virt_to_mfn(eva - 1) + 1);
>  }
> 
> +static inline
> +int arch_get_unallocated_space(struct domain *d,
> +                               struct xen_unallocated_region *regions,
> +                               unsigned int *nr_regions)
> +{
> +    return -EOPNOTSUPP;
> +}
> +
>  #endif /* __ASM_X86_MM_H__ */
> diff --git a/xen/include/public/memory.h b/xen/include/public/memory.h
> index 383a946..040201b 100644
> --- a/xen/include/public/memory.h
> +++ b/xen/include/public/memory.h
> @@ -739,7 +739,42 @@ struct xen_vnuma_topology_info {
>  typedef struct xen_vnuma_topology_info xen_vnuma_topology_info_t;
>  DEFINE_XEN_GUEST_HANDLE(xen_vnuma_topology_info_t);
> 
> -/* Next available subop number is 29 */
> +/*
> + * Get the unallocated space (regions of guest physical address space which
> + * are unused) and can be used to create grant/foreign mappings.
> + */
> +#define XENMEM_get_unallocated_space 29
> +struct xen_unallocated_region {
> +    xen_pfn_t start_gpfn;
> +    xen_ulong_t nr_gpfns;
> +};
> +typedef struct xen_unallocated_region xen_unallocated_region_t;
> +DEFINE_XEN_GUEST_HANDLE(xen_unallocated_region_t);
> +
> +#define XEN_MAX_UNALLOCATED_REGIONS 32
> +
> +struct xen_get_unallocated_space {
> +    /* IN - Which domain to provide unallocated space for */
> +    domid_t domid;
> +
> +    /*
> +     * IN/OUT - As an IN parameter number of memory regions which
> +     *          can be written to the buffer (maximum size of the array)
> +     *          As OUT parameter number of memory regions which
> +     *          have been written to the buffer
> +     */
> +    unsigned int nr_regions;
> +
> +    /*
> +     * OUT - An array of memory regions, the regions must be placed in
> +     *       ascending order, there must be no overlap between them.
> +     */
> +    XEN_GUEST_HANDLE(xen_unallocated_region_t) buffer;
> +};
> +typedef struct xen_get_unallocated_space xen_get_unallocated_space_t;
> +DEFINE_XEN_GUEST_HANDLE(xen_get_unallocated_space_t);
> +
> +/* Next available subop number is 30 */
> 
>  #endif /* __XEN_PUBLIC_MEMORY_H__ */
> 
> diff --git a/xen/include/xsm/dummy.h b/xen/include/xsm/dummy.h
> index 363c6d7..2761fbd 100644
> --- a/xen/include/xsm/dummy.h
> +++ b/xen/include/xsm/dummy.h
> @@ -772,3 +772,9 @@ static XSM_INLINE int
> xsm_domain_resource_map(XSM_DEFAULT_ARG struct domain *d)
>      XSM_ASSERT_ACTION(XSM_DM_PRIV);
>      return xsm_default_action(action, current->domain, d);
>  }
> +
> +static XSM_INLINE int xsm_get_unallocated_space(XSM_DEFAULT_ARG
> struct domain *d)
> +{
> +    XSM_ASSERT_ACTION(XSM_HOOK);
> +    return xsm_default_action(action, current->domain, d);
> +}
> diff --git a/xen/include/xsm/xsm.h b/xen/include/xsm/xsm.h
> index ad3cddb..b01619c 100644
> --- a/xen/include/xsm/xsm.h
> +++ b/xen/include/xsm/xsm.h
> @@ -180,6 +180,7 @@ struct xsm_operations {
>      int (*dm_op) (struct domain *d);
>      int (*xen_version) (uint32_t cmd);
>      int (*domain_resource_map) (struct domain *d);
> +    int (*get_unallocated_space) (struct domain *d);
>  #ifdef CONFIG_ARGO
>      int (*argo_enable) (const struct domain *d);
>      int (*argo_register_single_source) (const struct domain *d,
> @@ -701,6 +702,11 @@ static inline int xsm_domain_resource_map(xsm_default_t def, struct domain *d)
>      return xsm_ops->domain_resource_map(d);
>  }
> 
> +static inline int xsm_get_unallocated_space(xsm_default_t def, struct domain *d)
> +{
> +    return xsm_ops->get_unallocated_space(d);
> +}
> +
>  #ifdef CONFIG_ARGO
>  static inline int xsm_argo_enable(const struct domain *d)
>  {
> diff --git a/xen/xsm/dummy.c b/xen/xsm/dummy.c
> index de44b10..45efadb 100644
> --- a/xen/xsm/dummy.c
> +++ b/xen/xsm/dummy.c
> @@ -151,6 +151,7 @@ void __init xsm_fixup_ops (struct xsm_operations *ops)
>      set_to_dummy_if_null(ops, dm_op);
>      set_to_dummy_if_null(ops, xen_version);
>      set_to_dummy_if_null(ops, domain_resource_map);
> +    set_to_dummy_if_null(ops, get_unallocated_space);
>  #ifdef CONFIG_ARGO
>      set_to_dummy_if_null(ops, argo_enable);
>      set_to_dummy_if_null(ops, argo_register_single_source);
> diff --git a/xen/xsm/flask/hooks.c b/xen/xsm/flask/hooks.c
> index f1a1217..38a9b20 100644
> --- a/xen/xsm/flask/hooks.c
> +++ b/xen/xsm/flask/hooks.c
> @@ -1715,6 +1715,11 @@ static int flask_domain_resource_map(struct domain *d)
>      return current_has_perm(d, SECCLASS_DOMAIN2, DOMAIN2__RESOURCE_MAP);
>  }
> 
> +static int flask_get_unallocated_space(struct domain *d)
> +{
> +    return current_has_perm(d, SECCLASS_DOMAIN2, DOMAIN2__GET_UNALLOCATED_SPACE);
> +}
> +
>  #ifdef CONFIG_ARGO
>  static int flask_argo_enable(const struct domain *d)
>  {
> @@ -1875,6 +1880,7 @@ static struct xsm_operations flask_ops = {
>      .dm_op = flask_dm_op,
>      .xen_version = flask_xen_version,
>      .domain_resource_map = flask_domain_resource_map,
> +    .get_unallocated_space = flask_get_unallocated_space,
>  #ifdef CONFIG_ARGO
>      .argo_enable = flask_argo_enable,
>      .argo_register_single_source = flask_argo_register_single_source,
> diff --git a/xen/xsm/flask/policy/access_vectors b/xen/xsm/flask/policy/access_vectors
> index 6359c7f..3cbdc19 100644
> --- a/xen/xsm/flask/policy/access_vectors
> +++ b/xen/xsm/flask/policy/access_vectors
> @@ -245,6 +245,8 @@ class domain2
>      resource_map
>  # XEN_DOMCTL_get_cpu_policy
>      get_cpu_policy
> +# XENMEM_get_unallocated_space
> +    get_unallocated_space
>  }
> 
> # Similar to class domain, but primarily contains domctls related to HVM domains
> --
> 2.7.4
>
Oleksandr Tyshchenko Sept. 7, 2021, 9:34 p.m. UTC | #51
On 07.09.21 11:53, Henry Wang wrote:
> Hi Oleksandr,

Hi Henry


>
>> -----Original Message-----
>> From: Xen-devel <xen-devel-bounces@lists.xenproject.org> On Behalf Of
>> Oleksandr Tyshchenko
>> Sent: Thursday, July 29, 2021 12:19 AM
>> To: xen-devel@lists.xenproject.org
>> Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>; Daniel De
>> Graaf <dgdegra@tycho.nsa.gov>; Daniel P. Smith
>> <dpsmith@apertussolutions.com>; Ian Jackson <iwj@xenproject.org>; Wei
>> Liu <wl@xen.org>; Andrew Cooper <andrew.cooper3@citrix.com>; George
>> Dunlap <george.dunlap@citrix.com>; Jan Beulich <jbeulich@suse.com>;
>> Julien Grall <julien@xen.org>; Stefano Stabellini <sstabellini@kernel.org>;
>> Volodymyr Babchuk <Volodymyr_Babchuk@epam.com>; Roger Pau Monné
>> <roger.pau@citrix.com>; Bertrand Marquis <Bertrand.Marquis@arm.com>;
>> Wei Chen <Wei.Chen@arm.com>
>> Subject: [RFC PATCH] xen/memory: Introduce a hypercall to provide
>> unallocated space
>>
>> From: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
>>
>> Add XENMEM_get_unallocated_space hypercall which purpose is to
>> query hypervisor to find regions of guest physical address space
>> which are unused and can be used to create grant/foreign mappings
>> instead of wasting real pages from the domain memory for
>> establishing these mappings. The problem with the current Linux
>> on Xen on Arm behaviour is if we want to map some guest memory
>> regions in advance or to perform cache mappings in the backend
>> we might run out of memory in the host (see XSA-300).
>> This of course, depends on the both host and guest memory sizes.
>>
>> The "unallocated space" can't be figured out precisely by
>> the domain on Arm without hypervisor involvement:
>> - not all device I/O regions are known by the time domain starts
>>    creating grant/foreign mappings
>> - the Dom0 is not aware of memory regions used for the identity
>>    mappings needed for the PV drivers to work
>> In both cases we might end up re-using these regions by
>> a mistake. So, the hypervisor which maintains the P2M for the domain
>> is in the best position to provide "unallocated space".
>>
>> The arch code is in charge of finding these regions and filling
>> in corresponding array in new helper arch_get_unallocated_space().
>>
>> This patch implements common and Arm parts, the x86 specific bits
>> are left uniplemented for now.
>>
>> Signed-off-by: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
> Tested-by: Henry Wang <Henry.Wang@arm.com>

Thank you!


Please note, I have pushed a new version (which is notably different), 
based on the discussion in the current thread:

https://lore.kernel.org/xen-devel/1631034578-12598-1-git-send-email-olekstysh@gmail.com/



>
> Kind regards,
>
> Henry
>
diff mbox series

Patch

diff --git a/tools/flask/policy/modules/dom0.te b/tools/flask/policy/modules/dom0.te
index 0a63ce1..b40091f 100644
--- a/tools/flask/policy/modules/dom0.te
+++ b/tools/flask/policy/modules/dom0.te
@@ -39,7 +39,7 @@  allow dom0_t dom0_t:domain {
 };
 allow dom0_t dom0_t:domain2 {
 	set_cpu_policy gettsc settsc setscheduler set_vnumainfo
-	get_vnumainfo psr_cmt_op psr_alloc get_cpu_policy
+	get_vnumainfo psr_cmt_op psr_alloc get_cpu_policy get_unallocated_space
 };
 allow dom0_t dom0_t:resource { add remove };
 
diff --git a/xen/arch/arm/mm.c b/xen/arch/arm/mm.c
index 0e07335..8a70fe0 100644
--- a/xen/arch/arm/mm.c
+++ b/xen/arch/arm/mm.c
@@ -1635,6 +1635,24 @@  unsigned long get_upper_mfn_bound(void)
     return max_page - 1;
 }
 
+#define GPFN_ALIGN_TO_GB(x) (((x) + (1UL << 18) - 1) & (~((1UL << 18) - 1)))
+
+int arch_get_unallocated_space(struct domain *d,
+                               struct xen_unallocated_region *regions,
+                               unsigned int *nr_regions)
+{
+    /*
+     * Everything not mapped into P2M could be treated as unused. Taking into
+     * account that we have a lot of unallocated space above max_mapped_gfn
+     * and in order to simplify things, just provide a single 8GB memory
+     * region aligned to 1GB boundary for now.
+     */
+    regions->start_gpfn = GPFN_ALIGN_TO_GB(domain_get_maximum_gpfn(d) + 1);
+    regions->nr_gpfns = (1UL << 18) * 8;
+    *nr_regions = 1;
+
+    return 0;
+}
 /*
  * Local variables:
  * mode: C
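
For reference, a small stand-alone sketch of the arithmetic behind GPFN_ALIGN_TO_GB() and the single 8GB region above; it is not part of the patch and assumes a 4 KiB page granularity and a 64-bit build, so 1 GiB corresponds to 1UL << 18 guest frames:

#include <assert.h>

/* Same rounding as GPFN_ALIGN_TO_GB() in the hunk above. */
#define GFNS_PER_GB     (1UL << 18)          /* (1 << 30) bytes / (1 << 12) bytes per frame */
#define ALIGN_TO_GB(x)  (((x) + GFNS_PER_GB - 1) & ~(GFNS_PER_GB - 1))

int main(void)
{
    /* 2^18 frames of 4 KiB each cover exactly 1 GiB... */
    assert(GFNS_PER_GB * 4096UL == (1UL << 30));
    /* ...so nr_gpfns = (1UL << 18) * 8 covers 8 GiB. */
    assert(GFNS_PER_GB * 8 * 4096UL == (8UL << 30));
    /* A max GFN just past a 1 GiB boundary is rounded up to the next one. */
    assert(ALIGN_TO_GB(0x40001UL) == 0x80000UL);

    return 0;
}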
diff --git a/xen/common/memory.c b/xen/common/memory.c
index e07bd9a..3f9b816 100644
--- a/xen/common/memory.c
+++ b/xen/common/memory.c
@@ -1811,6 +1811,62 @@  long do_memory_op(unsigned long cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
             start_extent);
         break;
 
+    case XENMEM_get_unallocated_space:
+    {
+        struct xen_get_unallocated_space xgus;
+        struct xen_unallocated_region *regions;
+
+        if ( unlikely(start_extent) )
+            return -EINVAL;
+
+        if ( copy_from_guest(&xgus, arg, 1) ||
+             !guest_handle_okay(xgus.buffer, xgus.nr_regions) )
+            return -EFAULT;
+
+        d = rcu_lock_domain_by_any_id(xgus.domid);
+        if ( d == NULL )
+            return -ESRCH;
+
+        rc = xsm_get_unallocated_space(XSM_HOOK, d);
+        if ( rc )
+        {
+            rcu_unlock_domain(d);
+            return rc;
+        }
+
+        if ( !xgus.nr_regions || xgus.nr_regions > XEN_MAX_UNALLOCATED_REGIONS )
+        {
+            rcu_unlock_domain(d);
+            return -EINVAL;
+        }
+
+        regions = xzalloc_array(struct xen_unallocated_region, xgus.nr_regions);
+        if ( !regions )
+        {
+            rcu_unlock_domain(d);
+            return -ENOMEM;
+        }
+
+        rc = arch_get_unallocated_space(d, regions, &xgus.nr_regions);
+        if ( rc )
+            goto unallocated_out;
+
+        if ( __copy_to_guest(xgus.buffer, regions, xgus.nr_regions) )
+        {
+            rc = -EFAULT;
+            goto unallocated_out;
+        }
+
+        if ( __copy_to_guest(arg, &xgus, 1) )
+            rc = -EFAULT;
+
+unallocated_out:
+        rcu_unlock_domain(d);
+        xfree(regions);
+
+        break;
+    }
+
     default:
         rc = arch_memory_op(cmd, arg);
         break;
diff --git a/xen/include/asm-arm/mm.h b/xen/include/asm-arm/mm.h
index ded74d2..99a2824 100644
--- a/xen/include/asm-arm/mm.h
+++ b/xen/include/asm-arm/mm.h
@@ -359,6 +359,10 @@  void clear_and_clean_page(struct page_info *page);
 
 unsigned int arch_get_dma_bitsize(void);
 
+int arch_get_unallocated_space(struct domain *d,
+                               struct xen_unallocated_region *regions,
+                               unsigned int *nr_regions);
+
 #endif /*  __ARCH_ARM_MM__ */
 /*
  * Local variables:
diff --git a/xen/include/asm-x86/mm.h b/xen/include/asm-x86/mm.h
index cb90527..6244bc4 100644
--- a/xen/include/asm-x86/mm.h
+++ b/xen/include/asm-x86/mm.h
@@ -652,4 +652,12 @@  static inline bool arch_mfn_in_directmap(unsigned long mfn)
     return mfn <= (virt_to_mfn(eva - 1) + 1);
 }
 
+static inline
+int arch_get_unallocated_space(struct domain *d,
+                               struct xen_unallocated_region *regions,
+                               unsigned int *nr_regions)
+{
+    return -EOPNOTSUPP;
+}
+
 #endif /* __ASM_X86_MM_H__ */
diff --git a/xen/include/public/memory.h b/xen/include/public/memory.h
index 383a946..040201b 100644
--- a/xen/include/public/memory.h
+++ b/xen/include/public/memory.h
@@ -739,7 +739,42 @@  struct xen_vnuma_topology_info {
 typedef struct xen_vnuma_topology_info xen_vnuma_topology_info_t;
 DEFINE_XEN_GUEST_HANDLE(xen_vnuma_topology_info_t);
 
-/* Next available subop number is 29 */
+/*
+ * Get the unallocated space (regions of guest physical address space which
+ * are unused and can be used to create grant/foreign mappings).
+ */
+#define XENMEM_get_unallocated_space 29
+struct xen_unallocated_region {
+    xen_pfn_t start_gpfn;
+    xen_ulong_t nr_gpfns;
+};
+typedef struct xen_unallocated_region xen_unallocated_region_t;
+DEFINE_XEN_GUEST_HANDLE(xen_unallocated_region_t);
+
+#define XEN_MAX_UNALLOCATED_REGIONS 32
+
+struct xen_get_unallocated_space {
+    /* IN - Which domain to provide unallocated space for */
+    domid_t domid;
+
+    /*
+     * IN/OUT - As an IN parameter, the number of memory regions which
+     *          can be written to the buffer (i.e. the maximum size of the array).
+     *          As an OUT parameter, the number of memory regions which
+     *          have been written to the buffer.
+     */
+    unsigned int nr_regions;
+
+    /*
+     * OUT - An array of memory regions; the regions must be placed in
+     *       ascending order and there must be no overlap between them.
+     */
+    XEN_GUEST_HANDLE(xen_unallocated_region_t) buffer;
+};
+typedef struct xen_get_unallocated_space xen_get_unallocated_space_t;
+DEFINE_XEN_GUEST_HANDLE(xen_get_unallocated_space_t);
+
+/* Next available subop number is 30 */
 
 #endif /* __XEN_PUBLIC_MEMORY_H__ */
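
To illustrate how the interface above is meant to be consumed, a minimal guest-side sketch follows; it is not part of the patch (the actual Linux-side changes live in the branch referenced in the cover letter), and it assumes the definitions above are mirrored into the kernel's copy of the public headers together with the usual Linux helpers HYPERVISOR_memory_op(), set_xen_guest_handle() and DOMID_SELF:

#include <linux/kernel.h>
#include <xen/interface/memory.h>
#include <asm/xen/hypercall.h>

static int get_unallocated_space_example(void)
{
    struct xen_unallocated_region regions[XEN_MAX_UNALLOCATED_REGIONS];
    struct xen_get_unallocated_space xgus = {
        .domid = DOMID_SELF,
        .nr_regions = ARRAY_SIZE(regions),  /* IN: capacity of the buffer */
    };
    unsigned int i;
    int rc;

    set_xen_guest_handle(xgus.buffer, regions);

    rc = HYPERVISOR_memory_op(XENMEM_get_unallocated_space, &xgus);
    if (rc)
        return rc;

    /* OUT: xgus.nr_regions now holds how many entries were written. */
    for (i = 0; i < xgus.nr_regions; i++)
        pr_info("unallocated: start_gpfn 0x%lx, nr_gpfns 0x%lx\n",
                (unsigned long)regions[i].start_gpfn,
                (unsigned long)regions[i].nr_gpfns);

    return 0;
}

Per the IN/OUT contract of nr_regions, only the first xgus.nr_regions entries of the buffer are valid on return.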
 
diff --git a/xen/include/xsm/dummy.h b/xen/include/xsm/dummy.h
index 363c6d7..2761fbd 100644
--- a/xen/include/xsm/dummy.h
+++ b/xen/include/xsm/dummy.h
@@ -772,3 +772,9 @@  static XSM_INLINE int xsm_domain_resource_map(XSM_DEFAULT_ARG struct domain *d)
     XSM_ASSERT_ACTION(XSM_DM_PRIV);
     return xsm_default_action(action, current->domain, d);
 }
+
+static XSM_INLINE int xsm_get_unallocated_space(XSM_DEFAULT_ARG struct domain *d)
+{
+    XSM_ASSERT_ACTION(XSM_HOOK);
+    return xsm_default_action(action, current->domain, d);
+}
diff --git a/xen/include/xsm/xsm.h b/xen/include/xsm/xsm.h
index ad3cddb..b01619c 100644
--- a/xen/include/xsm/xsm.h
+++ b/xen/include/xsm/xsm.h
@@ -180,6 +180,7 @@  struct xsm_operations {
     int (*dm_op) (struct domain *d);
     int (*xen_version) (uint32_t cmd);
     int (*domain_resource_map) (struct domain *d);
+    int (*get_unallocated_space) (struct domain *d);
 #ifdef CONFIG_ARGO
     int (*argo_enable) (const struct domain *d);
     int (*argo_register_single_source) (const struct domain *d,
@@ -701,6 +702,11 @@  static inline int xsm_domain_resource_map(xsm_default_t def, struct domain *d)
     return xsm_ops->domain_resource_map(d);
 }
 
+static inline int xsm_get_unallocated_space(xsm_default_t def, struct domain *d)
+{
+    return xsm_ops->get_unallocated_space(d);
+}
+
 #ifdef CONFIG_ARGO
 static inline int xsm_argo_enable(const struct domain *d)
 {
diff --git a/xen/xsm/dummy.c b/xen/xsm/dummy.c
index de44b10..45efadb 100644
--- a/xen/xsm/dummy.c
+++ b/xen/xsm/dummy.c
@@ -151,6 +151,7 @@  void __init xsm_fixup_ops (struct xsm_operations *ops)
     set_to_dummy_if_null(ops, dm_op);
     set_to_dummy_if_null(ops, xen_version);
     set_to_dummy_if_null(ops, domain_resource_map);
+    set_to_dummy_if_null(ops, get_unallocated_space);
 #ifdef CONFIG_ARGO
     set_to_dummy_if_null(ops, argo_enable);
     set_to_dummy_if_null(ops, argo_register_single_source);
diff --git a/xen/xsm/flask/hooks.c b/xen/xsm/flask/hooks.c
index f1a1217..38a9b20 100644
--- a/xen/xsm/flask/hooks.c
+++ b/xen/xsm/flask/hooks.c
@@ -1715,6 +1715,11 @@  static int flask_domain_resource_map(struct domain *d)
     return current_has_perm(d, SECCLASS_DOMAIN2, DOMAIN2__RESOURCE_MAP);
 }
 
+static int flask_get_unallocated_space(struct domain *d)
+{
+    return current_has_perm(d, SECCLASS_DOMAIN2, DOMAIN2__GET_UNALLOCATED_SPACE);
+}
+
 #ifdef CONFIG_ARGO
 static int flask_argo_enable(const struct domain *d)
 {
@@ -1875,6 +1880,7 @@  static struct xsm_operations flask_ops = {
     .dm_op = flask_dm_op,
     .xen_version = flask_xen_version,
     .domain_resource_map = flask_domain_resource_map,
+    .get_unallocated_space = flask_get_unallocated_space,
 #ifdef CONFIG_ARGO
     .argo_enable = flask_argo_enable,
     .argo_register_single_source = flask_argo_register_single_source,
diff --git a/xen/xsm/flask/policy/access_vectors b/xen/xsm/flask/policy/access_vectors
index 6359c7f..3cbdc19 100644
--- a/xen/xsm/flask/policy/access_vectors
+++ b/xen/xsm/flask/policy/access_vectors
@@ -245,6 +245,8 @@  class domain2
     resource_map
 # XEN_DOMCTL_get_cpu_policy
     get_cpu_policy
+# XENMEM_get_unallocated_space
+    get_unallocated_space
 }
 
 # Similar to class domain, but primarily contains domctls related to HVM domains