
[v3,3/3] tools: introduce parameter max_wp_ram_ranges.

Message ID 1454064314-7799-4-git-send-email-yu.c.zhang@linux.intel.com (mailing list archive)
State New, archived

Commit Message

Yu Zhang Jan. 29, 2016, 10:45 a.m. UTC
A new parameter - max_wp_ram_ranges - is added to set the upper limit
on the number of write-protected ram ranges to be tracked inside one
ioreq server rangeset.

An ioreq server uses a group of rangesets to track the I/O or memory
resources to be emulated. The default limit on the number of ranges that
one rangeset can allocate is set to a small value, because these ranges
are allocated from the Xen heap. Yet for the write-protected ram ranges,
there are circumstances under which the upper limit inside one rangeset
should exceed the default one. E.g. in XenGT, when tracking the PPGTTs
(per-process graphic translation tables) on Intel BDW platforms, the
number of page tables concerned can reach several thousand.

For XenGT running on Intel BDW platforms, 8192 is a suggested value for
this parameter in most cases. But users who set this item explicitly
are also supposed to know the specific scenarios that necessitate this
configuration, especially when this parameter is used:
1> for virtual devices other than vGPUs in XenGT;
2> for XenGT, there might also be some extreme cases, e.g. too many
graphics-related applications in one VM, which create a great deal of
per-process graphic translation tables;
3> for XenGT, future cpu platforms which provide even more per-process
graphic translation tables.

Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
---
 docs/man/xl.cfg.pod.5           | 26 ++++++++++++++++++++++++++
 tools/libxl/libxl.h             |  5 +++++
 tools/libxl/libxl_dom.c         |  3 +++
 tools/libxl/libxl_types.idl     |  1 +
 tools/libxl/xl_cmdimpl.c        | 17 +++++++++++++++++
 xen/arch/x86/hvm/hvm.c          |  7 ++++++-
 xen/include/public/hvm/params.h |  5 ++++-
 7 files changed, 62 insertions(+), 2 deletions(-)
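
The libxl_dom.c hunk is not quoted anywhere in this thread. Presumably it
forwards the configured value to the hypervisor during domain build; a
hedged sketch of what such a call could look like, using the existing
xc_hvm_param_set() interface and the HVM_PARAM_MAX_WP_RAM_RANGES name
introduced by this series, is:

    /* Hypothetical sketch of the (unquoted) libxl_dom.c change: pass the
     * configured limit down as an HVM param at domain-build time. The
     * 'handle', 'domid' and 'info' names follow the parameters already
     * used by libxl's hvm_set_conf_params(). */
    if (info->u.hvm.max_wp_ram_ranges)
        xc_hvm_param_set(handle, domid, HVM_PARAM_MAX_WP_RAM_RANGES,
                         info->u.hvm.max_wp_ram_ranges);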

Comments

Jan Beulich Jan. 29, 2016, 4:33 p.m. UTC | #1
>>> On 29.01.16 at 11:45, <yu.c.zhang@linux.intel.com> wrote:
> --- a/xen/arch/x86/hvm/hvm.c
> +++ b/xen/arch/x86/hvm/hvm.c
> @@ -940,6 +940,8 @@ static int hvm_ioreq_server_alloc_rangesets(struct hvm_ioreq_server *s,
>  {
>      unsigned int i;
>      int rc;
> +    unsigned int max_wp_ram_ranges =
> +        s->domain->arch.hvm_domain.params[HVM_PARAM_MAX_WP_RAM_RANGES];

You're still losing the upper 32 bits here. Iirc you agreed to range
check the value before storing into params[]...

Jan
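
For context, the truncation Jan refers to stems from params[] holding
uint64_t values while the consumer reads them into an unsigned int; a
minimal illustration of the quoted hunk with that made explicit:

    /* params[] entries are uint64_t, so this assignment silently drops
     * bits 32-63 unless the value was range checked when it was stored;
     * e.g. a stored value of 0x100000001 reads back as 1 here. */
    unsigned int max_wp_ram_ranges =
        s->domain->arch.hvm_domain.params[HVM_PARAM_MAX_WP_RAM_RANGES];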
Yu Zhang Jan. 30, 2016, 2:38 p.m. UTC | #2
On 1/30/2016 12:33 AM, Jan Beulich wrote:
>>>> On 29.01.16 at 11:45, <yu.c.zhang@linux.intel.com> wrote:
>> --- a/xen/arch/x86/hvm/hvm.c
>> +++ b/xen/arch/x86/hvm/hvm.c
>> @@ -940,6 +940,8 @@ static int hvm_ioreq_server_alloc_rangesets(struct hvm_ioreq_server *s,
>>   {
>>       unsigned int i;
>>       int rc;
>> +    unsigned int max_wp_ram_ranges =
>> +        s->domain->arch.hvm_domain.params[HVM_PARAM_MAX_WP_RAM_RANGES];
>
> You're still losing the upper 32 bits here. Iirc you agreed to range
> check the value before storing into params[]...
>
> Jan
>
>

Thanks, Jan. :)
In this version, the check is added in routine parse_config_data().
If option 'max_wp_ram_ranges' is configured with an unreasonable value,
xl will terminate before calling xc_hvm_param_set(). Does this
change meet your requirement? Or have I misunderstood something
about this issue?


B.R.
Yu
Jan Beulich Feb. 1, 2016, 7:52 a.m. UTC | #3
>>> On 30.01.16 at 15:38, <yu.c.zhang@linux.intel.com> wrote:

> On 1/30/2016 12:33 AM, Jan Beulich wrote:
>>>>> On 29.01.16 at 11:45, <yu.c.zhang@linux.intel.com> wrote:
>>> --- a/xen/arch/x86/hvm/hvm.c
>>> +++ b/xen/arch/x86/hvm/hvm.c
>>> @@ -940,6 +940,8 @@ static int hvm_ioreq_server_alloc_rangesets(struct hvm_ioreq_server *s,
>>>   {
>>>       unsigned int i;
>>>       int rc;
>>> +    unsigned int max_wp_ram_ranges =
>>> +        s->domain->arch.hvm_domain.params[HVM_PARAM_MAX_WP_RAM_RANGES];
>>
>> You're still losing the upper 32 bits here. Iirc you agreed to range
>> check the value before storing into params[]...
> 
> Thanks, Jan. :)
> In this version, the check is added in routine parse_config_data().
> If option 'max_wp_ram_ranges' is configured with an unreasonable value,
> the xl will terminate, before calling xc_hvm_param_set(). Does this
> change meet your requirement? Or maybe did I have some misunderstanding
> on this issue?

Checking in the tools is desirable, but the hypervisor shouldn't rely
on any tool side checking.

Jan
Wei Liu Feb. 1, 2016, 11:57 a.m. UTC | #4
On Fri, Jan 29, 2016 at 06:45:14PM +0800, Yu Zhang wrote:
> diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c
> index 25507c7..0c19dee 100644
> --- a/tools/libxl/xl_cmdimpl.c
> +++ b/tools/libxl/xl_cmdimpl.c
> @@ -35,6 +35,7 @@
>  #include <inttypes.h>
>  #include <limits.h>
>  #include <xen/hvm/e820.h>
> +#include <xenctrl.h>
>  
>  #include "libxl.h"
>  #include "libxl_utils.h"
> @@ -1626,6 +1627,22 @@ static void parse_config_data(const char *config_source,
>  
>          if (!xlu_cfg_get_long (config, "rdm_mem_boundary", &l, 0))
>              b_info->u.hvm.rdm_mem_boundary_memkb = l * 1024;
> +
> +        if (!xlu_cfg_get_long (config, "max_wp_ram_ranges", &l, 0)) {
> +            uint64_t nr_pages = (b_info->max_memkb << 10) >> XC_PAGE_SHIFT;
> +
> +            /* Due to rangeset's ability to combine continuous ranges, this
> +             * parameter shall not be configured with values greater than half
> +             * of the number of VM's page frames. It also shall not exceed 4G,
> +             * because of the limitation from the rangeset side. */
> +            if (l > (nr_pages / 2) || l > UINT32_MAX) {
> +                fprintf(stderr, "ERROR: Invalid value for \"max_wp_ram_ranges\". "
> +                        "Shall not exceed %ld or 4G.\n", nr_pages / 2);
> +                exit(1);
> +            }
> +            b_info->u.hvm.max_wp_ram_ranges = l;
> +        }
> +

Xl is only one of the applications that use libxl (the library).  This
check should be inside libxl so that all applications (xl, libvirt and
others) have the same behaviour.

Take a look at initiate_domain_create where numerous validations are
done.

Wei.
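
For illustration, moving the check into libxl as Wei suggests could look
roughly like the sketch below, placed alongside the other validations in
initiate_domain_create() in libxl_create.c. This is only a sketch; it
reuses the bounds from the xl hunk quoted above, and the use of LOG(),
'ret' and the 'error_out' label assumes the conventions of that function.

    /* Hypothetical libxl-side validation, mirroring the xl check so that
     * every libxl client (xl, libvirt, ...) gets the same behaviour. */
    if (d_config->c_info.type == LIBXL_DOMAIN_TYPE_HVM) {
        uint64_t nr_pages =
            (d_config->b_info.max_memkb << 10) >> XC_PAGE_SHIFT;
        uint64_t limit = d_config->b_info.u.hvm.max_wp_ram_ranges;

        /* No more than half of the guest's page frames, and the value
         * must fit in 32 bits (rangeset limitation). */
        if (limit > nr_pages / 2 || limit > UINT32_MAX) {
            LOG(ERROR, "invalid max_wp_ram_ranges %"PRIu64
                " (must not exceed %"PRIu64" or 4G)", limit, nr_pages / 2);
            ret = ERROR_INVAL;
            goto error_out;
        }
    }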
Wei Liu Feb. 1, 2016, 12:02 p.m. UTC | #5
On Mon, Feb 01, 2016 at 12:52:51AM -0700, Jan Beulich wrote:
> >>> On 30.01.16 at 15:38, <yu.c.zhang@linux.intel.com> wrote:
> 
> > On 1/30/2016 12:33 AM, Jan Beulich wrote:
> >>>>> On 29.01.16 at 11:45, <yu.c.zhang@linux.intel.com> wrote:
> >>> --- a/xen/arch/x86/hvm/hvm.c
> >>> +++ b/xen/arch/x86/hvm/hvm.c
> >>> @@ -940,6 +940,8 @@ static int hvm_ioreq_server_alloc_rangesets(struct hvm_ioreq_server *s,
> >>>   {
> >>>       unsigned int i;
> >>>       int rc;
> >>> +    unsigned int max_wp_ram_ranges =
> >>> +        s->domain->arch.hvm_domain.params[HVM_PARAM_MAX_WP_RAM_RANGES];
> >>
> >> You're still losing the upper 32 bits here. Iirc you agreed to range
> >> check the value before storing into params[]...
> > 
> > Thanks, Jan. :)
> > In this version, the check is added in routine parse_config_data().
> > If option 'max_wp_ram_ranges' is configured with an unreasonable value,
> > the xl will terminate, before calling xc_hvm_param_set(). Does this
> > change meet your requirement? Or maybe did I have some misunderstanding
> > on this issue?
> 
> Checking in the tools is desirable, but the hypervisor shouldn't rely
> on any tool side checking.
> 

As in hypervisor needs to sanitise all input from toolstack? I don't
think Xen does that today.

What is the difference between this particular configuration option and
all other options in the same hvm_set_conf_params function?

Wei.

> Jan
>
Jan Beulich Feb. 1, 2016, 12:15 p.m. UTC | #6
>>> On 01.02.16 at 13:02, <wei.liu2@citrix.com> wrote:
> On Mon, Feb 01, 2016 at 12:52:51AM -0700, Jan Beulich wrote:
>> >>> On 30.01.16 at 15:38, <yu.c.zhang@linux.intel.com> wrote:
>> 
>> > On 1/30/2016 12:33 AM, Jan Beulich wrote:
>> >>>>> On 29.01.16 at 11:45, <yu.c.zhang@linux.intel.com> wrote:
>> >>> --- a/xen/arch/x86/hvm/hvm.c
>> >>> +++ b/xen/arch/x86/hvm/hvm.c
>> >>> @@ -940,6 +940,8 @@ static int hvm_ioreq_server_alloc_rangesets(struct 
> hvm_ioreq_server *s,
>> >>>   {
>> >>>       unsigned int i;
>> >>>       int rc;
>> >>> +    unsigned int max_wp_ram_ranges =
>> >>> +        s->domain->arch.hvm_domain.params[HVM_PARAM_MAX_WP_RAM_RANGES];
>> >>
>> >> You're still losing the upper 32 bits here. Iirc you agreed to range
>> >> check the value before storing into params[]...
>> > 
>> > Thanks, Jan. :)
>> > In this version, the check is added in routine parse_config_data().
>> > If option 'max_wp_ram_ranges' is configured with an unreasonable value,
>> > the xl will terminate, before calling xc_hvm_param_set(). Does this
>> > change meet your requirement? Or maybe did I have some misunderstanding
>> > on this issue?
>> 
>> Checking in the tools is desirable, but the hypervisor shouldn't rely
>> on any tool side checking.
>> 
> 
> As in hypervisor needs to sanitise all input from toolstack? I don't
> think Xen does that today.

If it doesn't, then that's a bug. Note that in many cases (domctl-s
and alike) such bogus trusting in the tool stack behaving correctly
is only not a security issue due to XSA-77. Yet with XSA-77 we
made quite clear that we shouldn't knowingly allow in further such
issues (it'll be hard enough to find and address all existing ones).

Jan
Wei Liu Feb. 1, 2016, 12:49 p.m. UTC | #7
On Mon, Feb 01, 2016 at 05:15:16AM -0700, Jan Beulich wrote:
> >>> On 01.02.16 at 13:02, <wei.liu2@citrix.com> wrote:
> > On Mon, Feb 01, 2016 at 12:52:51AM -0700, Jan Beulich wrote:
> >> >>> On 30.01.16 at 15:38, <yu.c.zhang@linux.intel.com> wrote:
> >> 
> >> > On 1/30/2016 12:33 AM, Jan Beulich wrote:
> >> >>>>> On 29.01.16 at 11:45, <yu.c.zhang@linux.intel.com> wrote:
> >> >>> --- a/xen/arch/x86/hvm/hvm.c
> >> >>> +++ b/xen/arch/x86/hvm/hvm.c
> >> >>> @@ -940,6 +940,8 @@ static int hvm_ioreq_server_alloc_rangesets(struct 
> > hvm_ioreq_server *s,
> >> >>>   {
> >> >>>       unsigned int i;
> >> >>>       int rc;
> >> >>> +    unsigned int max_wp_ram_ranges =
> >> >>> +        s->domain->arch.hvm_domain.params[HVM_PARAM_MAX_WP_RAM_RANGES];
> >> >>
> >> >> You're still losing the upper 32 bits here. Iirc you agreed to range
> >> >> check the value before storing into params[]...
> >> > 
> >> > Thanks, Jan. :)
> >> > In this version, the check is added in routine parse_config_data().
> >> > If option 'max_wp_ram_ranges' is configured with an unreasonable value,
> >> > the xl will terminate, before calling xc_hvm_param_set(). Does this
> >> > change meet your requirement? Or maybe did I have some misunderstanding
> >> > on this issue?
> >> 
> >> Checking in the tools is desirable, but the hypervisor shouldn't rely
> >> on any tool side checking.
> >> 
> > 
> > As in hypervisor needs to sanitise all input from toolstack? I don't
> > think Xen does that today.
> 
> If it doesn't, then that's a bug. Note that in many cases (domctl-s
> and alike) such bogus trusting in the tool stack behaving correctly
> is only not a security issue due to XSA-77. Yet with XSA-77 we
> made quite clear that we shouldn't knowingly allow in further such
> issues (it'll be hard enough to find and address all existing ones).
> 

So are you suggesting pulling the check done in toolstack into
hypervisor?

Wei.

> Jan
>
Jan Beulich Feb. 1, 2016, 1:07 p.m. UTC | #8
>>> On 01.02.16 at 13:49, <wei.liu2@citrix.com> wrote:
> On Mon, Feb 01, 2016 at 05:15:16AM -0700, Jan Beulich wrote:
>> >>> On 01.02.16 at 13:02, <wei.liu2@citrix.com> wrote:
>> > On Mon, Feb 01, 2016 at 12:52:51AM -0700, Jan Beulich wrote:
>> >> >>> On 30.01.16 at 15:38, <yu.c.zhang@linux.intel.com> wrote:
>> >> 
>> >> > On 1/30/2016 12:33 AM, Jan Beulich wrote:
>> >> >>>>> On 29.01.16 at 11:45, <yu.c.zhang@linux.intel.com> wrote:
>> >> >>> --- a/xen/arch/x86/hvm/hvm.c
>> >> >>> +++ b/xen/arch/x86/hvm/hvm.c
>> >> >>> @@ -940,6 +940,8 @@ static int hvm_ioreq_server_alloc_rangesets(struct 
>> > hvm_ioreq_server *s,
>> >> >>>   {
>> >> >>>       unsigned int i;
>> >> >>>       int rc;
>> >> >>> +    unsigned int max_wp_ram_ranges =
>> >> >>> +        s->domain->arch.hvm_domain.params[HVM_PARAM_MAX_WP_RAM_RANGES];
>> >> >>
>> >> >> You're still losing the upper 32 bits here. Iirc you agreed to range
>> >> >> check the value before storing into params[]...
>> >> > 
>> >> > Thanks, Jan. :)
>> >> > In this version, the check is added in routine parse_config_data().
>> >> > If option 'max_wp_ram_ranges' is configured with an unreasonable value,
>> >> > the xl will terminate, before calling xc_hvm_param_set(). Does this
>> >> > change meet your requirement? Or maybe did I have some misunderstanding
>> >> > on this issue?
>> >> 
>> >> Checking in the tools is desirable, but the hypervisor shouldn't rely
>> >> on any tool side checking.
>> >> 
>> > 
>> > As in hypervisor needs to sanitise all input from toolstack? I don't
>> > think Xen does that today.
>> 
>> If it doesn't, then that's a bug. Note that in many cases (domctl-s
>> and alike) such bogus trusting in the tool stack behaving correctly
>> is only not a security issue due to XSA-77. Yet with XSA-77 we
>> made quite clear that we shouldn't knowingly allow in further such
>> issues (it'll be hard enough to find and address all existing ones).
> 
> So are you suggesting pulling the check done in toolstack into
> hypervisor?

I think the check in the tools should stay (allowing for a
distinguishable error message to be issued); all I'm saying is
that doing the check in the tools is not enough.

Jan
Yu Zhang Feb. 1, 2016, 3:14 p.m. UTC | #9
On 2/1/2016 9:07 PM, Jan Beulich wrote:
>>>> On 01.02.16 at 13:49, <wei.liu2@citrix.com> wrote:
>> On Mon, Feb 01, 2016 at 05:15:16AM -0700, Jan Beulich wrote:
>>>>>> On 01.02.16 at 13:02, <wei.liu2@citrix.com> wrote:
>>>> On Mon, Feb 01, 2016 at 12:52:51AM -0700, Jan Beulich wrote:
>>>>>>>> On 30.01.16 at 15:38, <yu.c.zhang@linux.intel.com> wrote:
>>>>>
>>>>>> On 1/30/2016 12:33 AM, Jan Beulich wrote:
>>>>>>>>>> On 29.01.16 at 11:45, <yu.c.zhang@linux.intel.com> wrote:
>>>>>>>> --- a/xen/arch/x86/hvm/hvm.c
>>>>>>>> +++ b/xen/arch/x86/hvm/hvm.c
>>>>>>>> @@ -940,6 +940,8 @@ static int hvm_ioreq_server_alloc_rangesets(struct
>>>> hvm_ioreq_server *s,
>>>>>>>>    {
>>>>>>>>        unsigned int i;
>>>>>>>>        int rc;
>>>>>>>> +    unsigned int max_wp_ram_ranges =
>>>>>>>> +        s->domain->arch.hvm_domain.params[HVM_PARAM_MAX_WP_RAM_RANGES];
>>>>>>>
>>>>>>> You're still losing the upper 32 bits here. Iirc you agreed to range
>>>>>>> check the value before storing into params[]...
>>>>>>
>>>>>> Thanks, Jan. :)
>>>>>> In this version, the check is added in routine parse_config_data().
>>>>>> If option 'max_wp_ram_ranges' is configured with an unreasonable value,
>>>>>> the xl will terminate, before calling xc_hvm_param_set(). Does this
>>>>>> change meet your requirement? Or maybe did I have some misunderstanding
>>>>>> on this issue?
>>>>>
>>>>> Checking in the tools is desirable, but the hypervisor shouldn't rely
>>>>> on any tool side checking.
>>>>>
>>>>
>>>> As in hypervisor needs to sanitise all input from toolstack? I don't
>>>> think Xen does that today.
>>>
>>> If it doesn't, then that's a bug. Note that in many cases (domctl-s
>>> and alike) such bogus trusting in the tool stack behaving correctly
>>> is only not a security issue due to XSA-77. Yet with XSA-77 we
>>> made quite clear that we shouldn't knowingly allow in further such
>>> issues (it'll be hard enough to find and address all existing ones).
>>
>> So are you suggesting pulling the check done in toolstack into
>> hypervisor?
>
> I think the check in the tools should stay (allowing for a
> distinguishable error message to be issued); all I'm saying is
> that doing the check in the tools is not enough.
>
> Jan
>

Thank you Jan and Wei. And sorry for the late response.
But I still do not quite understand. :)
If the tool stack can guarantee the validity of a parameter,
under which circumstances would the hypervisor be threatened?
I'm not familiar with XSA-77, and I'll read it ASAP.

B.R.
Yu
Yu Zhang Feb. 1, 2016, 3:15 p.m. UTC | #10
On 2/1/2016 7:57 PM, Wei Liu wrote:
> On Fri, Jan 29, 2016 at 06:45:14PM +0800, Yu Zhang wrote:
>> diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c
>> index 25507c7..0c19dee 100644
>> --- a/tools/libxl/xl_cmdimpl.c
>> +++ b/tools/libxl/xl_cmdimpl.c
>> @@ -35,6 +35,7 @@
>>   #include <inttypes.h>
>>   #include <limits.h>
>>   #include <xen/hvm/e820.h>
>> +#include <xenctrl.h>
>>
>>   #include "libxl.h"
>>   #include "libxl_utils.h"
>> @@ -1626,6 +1627,22 @@ static void parse_config_data(const char *config_source,
>>
>>           if (!xlu_cfg_get_long (config, "rdm_mem_boundary", &l, 0))
>>               b_info->u.hvm.rdm_mem_boundary_memkb = l * 1024;
>> +
>> +        if (!xlu_cfg_get_long (config, "max_wp_ram_ranges", &l, 0)) {
>> +            uint64_t nr_pages = (b_info->max_memkb << 10) >> XC_PAGE_SHIFT;
>> +
>> +            /* Due to rangeset's ability to combine continuous ranges, this
>> +             * parameter shall not be configured with values greater than half
>> +             * of the number of VM's page frames. It also shall not exceed 4G,
>> +             * because of the limitation from the rangeset side. */
>> +            if (l > (nr_pages / 2) || l > UINT32_MAX) {
>> +                fprintf(stderr, "ERROR: Invalid value for \"max_wp_ram_ranges\". "
>> +                        "Shall not exceed %ld or 4G.\n", nr_pages / 2);
>> +                exit(1);
>> +            }
>> +            b_info->u.hvm.max_wp_ram_ranges = l;
>> +        }
>> +
>
> Xl is only one of the applications that use libxl (the library).  This
> check should be inside libxl so that all applications (xl, libvirt and
> others) have the same behaviour.
>
> Take a look at initiate_domain_create where numerous validations are
> done.
>
> Wei.
>

Thank you, Wei. I'll try to move this part into
initiate_domain_create().

B.R.
Yu
Jan Beulich Feb. 1, 2016, 4:16 p.m. UTC | #11
>>> On 01.02.16 at 16:14, <yu.c.zhang@linux.intel.com> wrote:
> But I still do not quite understand. :)
> If tool stack can guarantee the validity of a parameter,
> under which circumstances will hypervisor be threatened?

At least in disaggregated environments the hypervisor cannot
trust the (parts of the) tool stack(s) living outside of Dom0. But
even without disaggregation in mind it is bad practice to have
the hypervisor assume the tool stack will only pass sane values.
Just at the example of the param you're introducing: You don't
even do the validation in libxc, so any (theoretical) tool stack
not based on xl/libxl would not be guaranteed to pass a sane
value. And even if you moved it into libxc, one could still argue
that there could be an even more theoretical tool stack not even
building on top of libxc.

Jan
Yu Zhang Feb. 1, 2016, 4:19 p.m. UTC | #12
On 2/1/2016 11:14 PM, Yu, Zhang wrote:
>
>
> On 2/1/2016 9:07 PM, Jan Beulich wrote:
>>>>> On 01.02.16 at 13:49, <wei.liu2@citrix.com> wrote:
>>> On Mon, Feb 01, 2016 at 05:15:16AM -0700, Jan Beulich wrote:
>>>>>>> On 01.02.16 at 13:02, <wei.liu2@citrix.com> wrote:
>>>>> On Mon, Feb 01, 2016 at 12:52:51AM -0700, Jan Beulich wrote:
>>>>>>>>> On 30.01.16 at 15:38, <yu.c.zhang@linux.intel.com> wrote:
>>>>>>
>>>>>>> On 1/30/2016 12:33 AM, Jan Beulich wrote:
>>>>>>>>>>> On 29.01.16 at 11:45, <yu.c.zhang@linux.intel.com> wrote:
>>>>>>>>> --- a/xen/arch/x86/hvm/hvm.c
>>>>>>>>> +++ b/xen/arch/x86/hvm/hvm.c
>>>>>>>>> @@ -940,6 +940,8 @@ static int
>>>>>>>>> hvm_ioreq_server_alloc_rangesets(struct
>>>>> hvm_ioreq_server *s,
>>>>>>>>>    {
>>>>>>>>>        unsigned int i;
>>>>>>>>>        int rc;
>>>>>>>>> +    unsigned int max_wp_ram_ranges =
>>>>>>>>> +
>>>>>>>>> s->domain->arch.hvm_domain.params[HVM_PARAM_MAX_WP_RAM_RANGES];
>>>>>>>>
>>>>>>>> You're still losing the upper 32 bits here. Iirc you agreed to
>>>>>>>> range
>>>>>>>> check the value before storing into params[]...
>>>>>>>
>>>>>>> Thanks, Jan. :)
>>>>>>> In this version, the check is added in routine parse_config_data().
>>>>>>> If option 'max_wp_ram_ranges' is configured with an unreasonable
>>>>>>> value,
>>>>>>> the xl will terminate, before calling xc_hvm_param_set(). Does this
>>>>>>> change meet your requirement? Or maybe did I have some
>>>>>>> misunderstanding
>>>>>>> on this issue?
>>>>>>
>>>>>> Checking in the tools is desirable, but the hypervisor shouldn't rely
>>>>>> on any tool side checking.
>>>>>>
>>>>>
>>>>> As in hypervisor needs to sanitise all input from toolstack? I don't
>>>>> think Xen does that today.
>>>>
>>>> If it doesn't, then that's a bug. Note that in many cases (domctl-s
>>>> and alike) such bogus trusting in the tool stack behaving correctly
>>>> is only not a security issue due to XSA-77. Yet with XSA-77 we
>>>> made quite clear that we shouldn't knowingly allow in further such
>>>> issues (it'll be hard enough to find and address all existing ones).
>>>
>>> So are you suggesting pulling the check done in toolstack into
>>> hypervisor?
>>
>> I think the check in the tools should stay (allowing for a
>> distinguishable error message to be issued); all I'm saying is
>> that doing the check in the tools is not enough.
>>
>> Jan
>>
>
> Thank you Jan and Wei. And sorry for the late response.
> But I still do not quite understand. :)
> If tool stack can guarantee the validity of a parameter,
> under which circumstances will hypervisor be threatened?
> I'm not familiar with XSA-77, and I'll read it ASAP.
>
> B.R.
> Yu

Sorry to bother you, Jan.
After a second thought, I guess one of the security concerns
is that some application might trigger HVMOP_set_param
directly with some illegal value.
So we also need to validate this param in hvm_allow_set_param;
currently, although hvm_allow_set_param has not performed any
validation for the other parameters, we need to do so for the new
ones. Is this understanding correct?
Another question is: as to the tool stack side, do you think
an error message would suffice? Shouldn't xl be terminated?

Thanks
Yu

Yu Zhang Feb. 1, 2016, 4:33 p.m. UTC | #13
On 2/2/2016 12:16 AM, Jan Beulich wrote:
>>>> On 01.02.16 at 16:14, <yu.c.zhang@linux.intel.com> wrote:
>> But I still do not quite understand. :)
>> If tool stack can guarantee the validity of a parameter,
>> under which circumstances will hypervisor be threatened?
>
> At least in disaggregated environments the hypervisor cannot
> trust the (parts of the) tool stack(s) living outside of Dom0. But
> even without disaggregation in mind it is bad practice to have
> the hypervisor assume the tool stack will only pass sane values.
> Just at the example of the param you're introducing: You don't
> even do the validation in libxc, so any (theoretical) tool stack
> no based on xl/libxl would not be guaranteed to pass a sane
> value. And even if you moved it into libxc, one could still argue
> that there could an even more theoretical tool stack not even
> building on top of libxc.
>
> Jan
>

Great. Thank you very much for your patience in explaining.
I just sent out another mail about my understanding a moment ago;
it seems I partially get it. :)
My VNC connection is too slow; I will change the code tomorrow.

Yu
Jan Beulich Feb. 1, 2016, 4:35 p.m. UTC | #14
>>> On 01.02.16 at 17:19, <yu.c.zhang@linux.intel.com> wrote:
> After a second thought, I guess one of the security concern
> is when some APP is trying to trigger the HVMOP_set_param
> directly with some illegal values.

Not sure what "directly" is supposed to mean here.

> So, we need also validate this param in hvm_allow_set_param,
> current although hvm_allow_set_param has not performed any
> validation other parameters. We need to do this for the new
> ones. Is this understanding correct?

Yes.

> Another question is: as to the tool stack side, do you think
> an error message would suffice? Shouldn't xl be terminated?

I have no idea what consistent behavior in such a case would
be - I'll defer input on this to the tool stack maintainers.

Jan
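
A minimal sketch of the hypervisor-side check being agreed on here -
rejecting an out-of-range value in hvm_allow_set_param() before it is
ever stored into params[] - might look as follows (hedged; the exact
bound and error code are illustrative and not part of the posted patch):

    /* Hypothetical addition to the switch ( a->index ) statement in
     * hvm_allow_set_param() in xen/arch/x86/hvm/hvm.c: refuse values
     * that the unsigned int consumer in
     * hvm_ioreq_server_alloc_rangesets() would otherwise truncate. */
    case HVM_PARAM_MAX_WP_RAM_RANGES:
        if ( a->value > UINT32_MAX )
            rc = -EINVAL;
        break;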
Yu Zhang Feb. 1, 2016, 4:37 p.m. UTC | #15
On 2/2/2016 12:35 AM, Jan Beulich wrote:
>>>> On 01.02.16 at 17:19, <yu.c.zhang@linux.intel.com> wrote:
>> After a second thought, I guess one of the security concern
>> is when some APP is trying to trigger the HVMOP_set_param
>> directly with some illegal values.
>
> Not sure what "directly" is supposed to mean here.
>
I mean with no validation by itself, like libxc...

>> So, we need also validate this param in hvm_allow_set_param,
>> current although hvm_allow_set_param has not performed any
>> validation other parameters. We need to do this for the new
>> ones. Is this understanding correct?
>
> Yes.
>
>> Another question is: as to the tool stack side, do you think
>> an error message would suffice? Shouldn't xl be terminated?
>
> I have no idea what consistent behavior in such a case would
> be - I'll defer input on this to the tool stack maintainers.
>

Thank you.
Wei, which one do you prefer?

Yu
Ian Jackson Feb. 1, 2016, 5:05 p.m. UTC | #16
Yu, Zhang writes ("Re: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter max_wp_ram_ranges."):
> On 2/2/2016 12:35 AM, Jan Beulich wrote:
>> On 01.02.16 at 17:19, <yu.c.zhang@linux.intel.com> wrote:
> >> So, we need also validate this param in hvm_allow_set_param,
> >> current although hvm_allow_set_param has not performed any
> >> validation other parameters. We need to do this for the new
> >> ones. Is this understanding correct?
> >
> > Yes.
> >
> >> Another question is: as to the tool stack side, do you think
> >> an error message would suffice? Shouldn't xl be terminated?
> >
> > I have no idea what consistent behavior in such a case would
> > be - I'll defer input on this to the tool stack maintainers.
> 
> Thank you.
> Wei, which one do you prefer?

I think that arrangements should be made for the hypercall failure to
be properly reported to the caller, and properly logged.

I don't think it is desirable to duplicate the sanity check in
xl/libxl/libxc.  That would simply result in there being two limits to
update.

I have to say, though, that the situation with this parameter seems
quite unsatisfactory.  It seems to be a kind of bodge.

The changeable limit is there to prevent excessive resource usage by a
guest.  But the docs suggest that the excessive usage might be
normal.  That sounds like a suboptimal design to me.

For reference, here is the docs proposed in this patch:

  =item B<max_wp_ram_ranges=N>

  Limit the maximum write-protected ram ranges that can be tracked
  inside one ioreq server rangeset.

  Ioreq server uses a group of rangesets to track the I/O or memory
  resources to be emulated. Default limit of ranges that one rangeset
  can allocate is set to a small value, due to the fact that these
  ranges are allocated in xen heap. Yet for the write-protected ram
  ranges, there are circumstances under which the upper limit inside
  one rangeset should exceed the default one. E.g. in Intel GVT-g,
  when tracking the PPGTT(per-process graphic translation tables) on
  Intel broadwell platforms, the number of page tables concerned will
  be of several thousand.

  For Intel GVT-g broadwell platform, 8192 is a suggested value for
  this parameter in most cases. But users who set this item explicitly
  are also supposed to know the specific scenarios that necessitate
  this configuration. Especially when this parameter is used:
  1> for virtual devices other than vGPU in GVT-g;
  2> for GVT-g, there also might be some extreme cases, e.g. too many
  graphic related applications in one VM, which create a great deal of
  per-process graphic translation tables;
  3> for GVT-g, future cpu platforms which provide even more per-process
  graphic translation tables.

Having said that, if the hypervisor maintainers are happy with a
situation where this value is configured explicitly, and the
configurations where a non-default value is required is expected to be
rare, then I guess we can live with it.

Ian.
Yu Zhang Feb. 2, 2016, 8:04 a.m. UTC | #17
Thanks for your reply, Ian.

On 2/2/2016 1:05 AM, Ian Jackson wrote:
> Yu, Zhang writes ("Re: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter max_wp_ram_ranges."):
>> On 2/2/2016 12:35 AM, Jan Beulich wrote:
>>> On 01.02.16 at 17:19, <yu.c.zhang@linux.intel.com> wrote:
>>>> So, we need also validate this param in hvm_allow_set_param,
>>>> current although hvm_allow_set_param has not performed any
>>>> validation other parameters. We need to do this for the new
>>>> ones. Is this understanding correct?
>>>
>>> Yes.
>>>
>>>> Another question is: as to the tool stack side, do you think
>>>> an error message would suffice? Shouldn't xl be terminated?
>>>
>>> I have no idea what consistent behavior in such a case would
>>> be - I'll defer input on this to the tool stack maintainers.
>>
>> Thank you.
>> Wei, which one do you prefer?
>
> I think that arrangements should be made for the hypercall failure to
> be properly reported to the caller, and properly logged.
>
> I don't think it is desirable to duplicate the sanity check in
> xl/libxl/libxc.  That would simply result in there being two limits to
> update.
>

Sorry, I do not follow. What does "being two limits to update" mean?

> I have to say, though, that the situation with this parameter seems
> quite unsatisfactory.  It seems to be a kind of bodge.
>

By "situation with this parameter", do you mean:
a> the introduction of this parameter in the tool stack, or
b> the sanitizing of this parameter? (In fact I'd prefer not to treat
the check of this parameter as sanitizing, because it only checks
the input against 4G to avoid data loss in the uint64 to uint32
assignment in hvm_ioreq_server_alloc_rangesets.)



> The changeable limit is there to prevent excessive resource usage by a
> guest.  But the docs suggest that the excessive usage might be
> normal.  That sounds like a suboptimal design to me.
>

Yes, there might be situations where this limit is set to some large
value. But I think that situation would be very rare. As the docs
suggest, for XenGT, 8K is big enough for most cases.

>For reference, here is the docs proposed in this patch:
>
>    =item B<max_wp_ram_ranges=N>
>
>    Limit the maximum write-protected ram ranges that can be tracked
>    inside one ioreq server rangeset.
>
>    Ioreq server uses a group of rangesets to track the I/O or memory
>    resources to be emulated. Default limit of ranges that one rangeset
>    can allocate is set to a small value, due to the fact that these
>    ranges are allocated in xen heap. Yet for the write-protected ram
>    ranges, there are circumstances under which the upper limit inside
>    one rangeset should exceed the default one. E.g. in Intel GVT-g,
>    when tracking the PPGTT(per-process graphic translation tables) on
>    Intel broadwell platforms, the number of page tables concerned will
>    be of several thousand.
>
>    For Intel GVT-g broadwell platform, 8192 is a suggested value for
>    this parameter in most cases. But users who set his item explicitly
>    are also supposed to know the specific scenarios that necessitate
>    this configuration. Especially when this parameter is used:
>    1> for virtual devices other than vGPU in GVT-g;
>    2> for GVT-g, there also might be some extreme cases, e.g. too many
>    graphic related applications in one VM, which create a great deal of
>    per-process graphic translation tables;
>    3> for GVT-g, future cpu platforms which provide even more per-process
>    graphic translation tables.
>
> Having said that, if the hypervisor maintainers are happy with a
> situation where this value is configured explicitly, and the
> configurations where a non-default value is required is expected to be
> rare, then I guess we can live with it.
>
> Ian.

Thanks
Yu
Jan Beulich Feb. 2, 2016, 10:32 a.m. UTC | #18
>>> On 01.02.16 at 18:05, <Ian.Jackson@eu.citrix.com> wrote:
> Having said that, if the hypervisor maintainers are happy with a
> situation where this value is configured explicitly, and the
> configurations where a non-default value is required is expected to be
> rare, then I guess we can live with it.

Well, from the very beginning I have been not very happy with
the introduction of this, and I still consider it half way acceptable
only because of not seeing any good alternative. If we look at
it strictly, it's in violation of the rule we set forth after XSA-77:
No introduction of new code making the system susceptible to
bad (malicious) tool stack behavior, and hence we should reject
it. Yet that would leave XenGT in a state where it would have no
perspective of ever getting merged, which doesn't seem very
desirable either.

Jan
Yu Zhang Feb. 2, 2016, 10:56 a.m. UTC | #19
On 2/2/2016 6:32 PM, Jan Beulich wrote:
>>>> On 01.02.16 at 18:05, <Ian.Jackson@eu.citrix.com> wrote:
>> Having said that, if the hypervisor maintainers are happy with a
>> situation where this value is configured explicitly, and the
>> configurations where a non-default value is required is expected to be
>> rare, then I guess we can live with it.
>
> Well, from the very beginning I have been not very happy with
> the introduction of this, and I still consider it half way acceptable
> only because of not seeing any good alternative. If we look at
> it strictly, it's in violation of the rule we set forth after XSA-77:
> No introduction of new code making the system susceptible to
> bad (malicious) tool stack behavior, and hence we should reject
> it. Yet that would leave XenGT in a state where it would have no
> perspective of ever getting merged, which doesn't seem very
> desirable either.
>
> Jan
>

Thanks, Jan.
I understand your concern, and to be honest, I do not think
this is an optimal solution. But I also have no better idea
in mind.  :(
Another option may be: instead of opening this parameter to
the tool stack, we use a XenGT flag, which sets the rangeset
limit to a default value. But like I said, this default value
may not always work on future XenGT platforms.


B.R.
Yu
Jan Beulich Feb. 2, 2016, 11:12 a.m. UTC | #20
>>> On 02.02.16 at 11:56, <yu.c.zhang@linux.intel.com> wrote:
> I understand your concern, and to be honest, I do not think
> this is an optimal solution. But I also have no better idea
> in mind.  :(
> Another option may be: instead of opening this parameter to
> the tool stack, we use a XenGT flag, which set the rangeset
> limit to a default value. But like I said, this default value
> may not always work on future XenGT platforms.

Assuming that you think of something set e.g. by hypervisor
command line option: How would that work? I.e. how would
that limit the resource use for all VMs not using XenGT? Or if
you mean a flag settable in the domain config - how would you
prevent a malicious admin from setting this flag for all the VMs
created in the partition of the system he controls?

Jan
Andrew Cooper Feb. 2, 2016, 11:31 a.m. UTC | #21
On 02/02/16 10:32, Jan Beulich wrote:
>>>> On 01.02.16 at 18:05, <Ian.Jackson@eu.citrix.com> wrote:
>> Having said that, if the hypervisor maintainers are happy with a
>> situation where this value is configured explicitly, and the
>> configurations where a non-default value is required is expected to be
>> rare, then I guess we can live with it.
> Well, from the very beginning I have been not very happy with
> the introduction of this, and I still consider it half way acceptable
> only because of not seeing any good alternative. If we look at
> it strictly, it's in violation of the rule we set forth after XSA-77:
> No introduction of new code making the system susceptible to
> bad (malicious) tool stack behavior

Lets take a step back here.

If your toolstack is malicious, you have already lost.  Coding Xen
around this is a waste of time.

The XSM case is for splitting out some of the privileged domains'
responsibilities to less privileged domains.  In these cases, we do
indeed want to assure that the somewhat-privileged entity cannot abuse
anything outside its area of privilege.

This specific issue concerns resource allocation during domain building
and is an area which can never ever be given to a less privileged entity.

~Andrew
Jan Beulich Feb. 2, 2016, 11:43 a.m. UTC | #22
>>> On 02.02.16 at 12:31, <andrew.cooper3@citrix.com> wrote:
> This specific issue concerns resource allocation during domain building
> and is an area which can never ever be given to a less privileged entity.

Which is because of ...? (And if so, why would we have put
XEN_DOMCTL_createdomain on the XSA-77 waiver list?)

Jan
Wei Liu Feb. 2, 2016, 11:51 a.m. UTC | #23
On Tue, Feb 02, 2016 at 04:04:14PM +0800, Yu, Zhang wrote:
> Thanks for your reply, Ian.
> 
> On 2/2/2016 1:05 AM, Ian Jackson wrote:
> >Yu, Zhang writes ("Re: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter max_wp_ram_ranges."):
> >>On 2/2/2016 12:35 AM, Jan Beulich wrote:
> >>>On 01.02.16 at 17:19, <yu.c.zhang@linux.intel.com> wrote:
> >>>>So, we need also validate this param in hvm_allow_set_param,
> >>>>current although hvm_allow_set_param has not performed any
> >>>>validation other parameters. We need to do this for the new
> >>>>ones. Is this understanding correct?
> >>>
> >>>Yes.
> >>>
> >>>>Another question is: as to the tool stack side, do you think
> >>>>an error message would suffice? Shouldn't xl be terminated?
> >>>
> >>>I have no idea what consistent behavior in such a case would
> >>>be - I'll defer input on this to the tool stack maintainers.
> >>
> >>Thank you.
> >>Wei, which one do you prefer?
> >
> >I think that arrangements should be made for the hypercall failure to
> >be properly reported to the caller, and properly logged.
> >
> >I don't think it is desirable to duplicate the sanity check in
> >xl/libxl/libxc.  That would simply result in there being two limits to
> >update.
> >
> 
> Sorry, I do not follow. What does "being two limits to update" mean?
> 

I can't speak for Ian, but my understanding is that if the code logic is
duplicated in several places, you need to update all of them whenever
you change the logic. But whether this is a blocker for this
series is for him to say, so I will let him clarify.

Wei.
Yu Zhang Feb. 2, 2016, 1:56 p.m. UTC | #24
On 2/2/2016 7:51 PM, Wei Liu wrote:
> On Tue, Feb 02, 2016 at 04:04:14PM +0800, Yu, Zhang wrote:
>> Thanks for your reply, Ian.
>>
>> On 2/2/2016 1:05 AM, Ian Jackson wrote:
>>> Yu, Zhang writes ("Re: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter max_wp_ram_ranges."):
>>>> On 2/2/2016 12:35 AM, Jan Beulich wrote:
>>>>> On 01.02.16 at 17:19, <yu.c.zhang@linux.intel.com> wrote:
>>>>>> So, we need also validate this param in hvm_allow_set_param,
>>>>>> current although hvm_allow_set_param has not performed any
>>>>>> validation other parameters. We need to do this for the new
>>>>>> ones. Is this understanding correct?
>>>>>
>>>>> Yes.
>>>>>
>>>>>> Another question is: as to the tool stack side, do you think
>>>>>> an error message would suffice? Shouldn't xl be terminated?
>>>>>
>>>>> I have no idea what consistent behavior in such a case would
>>>>> be - I'll defer input on this to the tool stack maintainers.
>>>>
>>>> Thank you.
>>>> Wei, which one do you prefer?
>>>
>>> I think that arrangements should be made for the hypercall failure to
>>> be properly reported to the caller, and properly logged.
>>>
>>> I don't think it is desirable to duplicate the sanity check in
>>> xl/libxl/libxc.  That would simply result in there being two limits to
>>> update.
>>>
>>
>> Sorry, I do not follow. What does "being two limits to update" mean?
>>
>
> I can't speak for Ian, but my understanding is that if the code logic is
> duplicated in several places, you need to update all of them whenever
> you change the logic. But, he has expressed if this is blocker for this
> series, so I will let him clarify.

Thank you, Wei. This explanation helps. :)

B.R.
Yu
Yu Zhang Feb. 2, 2016, 2:01 p.m. UTC | #25
On 2/2/2016 7:12 PM, Jan Beulich wrote:
>>>> On 02.02.16 at 11:56, <yu.c.zhang@linux.intel.com> wrote:
>> I understand your concern, and to be honest, I do not think
>> this is an optimal solution. But I also have no better idea
>> in mind.  :(
>> Another option may be: instead of opening this parameter to
>> the tool stack, we use a XenGT flag, which set the rangeset
>> limit to a default value. But like I said, this default value
>> may not always work on future XenGT platforms.
>
> Assuming that you think of something set e.g. by hypervisor
> command line option: How would that work? I.e. how would
> that limit the resource use for all VMs not using XenGT? Or if
> you mean a flag settable in the domain config - how would you
> avoid a malicious admin to set this flag for all the VMs created
> in the controlled partition of the system?
>

Well, I am not satisfied with this new parameter, because:
1> exposing an option like max_wp_ram_ranges to the user seems too
detailed;
2> but if not, using a XenGT flag means it would be hard for the
hypervisor to find a default value which can work in all situations
theoretically, although in practice, 8K is already big enough.

However, I cannot fully understand the security concern you raised. :)
E.g. I believe a malicious admin could also breach the
system even without this patch. This argument may not be convincing to
you, but as to this specific case, even if an admin sets the XenGT flag
for all VMs, what harm will this action do? It only means each ioreq
server can allocate at most 8K ranges; will that consume all of the Xen
heap, especially for 64-bit Xen?

Anyway, despite our different opinions, I still need to say thank you
for your explanation. Upstreaming XenGT features is my task, and it is
painfully rewarding to receive suggestions from community maintainers,
which help a newbie like me better understand virtualization
technology. :)

Thanks
Yu
Andrew Cooper Feb. 2, 2016, 2:20 p.m. UTC | #26
On 02/02/16 11:43, Jan Beulich wrote:
>>>> On 02.02.16 at 12:31, <andrew.cooper3@citrix.com> wrote:
>> This specific issue concerns resource allocation during domain building
>> and is an area which can never ever be given to a less privileged entity.
> Which is because of ...? (And if so, why would we have put
> XEN_DOMCTL_createdomain on the XSA-77 waiver list?)

That list came out of the blue as far as the Xen community went.

The purpose of XEN_DOMCTL_createdomain is to mutate the set of valid
identifiers in Xen on which XSM permissions are based, and any entity
capable of making the hypercall can at the very least cause reuse of an
existing identifier.

For a different example, take XEN_DOMCTL_gdbsx_guestmemio.  This
hypercall specifically permits the caller to change arbitrary memory,
including that of the Xen itself.

Neither of these two operations will ever be safe in the hands of
anything but a fully privileged entity.  Pretending otherwise isn't
going to change this fact.

~Andrew
Jan Beulich Feb. 2, 2016, 2:42 p.m. UTC | #27
>>> On 02.02.16 at 15:01, <yu.c.zhang@linux.intel.com> wrote:
> On 2/2/2016 7:12 PM, Jan Beulich wrote:
>>>>> On 02.02.16 at 11:56, <yu.c.zhang@linux.intel.com> wrote:
>>> I understand your concern, and to be honest, I do not think
>>> this is an optimal solution. But I also have no better idea
>>> in mind.  :(
>>> Another option may be: instead of opening this parameter to
>>> the tool stack, we use a XenGT flag, which set the rangeset
>>> limit to a default value. But like I said, this default value
>>> may not always work on future XenGT platforms.
>>
>> Assuming that you think of something set e.g. by hypervisor
>> command line option: How would that work? I.e. how would
>> that limit the resource use for all VMs not using XenGT? Or if
>> you mean a flag settable in the domain config - how would you
>> avoid a malicious admin to set this flag for all the VMs created
>> in the controlled partition of the system?
> 
> Well, I am not satisfied with this new parameter, because:
> 1> exposing an option like max_wp_ram_ranges to the user seems too
> detailed;
> 2> but if not, using a XenGT flag means it would be hard for hypervisor
> to find a default value which can work in all situations theoretically,
> although in practice, 8K is already a big enough one.
> 
> However, as to the security concern you raised, I can not fully
> understand. :) E.g. I believe a malicious admin can also breach the
> system even without this patch. This argument may not be convincing to 
> you, but as to this specific case, even if an admin set XenGT flag to
> all VMs, what harm will this action do? It only means the ioreq server 
> can at most allocate 8K ranges, will that consume all the Xen heaps, 
> especially for 64 bit Xen?

First of all so far you meant to set a limit of 4G, which - taking a
handful of domains - if fully used would take even a mid-size
host out of memory. And then you have to consider bad effects
resulting from Xen itself not normally having a lot of memory left
(especially when "dom0_mem=" is not forcing most of the memory
to be in Xen's hands), which may mean that one domain
exhausting Xen's memory can affect another domain if Xen can't
allocate memory it needs to support that other domain, in the
worst case leading to a domain crash. And this all is still leaving
aside Xen's own health...

Jan
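
(For a rough sense of scale behind that point: each range is tracked by a
structure of a few tens of bytes - 32 bytes at the time, 40 with the
pending rangeset patch mentioned below - so a per-domain limit of 2^32
ranges would, if fully used, correspond to well over 100 GiB of Xen heap,
and even a single misbehaving domain could exhaust the hypervisor's free
memory.)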
Yu Zhang Feb. 2, 2016, 3 p.m. UTC | #28
On 2/2/2016 10:42 PM, Jan Beulich wrote:
>>>> On 02.02.16 at 15:01, <yu.c.zhang@linux.intel.com> wrote:
>> On 2/2/2016 7:12 PM, Jan Beulich wrote:
>>>>>> On 02.02.16 at 11:56, <yu.c.zhang@linux.intel.com> wrote:
>>>> I understand your concern, and to be honest, I do not think
>>>> this is an optimal solution. But I also have no better idea
>>>> in mind.  :(
>>>> Another option may be: instead of opening this parameter to
>>>> the tool stack, we use a XenGT flag, which set the rangeset
>>>> limit to a default value. But like I said, this default value
>>>> may not always work on future XenGT platforms.
>>>
>>> Assuming that you think of something set e.g. by hypervisor
>>> command line option: How would that work? I.e. how would
>>> that limit the resource use for all VMs not using XenGT? Or if
>>> you mean a flag settable in the domain config - how would you
>>> avoid a malicious admin to set this flag for all the VMs created
>>> in the controlled partition of the system?
>>
>> Well, I am not satisfied with this new parameter, because:
>> 1> exposing an option like max_wp_ram_ranges to the user seems too
>> detailed;
>> 2> but if not, using a XenGT flag means it would be hard for hypervisor
>> to find a default value which can work in all situations theoretically,
>> although in practice, 8K is already a big enough one.
>>
>> However, as to the security concern you raised, I can not fully
>> understand. :) E.g. I believe a malicious admin can also breach the
>> system even without this patch. This argument may not be convincing to
>> you, but as to this specific case, even if an admin set XenGT flag to
>> all VMs, what harm will this action do? It only means the ioreq server
>> can at most allocate 8K ranges, will that consume all the Xen heaps,
>> especially for 64 bit Xen?
>
> First of all so far you meant to set a limit of 4G, which - taking a
> handful of domains - if fully used would take even a mid-size
> host out of memory. And then you have to consider bad effects
> resulting from Xen itself not normally having a lot of memory left
> (especially when "dom0_mem=" is not forcing most of the memory
> to be in Xen's hands), which may mean that one domain
> exhausting Xen's memory can affect another domain if Xen can't
> allocate memory it needs to support that other domain, in the
> worst case leading to a domain crash. And this all is still leaving
> aside Xen's own health...
>

Thanks, Jan.
The 4G limit is to avoid losing data in the uint64 to uint32
assignment. And I can accept the 8K limit for XenGT in practice.
After all, it is vGPU page tables we are trying to trap and emulate,
not normal page frames.

And I guess the reason that one domain exhausting Xen's memory can
affect another domain is that rangeset uses the Xen heap, instead of
per-domain memory. So what about using an 8K limit for XenGT for now,
and in the future, if a per-domain memory allocation solution for
rangeset is ready, we will still need to limit the rangeset size. Does
this sound more acceptable?

B.R.
Yu
Yu Zhang Feb. 2, 2016, 3:19 p.m. UTC | #29
On 2/2/2016 11:21 PM, Jan Beulich wrote:
>>>> On 02.02.16 at 16:00, <yu.c.zhang@linux.intel.com> wrote:
>> The limit of 4G is to avoid the data missing from uint64 to uint32
>> assignment. And I can accept the 8K limit for XenGT in practice.
>> After all, it is vGPU page tables we are trying to trap and emulate,
>> not normal page frames.
>>
>> And I guess the reason that one domain exhausting Xen's memory can
>> affect another domain is because rangeset uses Xen heap, instead of the
>> per-domain memory. So what about we use a 8K limit by now for XenGT,
>> and in the future, if a per-domain memory allocation solution for
>> rangeset is ready, we do need to limit the rangeset size. Does this
>> sounds more acceptable?
>
> The lower the limit the better (but no matter how low the limit
> it won't make this a pretty thing). Anyway I'd still like to wait
> for what Ian may further say on this.
>

OK then. :)
Ian, do you have any suggestions?

Thanks
Yu
Jan Beulich Feb. 2, 2016, 3:21 p.m. UTC | #30
>>> On 02.02.16 at 16:00, <yu.c.zhang@linux.intel.com> wrote:
> The limit of 4G is to avoid the data missing from uint64 to uint32
> assignment. And I can accept the 8K limit for XenGT in practice.
> After all, it is vGPU page tables we are trying to trap and emulate,
> not normal page frames.
> 
> And I guess the reason that one domain exhausting Xen's memory can
> affect another domain is because rangeset uses Xen heap, instead of the
> per-domain memory. So what about we use a 8K limit by now for XenGT,
> and in the future, if a per-domain memory allocation solution for
> rangeset is ready, we do need to limit the rangeset size. Does this
> sounds more acceptable?

The lower the limit the better (but no matter how low the limit
it won't make this a pretty thing). Anyway I'd still like to wait
for what Ian may further say on this.

Jan
Yu Zhang Feb. 3, 2016, 7:10 a.m. UTC | #31
On 2/2/2016 11:21 PM, Jan Beulich wrote:
>>>> On 02.02.16 at 16:00, <yu.c.zhang@linux.intel.com> wrote:
>> The limit of 4G is to avoid the data missing from uint64 to uint32
>> assignment. And I can accept the 8K limit for XenGT in practice.
>> After all, it is vGPU page tables we are trying to trap and emulate,
>> not normal page frames.
>>
>> And I guess the reason that one domain exhausting Xen's memory can
>> affect another domain is because rangeset uses Xen heap, instead of the
>> per-domain memory. So what about we use a 8K limit by now for XenGT,
>> and in the future, if a per-domain memory allocation solution for
>> rangeset is ready, we do need to limit the rangeset size. Does this
>> sounds more acceptable?
>
> The lower the limit the better (but no matter how low the limit
> it won't make this a pretty thing). Anyway I'd still like to wait
> for what Ian may further say on this.
>
Hi Jan, I just had a discussion with my colleague. We believe 8K could
be the biggest limit needed for the write-protected ram ranges. If, in
the future, the number of vGPU page tables exceeds this limit, we will
modify our back-end device model to find a trade-off method, instead of
extending this limit. If you can accept this value as the upper bound
of the rangeset, maybe we do not need to add any tool stack parameters,
but define a MAX_NR_WR_RAM_RANGES for the write-protected ram rangeset.
As to other rangesets, we keep their limit at 256. Does this sound OK? :)

B.R.
Yu
Jan Beulich Feb. 3, 2016, 8:32 a.m. UTC | #32
>>> On 03.02.16 at 08:10, <yu.c.zhang@linux.intel.com> wrote:
> On 2/2/2016 11:21 PM, Jan Beulich wrote:
>>>>> On 02.02.16 at 16:00, <yu.c.zhang@linux.intel.com> wrote:
>>> The limit of 4G is to avoid the data missing from uint64 to uint32
>>> assignment. And I can accept the 8K limit for XenGT in practice.
>>> After all, it is vGPU page tables we are trying to trap and emulate,
>>> not normal page frames.
>>>
>>> And I guess the reason that one domain exhausting Xen's memory can
>>> affect another domain is because rangeset uses Xen heap, instead of the
>>> per-domain memory. So what about we use a 8K limit by now for XenGT,
>>> and in the future, if a per-domain memory allocation solution for
>>> rangeset is ready, we do need to limit the rangeset size. Does this
>>> sounds more acceptable?
>>
>> The lower the limit the better (but no matter how low the limit
>> it won't make this a pretty thing). Anyway I'd still like to wait
>> for what Ian may further say on this.
>>
> Hi Jan, I just had a discussion with my colleague. We believe 8K could
> be the biggest limit for the write-protected ram ranges. If in the
> future, number of vGPU page tables exceeds this limit, we will modify
> our back-end device model to find a trade-off method, instead of
> extending this limit. If you can accept this value as the upper bound
> of rangeset, maybe we do not need to add any tool stack parameters, but
> define a MAX_NR_WR_RAM_RANGES for the write-protected ram rangesset. As
> to other rangesets, we keep their limit as 256. Does this sounds OK? :)

I'm getting the impression that we're moving in circles. A blanket
limit above the 256 one for all domains is _not_ going to be
acceptable; going to 8k will still need host admin consent. With
your rangeset performance improvement patch, each range is
going to be tracked by a 40 byte structure (up from 32), which
already means an overhead increase for all the other ranges. 8k
of wp ranges implies an overhead beyond 448k (including the
xmalloc() overhead), which is not _that_ much, but also not
negligible.

Jan
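
(The arithmetic behind that estimate, assuming roughly 16 bytes of
xmalloc() bookkeeping per allocation - an assumption, as the thread does
not state the per-allocation overhead: 8192 ranges x 40 bytes = 320 KiB
for the structures themselves, and 8192 x (40 + 16) bytes = 448 KiB once
the allocation overhead is included.)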
Paul Durrant Feb. 3, 2016, 12:20 p.m. UTC | #33
> -----Original Message-----
> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: 03 February 2016 08:33
> To: Zhang Yu
> Cc: Andrew Cooper; Ian Campbell; Paul Durrant; Wei Liu; Ian Jackson; Stefano
> Stabellini; Kevin Tian; zhiyuan.lv@intel.com; xen-devel@lists.xen.org; Keir
> (Xen.org)
> Subject: Re: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter
> max_wp_ram_ranges.
> 
> >>> On 03.02.16 at 08:10, <yu.c.zhang@linux.intel.com> wrote:
> > On 2/2/2016 11:21 PM, Jan Beulich wrote:
> >>>>> On 02.02.16 at 16:00, <yu.c.zhang@linux.intel.com> wrote:
> >>> The limit of 4G is to avoid the data missing from uint64 to uint32
> >>> assignment. And I can accept the 8K limit for XenGT in practice.
> >>> After all, it is vGPU page tables we are trying to trap and emulate,
> >>> not normal page frames.
> >>>
> >>> And I guess the reason that one domain exhausting Xen's memory can
> >>> affect another domain is because rangeset uses Xen heap, instead of
> the
> >>> per-domain memory. So what about we use a 8K limit by now for XenGT,
> >>> and in the future, if a per-domain memory allocation solution for
> >>> rangeset is ready, we do need to limit the rangeset size. Does this
> >>> sounds more acceptable?
> >>
> >> The lower the limit the better (but no matter how low the limit
> >> it won't make this a pretty thing). Anyway I'd still like to wait
> >> for what Ian may further say on this.
> >>
> > Hi Jan, I just had a discussion with my colleague. We believe 8K could
> > be the biggest limit for the write-protected ram ranges. If in the
> > future, number of vGPU page tables exceeds this limit, we will modify
> > our back-end device model to find a trade-off method, instead of
> > extending this limit. If you can accept this value as the upper bound
> > of rangeset, maybe we do not need to add any tool stack parameters, but
> > define a MAX_NR_WR_RAM_RANGES for the write-protected ram
> rangesset. As
> > to other rangesets, we keep their limit as 256. Does this sounds OK? :)
> 
> I'm getting the impression that we're moving in circles. A blanket
> limit above the 256 one for all domains is _not_ going to be
> acceptable; going to 8k will still need host admin consent. With
> your rangeset performance improvement patch, each range is
> going to be tracked by a 40 byte structure (up from 32), which
> already means an overhead increase for all the other ranges. 8k
> of wp ranges implies an overhead beyond 448k (including the
> xmalloc() overhead), which is not _that_ much, but also not
> negligible.
> 

... which means we are still going to need a toolstack parameter to set the limit. We already have a parameter for VRAM size, so is having a parameter for max. GTT shadow ranges such a bad thing? Is the fact that the memory comes from xenheap rather than domheap the real problem?

  Paul

> Jan
Jan Beulich Feb. 3, 2016, 12:35 p.m. UTC | #34
>>> On 03.02.16 at 13:20, <Paul.Durrant@citrix.com> wrote:
>>  -----Original Message-----
>> From: Jan Beulich [mailto:JBeulich@suse.com]
>> Sent: 03 February 2016 08:33
>> To: Zhang Yu
>> Cc: Andrew Cooper; Ian Campbell; Paul Durrant; Wei Liu; Ian Jackson; Stefano
>> Stabellini; Kevin Tian; zhiyuan.lv@intel.com; xen-devel@lists.xen.org; Keir
>> (Xen.org)
>> Subject: Re: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter
>> max_wp_ram_ranges.
>> 
>> >>> On 03.02.16 at 08:10, <yu.c.zhang@linux.intel.com> wrote:
>> > On 2/2/2016 11:21 PM, Jan Beulich wrote:
>> >>>>> On 02.02.16 at 16:00, <yu.c.zhang@linux.intel.com> wrote:
>> >>> The limit of 4G is to avoid the data missing from uint64 to uint32
>> >>> assignment. And I can accept the 8K limit for XenGT in practice.
>> >>> After all, it is vGPU page tables we are trying to trap and emulate,
>> >>> not normal page frames.
>> >>>
>> >>> And I guess the reason that one domain exhausting Xen's memory can
>> >>> affect another domain is because rangeset uses Xen heap, instead of
>> the
>> >>> per-domain memory. So what about we use a 8K limit by now for XenGT,
>> >>> and in the future, if a per-domain memory allocation solution for
>> >>> rangeset is ready, we do need to limit the rangeset size. Does this
>> >>> sounds more acceptable?
>> >>
>> >> The lower the limit the better (but no matter how low the limit
>> >> it won't make this a pretty thing). Anyway I'd still like to wait
>> >> for what Ian may further say on this.
>> >>
>> > Hi Jan, I just had a discussion with my colleague. We believe 8K could
>> > be the biggest limit for the write-protected ram ranges. If in the
>> > future, number of vGPU page tables exceeds this limit, we will modify
>> > our back-end device model to find a trade-off method, instead of
>> > extending this limit. If you can accept this value as the upper bound
>> > of rangeset, maybe we do not need to add any tool stack parameters, but
>> > define a MAX_NR_WR_RAM_RANGES for the write-protected ram
>> rangesset. As
>> > to other rangesets, we keep their limit as 256. Does this sounds OK? :)
>> 
>> I'm getting the impression that we're moving in circles. A blanket
>> limit above the 256 one for all domains is _not_ going to be
>> acceptable; going to 8k will still need host admin consent. With
>> your rangeset performance improvement patch, each range is
>> going to be tracked by a 40 byte structure (up from 32), which
>> already means an overhead increase for all the other ranges. 8k
>> of wp ranges implies an overhead beyond 448k (including the
>> xmalloc() overhead), which is not _that_ much, but also not
>> negligible.
>> 
> 
> ... which means we are still going to need a toolstack parameter to set the 
> limit. We already have a parameter for VRAM size so is having a parameter for 
> max. GTT shadow ranges such a bad thing?

It's workable, but not nice (see also Ian's earlier response).

> Is the fact that the memory comes 
> from xenheap rather than domheap the real problem?

Not the primary one, since except on huge memory machines
both heaps are identical. To me the primary one is the quite
more significant resource consumption in the first place (I'm not
going to repeat what I've written in already way too many
replies before).

Jan
Paul Durrant Feb. 3, 2016, 12:50 p.m. UTC | #35
> -----Original Message-----
> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: 03 February 2016 12:36
> To: Paul Durrant
> Cc: Andrew Cooper; Ian Campbell; Ian Jackson; Stefano Stabellini; Wei Liu;
> Kevin Tian; zhiyuan.lv@intel.com; Zhang Yu; xen-devel@lists.xen.org; Keir
> (Xen.org)
> Subject: RE: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter
> max_wp_ram_ranges.
> 
> >>> On 03.02.16 at 13:20, <Paul.Durrant@citrix.com> wrote:
> >>  -----Original Message-----
> >> From: Jan Beulich [mailto:JBeulich@suse.com]
> >> Sent: 03 February 2016 08:33
> >> To: Zhang Yu
> >> Cc: Andrew Cooper; Ian Campbell; Paul Durrant; Wei Liu; Ian Jackson;
> Stefano
> >> Stabellini; Kevin Tian; zhiyuan.lv@intel.com; xen-devel@lists.xen.org; Keir
> >> (Xen.org)
> >> Subject: Re: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter
> >> max_wp_ram_ranges.
> >>
> >> >>> On 03.02.16 at 08:10, <yu.c.zhang@linux.intel.com> wrote:
> >> > On 2/2/2016 11:21 PM, Jan Beulich wrote:
> >> >>>>> On 02.02.16 at 16:00, <yu.c.zhang@linux.intel.com> wrote:
> >> >>> The limit of 4G is to avoid the data missing from uint64 to uint32
> >> >>> assignment. And I can accept the 8K limit for XenGT in practice.
> >> >>> After all, it is vGPU page tables we are trying to trap and emulate,
> >> >>> not normal page frames.
> >> >>>
> >> >>> And I guess the reason that one domain exhausting Xen's memory
> can
> >> >>> affect another domain is because rangeset uses Xen heap, instead of
> >> the
> >> >>> per-domain memory. So what about we use a 8K limit by now for
> XenGT,
> >> >>> and in the future, if a per-domain memory allocation solution for
> >> >>> rangeset is ready, we do need to limit the rangeset size. Does this
> >> >>> sounds more acceptable?
> >> >>
> >> >> The lower the limit the better (but no matter how low the limit
> >> >> it won't make this a pretty thing). Anyway I'd still like to wait
> >> >> for what Ian may further say on this.
> >> >>
> >> > Hi Jan, I just had a discussion with my colleague. We believe 8K could
> >> > be the biggest limit for the write-protected ram ranges. If in the
> >> > future, number of vGPU page tables exceeds this limit, we will modify
> >> > our back-end device model to find a trade-off method, instead of
> >> > extending this limit. If you can accept this value as the upper bound
> >> > of rangeset, maybe we do not need to add any tool stack parameters,
> but
> >> > define a MAX_NR_WR_RAM_RANGES for the write-protected ram
> >> rangesset. As
> >> > to other rangesets, we keep their limit as 256. Does this sounds OK? :)
> >>
> >> I'm getting the impression that we're moving in circles. A blanket
> >> limit above the 256 one for all domains is _not_ going to be
> >> acceptable; going to 8k will still need host admin consent. With
> >> your rangeset performance improvement patch, each range is
> >> going to be tracked by a 40 byte structure (up from 32), which
> >> already means an overhead increase for all the other ranges. 8k
> >> of wp ranges implies an overhead beyond 448k (including the
> >> xmalloc() overhead), which is not _that_ much, but also not
> >> negligible.
> >>
> >
> > ... which means we are still going to need a toolstack parameter to set the
> > limit. We already have a parameter for VRAM size so is having a parameter
> for
> > max. GTT shadow ranges such a bad thing?
> 
> It's workable, but not nice (see also Ian's earlier response).
> 
> > Is the fact that the memory comes
> > from xenheap rather than domheap the real problem?
> 
> Not the primary one, since except on huge memory machines
> both heaps are identical. To me the primary one is the quite
> more significant resource consumption in the first place (I'm not
> going to repeat what I've written in already way too many
> replies before).

Ok. Well the only way round tracking specific ranges for emulation (and consequently suffering the overhead) is tracking by type. For XenGT I guess it would be possible to live with a situation where a single ioreq server can register all wp mem emulations for a given VM. I can't say I particularly like that way of doing things but if it's the only way forward then I guess we may have to live with it.

  Paul

> 
> Jan
Jan Beulich Feb. 3, 2016, 1 p.m. UTC | #36
>>> On 03.02.16 at 13:50, <Paul.Durrant@citrix.com> wrote:
>>  -----Original Message-----
>> From: Jan Beulich [mailto:JBeulich@suse.com]
>> Sent: 03 February 2016 12:36
>> To: Paul Durrant
>> Cc: Andrew Cooper; Ian Campbell; Ian Jackson; Stefano Stabellini; Wei Liu;
>> Kevin Tian; zhiyuan.lv@intel.com; Zhang Yu; xen-devel@lists.xen.org; Keir
>> (Xen.org)
>> Subject: RE: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter
>> max_wp_ram_ranges.
>> 
>> >>> On 03.02.16 at 13:20, <Paul.Durrant@citrix.com> wrote:
>> >>  -----Original Message-----
>> >> From: Jan Beulich [mailto:JBeulich@suse.com]
>> >> Sent: 03 February 2016 08:33
>> >> To: Zhang Yu
>> >> Cc: Andrew Cooper; Ian Campbell; Paul Durrant; Wei Liu; Ian Jackson;
>> Stefano
>> >> Stabellini; Kevin Tian; zhiyuan.lv@intel.com; xen-devel@lists.xen.org; Keir
>> >> (Xen.org)
>> >> Subject: Re: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter
>> >> max_wp_ram_ranges.
>> >>
>> >> >>> On 03.02.16 at 08:10, <yu.c.zhang@linux.intel.com> wrote:
>> >> > On 2/2/2016 11:21 PM, Jan Beulich wrote:
>> >> >>>>> On 02.02.16 at 16:00, <yu.c.zhang@linux.intel.com> wrote:
>> >> >>> The limit of 4G is to avoid the data missing from uint64 to uint32
>> >> >>> assignment. And I can accept the 8K limit for XenGT in practice.
>> >> >>> After all, it is vGPU page tables we are trying to trap and emulate,
>> >> >>> not normal page frames.
>> >> >>>
>> >> >>> And I guess the reason that one domain exhausting Xen's memory
>> can
>> >> >>> affect another domain is because rangeset uses Xen heap, instead of
>> >> the
>> >> >>> per-domain memory. So what about we use a 8K limit by now for
>> XenGT,
>> >> >>> and in the future, if a per-domain memory allocation solution for
>> >> >>> rangeset is ready, we do need to limit the rangeset size. Does this
>> >> >>> sounds more acceptable?
>> >> >>
>> >> >> The lower the limit the better (but no matter how low the limit
>> >> >> it won't make this a pretty thing). Anyway I'd still like to wait
>> >> >> for what Ian may further say on this.
>> >> >>
>> >> > Hi Jan, I just had a discussion with my colleague. We believe 8K could
>> >> > be the biggest limit for the write-protected ram ranges. If in the
>> >> > future, number of vGPU page tables exceeds this limit, we will modify
>> >> > our back-end device model to find a trade-off method, instead of
>> >> > extending this limit. If you can accept this value as the upper bound
>> >> > of rangeset, maybe we do not need to add any tool stack parameters,
>> but
>> >> > define a MAX_NR_WR_RAM_RANGES for the write-protected ram
>> >> rangesset. As
>> >> > to other rangesets, we keep their limit as 256. Does this sounds OK? :)
>> >>
>> >> I'm getting the impression that we're moving in circles. A blanket
>> >> limit above the 256 one for all domains is _not_ going to be
>> >> acceptable; going to 8k will still need host admin consent. With
>> >> your rangeset performance improvement patch, each range is
>> >> going to be tracked by a 40 byte structure (up from 32), which
>> >> already means an overhead increase for all the other ranges. 8k
>> >> of wp ranges implies an overhead beyond 448k (including the
>> >> xmalloc() overhead), which is not _that_ much, but also not
>> >> negligible.
>> >>
>> >
>> > ... which means we are still going to need a toolstack parameter to set the
>> > limit. We already have a parameter for VRAM size so is having a parameter
>> for
>> > max. GTT shadow ranges such a bad thing?
>> 
>> It's workable, but not nice (see also Ian's earlier response).
>> 
>> > Is the fact that the memory comes
>> > from xenheap rather than domheap the real problem?
>> 
>> Not the primary one, since except on huge memory machines
>> both heaps are identical. To me the primary one is the quite
>> more significant resource consumption in the first place (I'm not
>> going to repeat what I've written in already way too many
>> replies before).
> 
> Ok. Well the only way round tracking specific ranges for emulation (and 
> consequently suffering the overhead) is tracking by type. For XenGT I guess 
> it would be possible to live with a situation where a single ioreq server can 
> register all wp mem emulations for a given VM. I can't say I particularly 
> like that way of doing things but if it's the only way forward then I guess 
> we may have to live with it.

Well, subject to Ian not objecting (still awaiting some follow-up by
him), I didn't mean to say doing it the proposed way is a no-go.
All that I really insist on is that this larger resource consumption
won't go without some form of host admin consent.

Jan
Paul Durrant Feb. 3, 2016, 1:07 p.m. UTC | #37
> -----Original Message-----
[snip]
> >> >> I'm getting the impression that we're moving in circles. A blanket
> >> >> limit above the 256 one for all domains is _not_ going to be
> >> >> acceptable; going to 8k will still need host admin consent. With
> >> >> your rangeset performance improvement patch, each range is
> >> >> going to be tracked by a 40 byte structure (up from 32), which
> >> >> already means an overhead increase for all the other ranges. 8k
> >> >> of wp ranges implies an overhead beyond 448k (including the
> >> >> xmalloc() overhead), which is not _that_ much, but also not
> >> >> negligible.
> >> >>
> >> >
> >> > ... which means we are still going to need a toolstack parameter to set
> the
> >> > limit. We already have a parameter for VRAM size so is having a
> parameter
> >> for
> >> > max. GTT shadow ranges such a bad thing?
> >>
> >> It's workable, but not nice (see also Ian's earlier response).
> >>
> >> > Is the fact that the memory comes
> >> > from xenheap rather than domheap the real problem?
> >>
> >> Not the primary one, since except on huge memory machines
> >> both heaps are identical. To me the primary one is the quite
> >> more significant resource consumption in the first place (I'm not
> >> going to repeat what I've written in already way too many
> >> replies before).
> >
> > Ok. Well the only way round tracking specific ranges for emulation (and
> > consequently suffering the overhead) is tracking by type. For XenGT I
> guess
> > it would be possible to live with a situation where a single ioreq server can
> > register all wp mem emulations for a given VM. I can't say I particularly
> > like that way of doing things but if it's the only way forward then I guess
> > we may have to live with it.
> 
> Well, subject to Ian not objecting (still awaiting some follow-up by
> him), I didn't mean to say doing it the proposed way is a no-go.
> All that I really insist on is that this larger resource consumption
> won't go without some form of host admin consent.
> 

Would you be ok with purely host admin consent e.g. just setting the limit via boot command line?

  Paul

> Jan
Jan Beulich Feb. 3, 2016, 1:17 p.m. UTC | #38
>>> On 03.02.16 at 14:07, <Paul.Durrant@citrix.com> wrote:
>>  -----Original Message-----
> [snip]
>> >> >> I'm getting the impression that we're moving in circles. A blanket
>> >> >> limit above the 256 one for all domains is _not_ going to be
>> >> >> acceptable; going to 8k will still need host admin consent. With
>> >> >> your rangeset performance improvement patch, each range is
>> >> >> going to be tracked by a 40 byte structure (up from 32), which
>> >> >> already means an overhead increase for all the other ranges. 8k
>> >> >> of wp ranges implies an overhead beyond 448k (including the
>> >> >> xmalloc() overhead), which is not _that_ much, but also not
>> >> >> negligible.
>> >> >>
>> >> >
>> >> > ... which means we are still going to need a toolstack parameter to set
>> the
>> >> > limit. We already have a parameter for VRAM size so is having a
>> parameter
>> >> for
>> >> > max. GTT shadow ranges such a bad thing?
>> >>
>> >> It's workable, but not nice (see also Ian's earlier response).
>> >>
>> >> > Is the fact that the memory comes
>> >> > from xenheap rather than domheap the real problem?
>> >>
>> >> Not the primary one, since except on huge memory machines
>> >> both heaps are identical. To me the primary one is the quite
>> >> more significant resource consumption in the first place (I'm not
>> >> going to repeat what I've written in already way too many
>> >> replies before).
>> >
>> > Ok. Well the only way round tracking specific ranges for emulation (and
>> > consequently suffering the overhead) is tracking by type. For XenGT I
>> guess
>> > it would be possible to live with a situation where a single ioreq server can
>> > register all wp mem emulations for a given VM. I can't say I particularly
>> > like that way of doing things but if it's the only way forward then I guess
>> > we may have to live with it.
>> 
>> Well, subject to Ian not objecting (still awaiting some follow-up by
>> him), I didn't mean to say doing it the proposed way is a no-go.
>> All that I really insist on is that this larger resource consumption
>> won't go without some form of host admin consent.
> 
> Would you be ok with purely host admin consent e.g. just setting the limit 
> via boot command line?

I wouldn't be happy with that (and I've said so before), since it
would allow all VMs this extra resource consumption.

Jan
Paul Durrant Feb. 3, 2016, 1:18 p.m. UTC | #39
> -----Original Message-----
> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: 03 February 2016 13:18
> To: Paul Durrant
> Cc: Andrew Cooper; Ian Campbell; Ian Jackson; Stefano Stabellini; Wei Liu;
> Kevin Tian; zhiyuan.lv@intel.com; Zhang Yu; xen-devel@lists.xen.org; Keir
> (Xen.org)
> Subject: RE: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter
> max_wp_ram_ranges.
> 
> >>> On 03.02.16 at 14:07, <Paul.Durrant@citrix.com> wrote:
> >>  -----Original Message-----
> > [snip]
> >> >> >> I'm getting the impression that we're moving in circles. A blanket
> >> >> >> limit above the 256 one for all domains is _not_ going to be
> >> >> >> acceptable; going to 8k will still need host admin consent. With
> >> >> >> your rangeset performance improvement patch, each range is
> >> >> >> going to be tracked by a 40 byte structure (up from 32), which
> >> >> >> already means an overhead increase for all the other ranges. 8k
> >> >> >> of wp ranges implies an overhead beyond 448k (including the
> >> >> >> xmalloc() overhead), which is not _that_ much, but also not
> >> >> >> negligible.
> >> >> >>
> >> >> >
> >> >> > ... which means we are still going to need a toolstack parameter to
> set
> >> the
> >> >> > limit. We already have a parameter for VRAM size so is having a
> >> parameter
> >> >> for
> >> >> > max. GTT shadow ranges such a bad thing?
> >> >>
> >> >> It's workable, but not nice (see also Ian's earlier response).
> >> >>
> >> >> > Is the fact that the memory comes
> >> >> > from xenheap rather than domheap the real problem?
> >> >>
> >> >> Not the primary one, since except on huge memory machines
> >> >> both heaps are identical. To me the primary one is the quite
> >> >> more significant resource consumption in the first place (I'm not
> >> >> going to repeat what I've written in already way too many
> >> >> replies before).
> >> >
> >> > Ok. Well the only way round tracking specific ranges for emulation (and
> >> > consequently suffering the overhead) is tracking by type. For XenGT I
> >> guess
> >> > it would be possible to live with a situation where a single ioreq server
> can
> >> > register all wp mem emulations for a given VM. I can't say I particularly
> >> > like that way of doing things but if it's the only way forward then I guess
> >> > we may have to live with it.
> >>
> >> Well, subject to Ian not objecting (still awaiting some follow-up by
> >> him), I didn't mean to say doing it the proposed way is a no-go.
> >> All that I really insist on is that this larger resource consumption
> >> won't go without some form of host admin consent.
> >
> > Would you be ok with purely host admin consent e.g. just setting the limit
> > via boot command line?
> 
> I wouldn't be happy with that (and I've said so before), since it
> would allow all VM this extra resource consumption.
> 

The ball is back in Ian's court then.

  Paul

> Jan
Ian Jackson Feb. 3, 2016, 2:43 p.m. UTC | #40
Paul Durrant writes ("RE: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter max_wp_ram_ranges."):
> > From: Jan Beulich [mailto:JBeulich@suse.com]
...
> > I wouldn't be happy with that (and I've said so before), since it
> > would allow all VM this extra resource consumption.
> 
> The ball is back in Ian's court then.

Sorry to be vague, but: I'm not definitely objecting to some toolstack
parameter.  I'm trying to figure out whether this parameter, in this
form, with this documentation, makes some kind of sense.

In the most recent proposed patch the docs basically say (to most
users) "there is this parameter, but it is very complicated, so do not
set it."  We already have a lot of these kinds of parameters.  As a
general rule they are OK if it is really the case that the parameter
should be ignored.  I am happy to have a whole lot of strange
parameters that the user can ignore.

But as far as I can tell from this conversation, users are going to
need to set this parameter in normal operation in some
configurations.

I would ideally like to avoid a situation where (i) the Xen docs say
"do not set this parameter because it is confusing" but (ii) other
less authoritative sources (wiki pages, or mailing list threads, etc.)
say "oh yes just set this weird parameter to 8192 for no readily
comprehensible reason".

I say `some configurations' because, I'm afraid, most of the
conversation about hypervisor internals has gone over my head.  Let me
try to summarise (correct me if I am wrong):

 * There are some hypervisor tracking resources associated with each
   emulated MMIO range.

   (Do we mean the memory ranges that are configured in the hypervisor
   to be sent to an ioemu via the ioreq protocol - ie, the system
   which is normally used in HVM domains to interface to the device
   model?)

   (Are these ranges denominated in guest-physical space?)

 * For almost all domains the set of such MMIO ranges is small or very
   small.

 * Such ranges are sometimes created by, or specified by, the guest.

   (I don't understand why this should be the case but perhaps this is
   an inherent aspect of the design of this new feature.)

 * If too many such ranges were created by the guest the guest could
   consume excessive hypervisor memory.

 * Therefore normally the number of such ranges per guest is (or
   should be) limited to a small number.

 * With `Intel GVT-g broadwell platform' and `vGPU in GVT-g' or
   `GVT-d' it may be necessary for functionality to allow a larger
   number of such ranges.

But to be able to know what the right interface is for the system
administrator (and what to write in the docs), I need know:

 * Is it possible for libxl to somehow tell from the rest of the
   configuration that this larger limit should be applied ?

 * In the configurations where a larger number is needed, what larger
   limit is appropriate ?  How should it be calculated ?

 * How do we know that this does not itself give an opportunity for
   hypervisor resource exhaustion attacks by guests ?  (Note: if it
   _does_ give such an opportunity, this should be mentioned more
   clearly in the documentation.)

 * If we are talking about mmio ranges for ioreq servers, why do
   guests which do not use this feature have the ability to create
   them at all ?

(A background problem I have is that this thread is full of product
name jargon and assumes a lot of background knowledge of the
implementation of these features - background knowledge which I lack
and which isn't in these patches.  If someone could point me at a
quick summary of what `GVT-g' and `GVT-d' are that might help.)

Ian.
Paul Durrant Feb. 3, 2016, 3:10 p.m. UTC | #41
> -----Original Message-----
> From: Ian Jackson [mailto:Ian.Jackson@eu.citrix.com]
> Sent: 03 February 2016 14:44
> To: Paul Durrant
> Cc: Jan Beulich; Andrew Cooper; Ian Campbell; Stefano Stabellini; Wei Liu;
> Kevin Tian; zhiyuan.lv@intel.com; Zhang Yu; xen-devel@lists.xen.org; Keir
> (Xen.org)
> Subject: RE: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter
> max_wp_ram_ranges.
> 
> Paul Durrant writes ("RE: [Xen-devel] [PATCH v3 3/3] tools: introduce
> parameter max_wp_ram_ranges."):
> > > From: Jan Beulich [mailto:JBeulich@suse.com]
> ...
> > > I wouldn't be happy with that (and I've said so before), since it
> > > would allow all VM this extra resource consumption.
> >
> > The ball is back in Ian's court then.
> 
> Sorry to be vague, but: I'm not definitely objecting to some toolstack
> parameter.  I'm trying to figure out whether this parameter, in this
> form, with this documentation, makes some kind of sense.
> 
> In the most recent proposed patch the docs basically say (to most
> users) "there is this parameter, but it is very complicated, so do not
> set it.  We already have a lot of these kind of parameters.  As a
> general rule they are OK if it is really the case that the parameter
> should be ignored.  I am happy to have a whole lot of strange
> parameters that the user can ignore.
> 
> But as far as I can tell from this conversation, users are going to
> need to set this parameter in normal operation in some
> configurations.
> 
> I would ideally like to avoid a situation where (i) the Xen docs say
> "do not set this parameter because it is confusing" but (ii) other
> less authoritative sources (wiki pages, or mailing list threads, etc.)
> say "oh yes just set this weird parameter to 8192 for no readily
> comprehensible reason".
> 
> I say `some configurations' because, I'm afraid, most of the
> conversation about hypervisor internals has gone over my head.  Let me
> try to summarise (correct me if I am wrong):
> 
>  * There are some hypervisor tracking resources associated with each
>    emulated MMIO range.
> 
>    (Do we mean the memory ranges that are configured in the hypervisor
>    to be sent to an ioemu via the ioreq protocol - ie, the system
>    which is normally used in HVM domains to interface to the device
>    model?)

Yes, the ioreq server infrastructure needs rangesets to steer  particular emulation requests to the correct emulator.
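
A much-simplified model of that steering, purely for illustration (the
structures and names below are not Xen's; in the hypervisor the registered
ranges live in per-server rangesets):

    #include <stdint.h>
    #include <stddef.h>
    #include <stdio.h>

    struct range { uint64_t start, end; };       /* inclusive bounds */

    struct ioreq_server {
        const char  *name;
        struct range wp_ram[4];                   /* registered wp-ram ranges */
        size_t       nr;
    };

    /* Return the server whose wp-ram ranges contain addr, or NULL. */
    static struct ioreq_server *select_server(struct ioreq_server *srv,
                                              size_t nr_srv, uint64_t addr)
    {
        for (size_t i = 0; i < nr_srv; i++)
            for (size_t j = 0; j < srv[i].nr; j++)
                if (addr >= srv[i].wp_ram[j].start &&
                    addr <= srv[i].wp_ram[j].end)
                    return &srv[i];
        return NULL;                              /* fall back to the default server */
    }

    int main(void)
    {
        struct ioreq_server gvt = {
            .name = "gvt-g", .wp_ram = { { 0x12345000, 0x12345fff } }, .nr = 1,
        };
        struct ioreq_server *s = select_server(&gvt, 1, 0x12345abc);
        printf("%s\n", s ? s->name : "default");
        return 0;
    }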

> 
>    (Are these ranges denominated in guest-physical space?)
> 

Yes.

>  * For almost all domains the set of such MMIO ranges is small or very
>    small.
> 
>  * Such ranges are sometimes created by, or specified by, the guest.
> 
>    (I don't understand why this should be the case but perhaps this is
>    an inherent aspect of the design of this new feature.)
> 

This is the 'new' thing here. Because XenGT needs to shadow GTTs (i.e. GPU page tables) by steering writes to the GVT-g ioreq server, and because the number of GTTs is broadly determined by the number of applications using the GPU inside the VM, we have a situation where the guest is determining Xen's use of memory.

>  * If too many such ranges were created by the guest the guest could
>    consume excessive hypervisor memory.
> 
>  * Therefore normally the number of such ranges per guest is (or
>    should be) limited to a small number.
> 
>  * With `Intel GVT-g broadwell platform' and `vGPU in GVT-g' or
>    `GVT-d' it may be necessary for functionality to allow a larger
>    number of such ranges.
> 
> But to be able to know what the right interface is for the system
> administrator (and what to write in the docs), I need know:
> 
>  * Is it possible for libxl to somehow tell from the rest of the
>    configuration that this larger limit should be applied ?
> 

If a XenGT-enabled VM is provisioned through libxl then some larger limit is likely to be required. One of the issues is that it is impossible (or at least difficult) to know how many GTTs are going to need to be shadowed.

>  * In the configurations where a larger number is needed, what larger
>    limit is appropriate ?  How should it be calculated ?
> 

Intel are suggesting a static 8k value should be ok.
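
For reference, with the series as posted the admin would express this consent
in the guest config, roughly along these lines (illustrative fragment; the key
is the one this patch adds to xl.cfg):

    # xl.cfg fragment (illustrative): opt in to the larger wp-ram range limit
    max_wp_ram_ranges = 8192    # upper bound on write-protected ram ranges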

>  * How do we know that this does not itself give an opportunity for
>    hypervisor resource exhaustion attacks by guests ?  (Note: if it
>    _does_ give such an opportunity, this should be mentioned more
>    clearly in the documentation.)
> 

In practice XenGT only supports 4 VMs (including dom0) on Haswell and 7 on Broadwell so even with the larger limit we're not talking about a huge amount of resource.

>  * If we are talking about mmio ranges for ioreq servers, why do
>    guests which do not use this feature have the ability to create
>    them at all ?

It's not the guest that directly creates the ranges, it's the emulator. Normally device emulation would require a relatively small number of MMIO ranges and a total number that cannot be influenced by the guest itself. In this case though, as I said above, the number *can* be influenced by the guest (although it is still the emulator which actually causes the ranges to be created).

> 
> (A background problem I have is that this thread is full of product
> name jargon and assumes a lot of background knowledge of the
> implementation of these features - background knowledge which I lack
> and which isn't in these patches.  If someone could point me at a
> quick summary of what `GVT-g' and `GVT-d' are that might help.)
> 

GVT-d is a name applied to PCI passthrough of an Intel GPU. GVT-g is a name applied to Intel GPU virtualization, which makes use of an emulator to mediate guest access to the real GPU so that it is possible to share the resources securely.

  Paul

> Ian.
George Dunlap Feb. 3, 2016, 5:41 p.m. UTC | #42
On Wed, Feb 3, 2016 at 2:43 PM, Ian Jackson <Ian.Jackson@eu.citrix.com> wrote:
> Paul Durrant writes ("RE: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter max_wp_ram_ranges."):
>> > From: Jan Beulich [mailto:JBeulich@suse.com]
> ...
>> > I wouldn't be happy with that (and I've said so before), since it
>> > would allow all VM this extra resource consumption.
>>
>> The ball is back in Ian's court then.
>
> Sorry to be vague, but: I'm not definitely objecting to some toolstack
> parameter.  I'm trying to figure out whether this parameter, in this
> form, with this documentation, makes some kind of sense.
>
> In the most recent proposed patch the docs basically say (to most
> users) "there is this parameter, but it is very complicated, so do not
> set it.  We already have a lot of these kind of parameters.  As a
> general rule they are OK if it is really the case that the parameter
> should be ignored.  I am happy to have a whole lot of strange
> parameters that the user can ignore.
>
> But as far as I can tell from this conversation, users are going to
> need to set this parameter in normal operation in some
> configurations.
>
> I would ideally like to avoid a situation where (i) the Xen docs say
> "do not set this parameter because it is confusing" but (ii) other
> less authoritative sources (wiki pages, or mailing list threads, etc.)
> say "oh yes just set this weird parameter to 8192 for no readily
> comprehensible reason".
>
> I say `some configurations' because, I'm afraid, most of the
> conversation about hypervisor internals has gone over my head.  Let me
> try to summarise (correct me if I am wrong):
>
>  * There are some hypervisor tracking resources associated with each
>    emulated MMIO range.
>
>    (Do we mean the memory ranges that are configured in the hypervisor
>    to be sent to an ioemu via the ioreq protocol - ie, the system
>    which is normally used in HVM domains to interface to the device
>    model?)
>
>    (Are these ranges denominated in guest-physical space?)
>
>  * For almost all domains the set of such MMIO ranges is small or very
>    small.
>
>  * Such ranges are sometimes created by, or specified by, the guest.
>
>    (I don't understand why this should be the case but perhaps this is
>    an inherent aspect of the design of this new feature.)

So the real issue here, as I've said elsewhere, is this:

These are not MMIO regions.  They are not IO because they do not talk
to devices, and they are not regions; they are individual gpfns.

What's happening here is that qemu wants to be able to do the
equivalent of shadow pagetable emulation for the GPU's equivalent of
the pagetables (there's a special name, I forget what they're called).

These gpfns are just normal guest memory, selected by the operating
system and/or the graphics driver to use in the equivalent of a
pagetable for the GPU.

And instead of making a new interface designed to keep track of gpfns,
they are (ab)using the existing "MMIO range" interface.

But of course, since they they aren't actually ranges but just gpfns,
they're scattered randomly throughout the guest physical address
space.

That's why XenGT suddenly wants orders of magnitude more "MMIO
regions" than any other device has ever needed before -- because
they're using a hammer when what they want is a rake.

They claim that 8k "should be enough for anybody", but as far as I can
tell, the theoretical limit of the number of pages being used in the
GPU pagetables could be unbounded.

I think at some point I suggested an alternate design based on marking
such gpfns with a special p2m type; I can't remember if that
suggestion was actually addressed or not.

 -George
George Dunlap Feb. 3, 2016, 5:50 p.m. UTC | #43
On Wed, Feb 3, 2016 at 3:10 PM, Paul Durrant <Paul.Durrant@citrix.com> wrote:
>>  * Is it possible for libxl to somehow tell from the rest of the
>>    configuration that this larger limit should be applied ?
>>
>
> If a XenGT-enabled VM is provisioned through libxl then some larger limit is likely to be required. One of the issues is that it is impossible (or at least difficult) to know how many GTTs are going to need to be shadowed.

By GTT, you mean the GPU pagetables I assume?  So you're talking about
how large this value should be made, not whether the
heuristically-chosen larger value should be used.  libxl should be
able to tell that XenGT is enabled, I assume, so it should be able to
automatically bump this to 8k if necessary, right?
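
A sketch of the heuristic being described, assuming a hypothetical
XenGT-enable flag; the chosen value would then be handed to Xen via the HVM
param this series adds (HVM_PARAM_MAX_WP_RAM_RANGES), e.g. through
xc_hvm_param_set():

    #include <stdint.h>
    #include <stdbool.h>
    #include <stdio.h>

    /* Illustrative only: how a toolstack might pick the wp-ram range limit. */
    static uint64_t choose_wp_ram_limit(bool xengt_enabled, uint64_t user_value)
    {
        const uint64_t default_limit = 256;   /* current rangeset default */
        const uint64_t xengt_limit   = 8192;  /* value suggested for XenGT */

        if (user_value)                        /* explicit admin override */
            return user_value;
        return xengt_enabled ? xengt_limit : default_limit;
    }

    int main(void)
    {
        printf("%llu\n",
               (unsigned long long)choose_wp_ram_limit(true, 0));
        return 0;                              /* prints 8192 */
    }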

But I think you'd still need a parameter that you could tweak if it
turned out that 8k wasn't enough for a particular workload, right?

>>  * If we are talking about mmio ranges for ioreq servers, why do
>>    guests which do not use this feature have the ability to create
>>    them at all ?
>
> It's not the guest that directly creates the ranges, it's the emulator. Normally device emulation would require a relatively small number of MMIO ranges and a total number that cannot be influenced by the guest itself. In this case though, as I said above, the number *can* be influenced by the guest (although it is still the emulator which actually causes the ranges to be created).

Just to make this clear: The guest chooses how many gpfns are used in
the GPU pagetables; for each additional gpfn in the guest pagetable,
qemu / xen have the option of either marking it to be emulated (at the
moment, by marking it as a one-page "MMIO region") or crashing the
guest.

>> (A background problem I have is that this thread is full of product
>> name jargon and assumes a lot of background knowledge of the
>> implementation of these features - background knowledge which I lack
>> and which isn't in these patches.  If someone could point me at a
>> quick summary of what `GVT-g' and `GVT-d' are that might help.)
>>
>
> GVT-d is a name applied to PCI passthrough of an Intel GPU. GVT-g is a name applied to Intel GPU virtualization, which makes use of an emulator to mediate guest access to the real GPU so that it is possible to share the resources securely.

And GTT are the GPU equivalent of page tables?

 -George
George Dunlap Feb. 3, 2016, 6:21 p.m. UTC | #44
On Wed, Feb 3, 2016 at 5:41 PM, George Dunlap
<George.Dunlap@eu.citrix.com> wrote:
> I think at some point I suggested an alternate design based on marking
> such gpfns with a special p2m type; I can't remember if that
> suggestion was actually addressed or not.

FWIW, the thread where I suggested using p2m types was in response to

<1436163912-1506-2-git-send-email-yu.c.zhang@linux.intel.com>

Looking through it again, the main objection Paul gave[1]  was:

"And it's the assertion that use of write_dm will only be relevant to
gfns, and that all such notifications only need go to a single ioreq
server, that I have a problem with. Whilst the use of io ranges to
track gfn updates is, I agree, not ideal I think the overloading of
write_dm is not a step in the right direction."

Two issues raised here, about using only p2m types to implement write_dm:
1. More than one ioreq server may want to use the write_dm functionality
2. ioreq servers may want to use write_dm for things other than individual gpfns

My answer to #1 was:
1. At the moment, we only need to support a single ioreq server using write_dm
2. It's not technically difficult to extend the number of servers
supported to something sensible, like 4 (using 4 different write_dm
p2m types)
3. The interface can be designed such that we can extend support to
multiple servers when we need to.

My answer to #2 was that there's no reason why write_dm couldn't be
used for both individual gpfns and ranges; there's no reason the
interface can't take a "start" and "count" argument, even if for the
time being "count" is almost always going to be 1.
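
To make the shape of such an interface concrete, a hypothetical payload
taking a start and a count (none of this is an existing Xen interface; the
struct name and fields are illustrative only):

    #include <stdint.h>

    /*
     * Hypothetical hypercall payload: ask Xen to forward writes to the
     * given gfns to the nominated ioreq server, using a p2m type rather
     * than a per-range tracking structure.
     */
    struct xen_hvm_map_gfns_to_ioreq_server {
        uint16_t domid;      /* target domain */
        uint16_t ioservid;   /* ioreq server to notify */
        uint32_t flags;      /* e.g. write-protect and forward writes */
        uint64_t first_gfn;  /* start of the range */
        uint64_t count;      /* number of gfns; often 1 for GTT pages */
    };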

Compare this to the downsides of the approach you're proposing:
1. Using 40 bytes of hypervisor space per guest GPU pagetable page (as
opposed to using a bit in the existing p2m table)
2. Walking down an RB tree with 8000 individual nodes to find out
which server to send the message to (rather than just reading the
value from the p2m table).
3. Needing to determine on a guest-by-guest basis whether to change the limit
4. Needing to have an interface to make the limit even bigger, just in
case we find workloads that have even more GTTs.

I really don't understand where you're coming from on this.  The
approach you've chosen looks to me to be slower, more difficult to
implement, and more complicated; and it's caused a lot more resistance
trying to get this series accepted.

 -George

[1] <9AAE0902D5BC7E449B7C8E4E778ABCD02598B222@AMSPEX01CL02.citrite.net>
George Dunlap Feb. 3, 2016, 6:26 p.m. UTC | #45
On Wed, Feb 3, 2016 at 6:21 PM, George Dunlap
<George.Dunlap@eu.citrix.com> wrote:
> I really don't understand where you're coming from on this.  The
> approach you've chosen looks to me to be slower, more difficult to
> implement, and more complicated; and it's caused a lot more resistance
> trying to get this series accepted.

Maybe it would be helpful if I did a mock-up implementation of what I
had in mind, and you can have something concrete to criticize.  I'll
see if I can do that sometime this week.

 -George
Andrew Cooper Feb. 3, 2016, 6:39 p.m. UTC | #46
On 03/02/16 18:21, George Dunlap wrote:
> 2. It's not technically difficult to extend the number of servers
> supported to something sensible, like 4 (using 4 different write_dm
> p2m types)

While technically true, spare bits in the pagetable entries are at a
premium, and steadily decreasing as Intel are introducing new features.

We have 16 current p2m types, and a finite upper bound of 6 bits of p2m
type space, already with a push to reduce this number.

While introducing 1 new p2m type for this purpose might be an acceptable
tradeoff, using a p2m type per ioreq server is not, IMO.


As an orthogonal exercise, several of the existing p2m types are
redundant and can be (or already are) expressed elsehow.  There is
definitely room to reduce the current 16 types down a bit without too
much effort.

~Andrew
George Dunlap Feb. 3, 2016, 7:12 p.m. UTC | #47
On 03/02/16 18:39, Andrew Cooper wrote:
> On 03/02/16 18:21, George Dunlap wrote:
>> 2. It's not technically difficult to extend the number of servers
>> supported to something sensible, like 4 (using 4 different write_dm
>> p2m types)
> 
> While technically true, spare bits in the pagetable entries are at a
> premium, and steadily decreasing as Intel are introducing new features.
> 
> We have 16 current p2m types, and a finite upper bound of 6 bits of p2m
> type space, already with a push to reduce this number.
> 
> While introducing 1 new p2m type for this purpose might be an acceptable
> tradeoff, a using a p2m type per ioreq server is not IMO.

It is true that we don't have a ton of elbow room to grow at the moment.

But we actually already have a single p2m type -- mmio_write_dm -- that
as far as I know is only being used by XenGT.  We don't actually need to
add any new p2m types for my initial proposal.
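
A minimal standalone sketch of the lookup this approach implies (not the
hypervisor code): the write fault handler reads the page's p2m type and
routes the access accordingly, instead of walking a per-server rangeset:

    #include <stdint.h>
    #include <stdio.h>

    /* Illustrative p2m types; mmio_write_dm is the one XenGT already uses. */
    enum p2m_type { p2m_ram_rw, p2m_mmio_dm, p2m_mmio_write_dm };

    /* Toy p2m: one type per gfn.  In Xen this lives in the EPT/NPT entries. */
    static enum p2m_type p2m[16];

    /* Route a write: an O(1) type read instead of an RB-tree walk. */
    static const char *route_write(uint64_t gfn)
    {
        switch (p2m[gfn]) {
        case p2m_mmio_write_dm: return "forward to ioreq server (e.g. XenGT)";
        case p2m_mmio_dm:       return "forward to default device model";
        default:                return "plain ram: handle normally";
        }
    }

    int main(void)
    {
        p2m[5] = p2m_mmio_write_dm;   /* guest GTT page marked for emulation */
        printf("%s\n", route_write(5));
        return 0;
    }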

Going forward, we probably will, at some point, need to implement a
parallel "p2t" structure to keep track of types -- and probably will
whether end up implementing 4 separate write_dm types or not (for the
reasons you describe).

 -George
Yu Zhang Feb. 4, 2016, 8:50 a.m. UTC | #48
On 2/4/2016 1:50 AM, George Dunlap wrote:
> On Wed, Feb 3, 2016 at 3:10 PM, Paul Durrant <Paul.Durrant@citrix.com> wrote:
>>>   * Is it possible for libxl to somehow tell from the rest of the
>>>     configuration that this larger limit should be applied ?
>>>
>>
>> If a XenGT-enabled VM is provisioned through libxl then some larger limit is likely to be required. One of the issues is that it is impossible (or at least difficult) to know how many GTTs are going to need to be shadowed.
>
> By GTT, you mean the GPU pagetables I assume?  So you're talking about

Yes, GTT is short for "Graphics Translation Table".

> how large this value should be made, not whether the
> heuristically-chosen larger value should be used.  libxl should be
> able to tell that XenGT is enabled, I assume, so it should be able to
> automatically bump this to 8k if necessary, right?
>

Yes.

> But I think you'd still need a parameter that you could tweak if it
> turned out that 8k wasn't enough for a particular workload, right?
>

Well, not exactly. For XenGT, the latest suggestion is that even when 8K
is not enough, we will not extend this limit any further. But when
introducing this parameter, I had thought it might also be helpful for
other device virtualization cases that would like to use the mediated
pass-through idea.

>>>   * If we are talking about mmio ranges for ioreq servers, why do
>>>     guests which do not use this feature have the ability to create
>>>     them at all ?
>>
>> It's not the guest that directly creates the ranges, it's the emulator. Normally device emulation would require a relatively small number of MMIO ranges and a total number that cannot be influenced by the guest itself. In this case though, as I said above, the number *can* be influenced by the guest (although it is still the emulator which actually causes the ranges to be created).
>
> Just to make this clear: The guest chooses how many gpfns are used in
> the GPU pagetables; for each additional gpfn in the guest pagetable,
> qemu / xen have the option of either marking it to be emulated (at the
> moment, by marking it as a one-page "MMIO region") or crashing the
> guest.
>

Well, kind of. The backend device model in dom0 (not qemu) makes the
decision whether or not this page is to be emulated.

>>> (A background problem I have is that this thread is full of product
>>> name jargon and assumes a lot of background knowledge of the
>>> implementation of these features - background knowledge which I lack
>>> and which isn't in these patches.  If someone could point me at a
>>> quick summary of what `GVT-g' and `GVT-d' are that might help.)
>>>
>>
>> GVT-d is a name applied to PCI passthrough of an Intel GPU. GVT-g is a name applied to Intel GPU virtualization, which makes use of an emulator to mediate guest access to the real GPU so that it is possible to share the resources securely.
>
> And GTT are the GPU equivalent of page tables?
>
Yes.

Here let me try to give a brief introduction to the jargon:
* Intel GVT-d: an Intel graphics virtualization solution which dedicates
one physical GPU exclusively to a guest.

* Intel GVT-g: an Intel graphics virtualization solution with mediated
pass-through support. One physical GPU can be shared by multiple guests.
GPU performance-critical resources are partitioned and passed
through to different vGPUs; other GPU resources are trapped and
emulated by the device model.

* XenGT: the code name of Intel GVT-g for Xen.
This patch series provides features required by XenGT.

* vGPU: the virtual GPU presented to a guest.

* GTT: abbreviation for graphics translation table, a page table
structure which translates graphics memory addresses to physical
ones. For a vGPU, the PTEs in its GTTs are GPFNs, which requires
the device model to construct a group of shadow GPU page tables.

Thanks
Yu
Yu Zhang Feb. 4, 2016, 8:50 a.m. UTC | #49
On 2/4/2016 2:21 AM, George Dunlap wrote:
> On Wed, Feb 3, 2016 at 5:41 PM, George Dunlap
> <George.Dunlap@eu.citrix.com> wrote:
>> I think at some point I suggested an alternate design based on marking
>> such gpfns with a special p2m type; I can't remember if that
>> suggestion was actually addressed or not.
>
> FWIW, the thread where I suggested using p2m types was in response to
>
> <1436163912-1506-2-git-send-email-yu.c.zhang@linux.intel.com>
>
> Looking through it again, the main objection Paul gave[1]  was:
>
> "And it's the assertion that use of write_dm will only be relevant to
> gfns, and that all such notifications only need go to a single ioreq
> server, that I have a problem with. Whilst the use of io ranges to
> track gfn updates is, I agree, not ideal I think the overloading of
> write_dm is not a step in the right direction."
>
> Two issues raised here, about using only p2m types to implement write_dm:
> 1. More than one ioreq server may want to use the write_dm functionality
> 2. ioreq servers may want to use write_dm for things other than individual gpfns
>
> My answer to #1 was:
> 1. At the moment, we only need to support a single ioreq server using write_dm
> 2. It's not technically difficult to extend the number of servers
> supported to something sensible, like 4 (using 4 different write_dm
> p2m types)
> 3. The interface can be designed such that we can extend support to
> multiple servers when we need to.
>
> My answer to #2 was that there's no reason why using write_dm could be
> used for both individual gpfns and ranges; there's no reason the
> interface can't take a "start" and "count" argument, even if for the
> time being "count" is almost always going to be 1.
>

Well, regarding the 'count' always going to be 1: I doubt that. :)
Statistics in XenGT show that GPU page tables are very likely to
be allocated in contiguous gpfns.

> Compare this to the downsides of the approach you're proposing:
> 1. Using 40 bytes of hypervisor space per guest GPU pagetable page (as
> opposed to using a bit in the existing p2m table)
> 2. Walking down an RB tree with 8000 individual nodes to find out
> which server to send the message to (rather than just reading the
> value from the p2m table).

8K is an upper limit for the rangeset; in many cases the RB tree will
not contain that many nodes.

> 3. Needing to determine on a guest-by-guest basis whether to change the limit
> 4. Needing to have an interface to make the limit even bigger, just in
> case we find workloads that have even more GTTs.
>

Well, as I suggested in yesterday's reply, XenGT can choose not to
change this limit even when workloads are getting heavy - with
trade-offs on the device model side.

> I really don't understand where you're coming from on this.  The
> approach you've chosen looks to me to be slower, more difficult to
> implement, and more complicated; and it's caused a lot more resistance
> trying to get this series accepted.
>

I agree utilizing the p2m types to do so is more efficient and quite
intuitive. But I hesitate to occupy the software-available bits in EPT
PTEs (as Andrew's reply notes). Although we have introduced one, we believe
it can also be used for other situations in the future, not just XenGT.

Thanks
Yu
Yu Zhang Feb. 4, 2016, 8:51 a.m. UTC | #50
On 2/4/2016 3:12 AM, George Dunlap wrote:
> On 03/02/16 18:39, Andrew Cooper wrote:
>> On 03/02/16 18:21, George Dunlap wrote:
>>> 2. It's not technically difficult to extend the number of servers
>>> supported to something sensible, like 4 (using 4 different write_dm
>>> p2m types)
>>
>> While technically true, spare bits in the pagetable entries are at a
>> premium, and steadily decreasing as Intel are introducing new features.
>>
>> We have 16 current p2m types, and a finite upper bound of 6 bits of p2m
>> type space, already with a push to reduce this number.
>>
>> While introducing 1 new p2m type for this purpose might be an acceptable
>> tradeoff, a using a p2m type per ioreq server is not IMO.
>
> It is true that we don't have a ton of elbow room to grow at the moment.
>
> But we actually already have a single p2m type -- mmio_write_dm -- that
> as far as I know is only being used by XenGT.  We don't actually need to
> add any new p2m types for my initial proposal.
>
> Going forward, we probably will, at some point, need to implement a
> parallel "p2t" structure to keep track of types -- and probably will
> whether end up implementing 4 separate write_dm types or not (for the
> reasons you describe).
>

Thank you, George. Could you please elaborate a bit more on the idea of
"p2t"?

Yu
Paul Durrant Feb. 4, 2016, 9:28 a.m. UTC | #51
> -----Original Message-----
> From: Yu, Zhang [mailto:yu.c.zhang@linux.intel.com]
> Sent: 04 February 2016 08:51
> To: George Dunlap; Ian Jackson
> Cc: Paul Durrant; Kevin Tian; Wei Liu; Ian Campbell; Andrew Cooper; xen-
> devel@lists.xen.org; Stefano Stabellini; zhiyuan.lv@intel.com; Jan Beulich;
> Keir (Xen.org)
> Subject: Re: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter
> max_wp_ram_ranges.
> 
> 
> 
> On 2/4/2016 2:21 AM, George Dunlap wrote:
> > On Wed, Feb 3, 2016 at 5:41 PM, George Dunlap
> > <George.Dunlap@eu.citrix.com> wrote:
> >> I think at some point I suggested an alternate design based on marking
> >> such gpfns with a special p2m type; I can't remember if that
> >> suggestion was actually addressed or not.
> >
> > FWIW, the thread where I suggested using p2m types was in response to
> >
> > <1436163912-1506-2-git-send-email-yu.c.zhang@linux.intel.com>
> >
> > Looking through it again, the main objection Paul gave[1]  was:
> >
> > "And it's the assertion that use of write_dm will only be relevant to
> > gfns, and that all such notifications only need go to a single ioreq
> > server, that I have a problem with. Whilst the use of io ranges to
> > track gfn updates is, I agree, not ideal I think the overloading of
> > write_dm is not a step in the right direction."
> >
> > Two issues raised here, about using only p2m types to implement
> write_dm:
> > 1. More than one ioreq server may want to use the write_dm functionality
> > 2. ioreq servers may want to use write_dm for things other than individual
> gpfns
> >
> > My answer to #1 was:
> > 1. At the moment, we only need to support a single ioreq server using
> write_dm
> > 2. It's not technically difficult to extend the number of servers
> > supported to something sensible, like 4 (using 4 different write_dm
> > p2m types)
> > 3. The interface can be designed such that we can extend support to
> > multiple servers when we need to.
> >
> > My answer to #2 was that there's no reason why using write_dm could be
> > used for both individual gpfns and ranges; there's no reason the
> > interface can't take a "start" and "count" argument, even if for the
> > time being "count" is almost always going to be 1.
> >
> 
> Well, talking about "the 'count' always going to be 1". I doubt that. :)
> Statistics in XenGT shows that, GPU page tables are very likely to
> be allocated in contiguous gpfns.
> 
> > Compare this to the downsides of the approach you're proposing:
> > 1. Using 40 bytes of hypervisor space per guest GPU pagetable page (as
> > opposed to using a bit in the existing p2m table)
> > 2. Walking down an RB tree with 8000 individual nodes to find out
> > which server to send the message to (rather than just reading the
> > value from the p2m table).
> 
> 8K is an upper limit for the rangeset, in many cases the RB tree will
> not contain that many nodes.
> 
> > 3. Needing to determine on a guest-by-guest basis whether to change the
> limit
> > 4. Needing to have an interface to make the limit even bigger, just in
> > case we find workloads that have even more GTTs.
> >
> 
> Well, I have suggested in yesterday's reply. XenGT can choose not to
> change this limit even when workloads are getting heavy - with
> tradeoffs in the device model side.

I assume this means that the emulator can 'unshadow' GTTs (I guess on an LRU basis) so that it can shadow new ones when the limit has been exhausted?
If so, how bad is performance likely to be if we live with a lower limit and take the hit of unshadowing if the guest GTTs become heavily fragmented?
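
A minimal sketch of the kind of LRU unshadow policy being suggested for the
device model (purely illustrative; the structure and helpers below are
assumptions, not XenGT code):

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define MAX_SHADOWED 4              /* stand-in for the wp-ram range limit */

    struct lru {
        uint64_t gfn[MAX_SHADOWED];     /* shadowed GTT pages, MRU first */
        int      nr;
    };

    /* Touch (or insert) gfn; return a gfn to unshadow, or 0 if none. */
    static uint64_t lru_touch(struct lru *l, uint64_t gfn)
    {
        uint64_t evict = 0;
        int i;

        for (i = 0; i < l->nr; i++)
            if (l->gfn[i] == gfn)
                break;                   /* already shadowed: just promote */

        if (i == l->nr) {                /* not shadowed yet */
            if (l->nr == MAX_SHADOWED) {
                evict = l->gfn[l->nr - 1]; /* evict least recently used */
                i = l->nr - 1;             /* reuse its slot */
            } else {
                i = l->nr++;               /* grow the list */
            }
        }

        memmove(&l->gfn[1], &l->gfn[0], i * sizeof(l->gfn[0]));
        l->gfn[0] = gfn;                 /* move to front (most recently used) */
        return evict;
    }

    int main(void)
    {
        struct lru l = { .nr = 0 };
        for (uint64_t g = 1; g <= 6; g++) {
            uint64_t evict = lru_touch(&l, g);
            if (evict)
                printf("unshadow gfn %llu to shadow gfn %llu\n",
                       (unsigned long long)evict, (unsigned long long)g);
        }
        return 0;
    }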

  Paul

> 
> > I really don't understand where you're coming from on this.  The
> > approach you've chosen looks to me to be slower, more difficult to
> > implement, and more complicated; and it's caused a lot more resistance
> > trying to get this series accepted.
> >
> 
> I agree utilizing the p2m types to do so is more efficient and quite
> intuitive. But I hesitate to occupy the software available bits in EPT
> PTEs(like Andrew's reply). Although we have introduced one, we believe
> it can also be used for other situations in the future, not just XenGT.
> 
> Thanks
> Yu
Yu Zhang Feb. 4, 2016, 9:38 a.m. UTC | #52
On 2/4/2016 5:28 PM, Paul Durrant wrote:
>> -----Original Message-----
>> From: Yu, Zhang [mailto:yu.c.zhang@linux.intel.com]
>> Sent: 04 February 2016 08:51
>> To: George Dunlap; Ian Jackson
>> Cc: Paul Durrant; Kevin Tian; Wei Liu; Ian Campbell; Andrew Cooper; xen-
>> devel@lists.xen.org; Stefano Stabellini; zhiyuan.lv@intel.com; Jan Beulich;
>> Keir (Xen.org)
>> Subject: Re: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter
>> max_wp_ram_ranges.
>>
>>
>>
>> On 2/4/2016 2:21 AM, George Dunlap wrote:
>>> On Wed, Feb 3, 2016 at 5:41 PM, George Dunlap
>>> <George.Dunlap@eu.citrix.com> wrote:
>>>> I think at some point I suggested an alternate design based on marking
>>>> such gpfns with a special p2m type; I can't remember if that
>>>> suggestion was actually addressed or not.
>>>
>>> FWIW, the thread where I suggested using p2m types was in response to
>>>
>>> <1436163912-1506-2-git-send-email-yu.c.zhang@linux.intel.com>
>>>
>>> Looking through it again, the main objection Paul gave[1]  was:
>>>
>>> "And it's the assertion that use of write_dm will only be relevant to
>>> gfns, and that all such notifications only need go to a single ioreq
>>> server, that I have a problem with. Whilst the use of io ranges to
>>> track gfn updates is, I agree, not ideal I think the overloading of
>>> write_dm is not a step in the right direction."
>>>
>>> Two issues raised here, about using only p2m types to implement
>> write_dm:
>>> 1. More than one ioreq server may want to use the write_dm functionality
>>> 2. ioreq servers may want to use write_dm for things other than individual
>> gpfns
>>>
>>> My answer to #1 was:
>>> 1. At the moment, we only need to support a single ioreq server using
>> write_dm
>>> 2. It's not technically difficult to extend the number of servers
>>> supported to something sensible, like 4 (using 4 different write_dm
>>> p2m types)
>>> 3. The interface can be designed such that we can extend support to
>>> multiple servers when we need to.
>>>
>>> My answer to #2 was that there's no reason why using write_dm could be
>>> used for both individual gpfns and ranges; there's no reason the
>>> interface can't take a "start" and "count" argument, even if for the
>>> time being "count" is almost always going to be 1.
>>>
>>
>> Well, talking about "the 'count' always going to be 1". I doubt that. :)
>> Statistics in XenGT shows that, GPU page tables are very likely to
>> be allocated in contiguous gpfns.
>>
>>> Compare this to the downsides of the approach you're proposing:
>>> 1. Using 40 bytes of hypervisor space per guest GPU pagetable page (as
>>> opposed to using a bit in the existing p2m table)
>>> 2. Walking down an RB tree with 8000 individual nodes to find out
>>> which server to send the message to (rather than just reading the
>>> value from the p2m table).
>>
>> 8K is an upper limit for the rangeset, in many cases the RB tree will
>> not contain that many nodes.
>>
>>> 3. Needing to determine on a guest-by-guest basis whether to change the
>> limit
>>> 4. Needing to have an interface to make the limit even bigger, just in
>>> case we find workloads that have even more GTTs.
>>>
>>
>> Well, I have suggested in yesterday's reply. XenGT can choose not to
>> change this limit even when workloads are getting heavy - with
>> tradeoffs in the device model side.
>
> I assume this means that the emulator can 'unshadow' GTTs (I guess on an LRU basis) so that it can shadow new ones when the limit has been exhausted?
> If so, how bad is performance likely to be if we live with a lower limit and take the hit of unshadowing if the guest GTTs become heavily fragmented?
>
Thank you, Paul.

Well, I was told the emulator has approaches to delay the shadowing of
the GTT till future GPU commands are submitted. For now, I'm not sure
about the performance penalties if the limit is set too low. Although
we are confident 8K is a secure limit, it still seems too high to be
accepted. We will perform more experiments with this new approach to
find a balance between the lowest acceptable limit and XenGT performance.

So another question is: if the value of this limit really matters, would a
lower one be more acceptable (the current 256 not being enough)?

Thanks
Yu
Paul Durrant Feb. 4, 2016, 9:49 a.m. UTC | #53
> -----Original Message-----
[snip]
> >>> Compare this to the downsides of the approach you're proposing:
> >>> 1. Using 40 bytes of hypervisor space per guest GPU pagetable page (as
> >>> opposed to using a bit in the existing p2m table)
> >>> 2. Walking down an RB tree with 8000 individual nodes to find out
> >>> which server to send the message to (rather than just reading the
> >>> value from the p2m table).
> >>
> >> 8K is an upper limit for the rangeset, in many cases the RB tree will
> >> not contain that many nodes.
> >>
> >>> 3. Needing to determine on a guest-by-guest basis whether to change
> the
> >> limit
> >>> 4. Needing to have an interface to make the limit even bigger, just in
> >>> case we find workloads that have even more GTTs.
> >>>
> >>
> >> Well, I have suggested in yesterday's reply. XenGT can choose not to
> >> change this limit even when workloads are getting heavy - with
> >> tradeoffs in the device model side.
> >
> > I assume this means that the emulator can 'unshadow' GTTs (I guess on an
> LRU basis) so that it can shadow new ones when the limit has been
> exhausted?
> > If so, how bad is performance likely to be if we live with a lower limit and
> take the hit of unshadowing if the guest GTTs become heavily fragmented?
> >
> Thank you, Paul.
> 
> Well, I was told the emulator have approaches to delay the shadowing of
> the GTT till future GPU commands are submitted. By now, I'm not sure
> about the performance penalties if the limit is set too low. Although
> we are confident 8K is a secure limit, it seems still too high to be
> accepted. We will perform more experiments with this new approach to
> find a balance between the lowest limit and the XenGT performance.
> 
> So another question is, if value of this limit really matters, will a
> lower one be more acceptable(the current 256 being not enough)?

Well, do you know conclusively that 256 is absolutely not enough, or is it just too low for acceptable performance?

  Paul

> 
> Thanks
> Yu
Ian Campbell Feb. 4, 2016, 10:06 a.m. UTC | #54
On Wed, 2016-02-03 at 17:41 +0000, George Dunlap wrote:
> But of course, since they they aren't actually ranges but just gpfns,
> they're scattered randomly throughout the guest physical address
> space.

(Possibly) stupid question:

Since, AIUI, the in-guest GPU driver is XenGT-aware, could it not allocate a
contiguous range of pages at start of day to use as GPU PTs? Or even just N
contiguous regions; e.g. I think the "8K" refers to pages, which is 16 2M
allocations, a far more manageable number of ranges to track than
8192 individual pages.

Ian.
Jan Beulich Feb. 4, 2016, 10:34 a.m. UTC | #55
>>> On 04.02.16 at 10:38, <yu.c.zhang@linux.intel.com> wrote:
> So another question is, if value of this limit really matters, will a
> lower one be more acceptable(the current 256 being not enough)?

If you've carefully read George's replies, a primary aspect is
whether we wouldn't be better off reverting commit f5a32c5b8e
("x86/HVM: differentiate IO/mem resources tracked by ioreq
server"), as with the alternative approach we wouldn't even
need HVMOP_IO_RANGE_WP_MEM afaict. And then the question
you raise would become irrelevant.

This part of the public interface being tools-only allows some
freedom in when to do this, but I think it would be a bad idea
to ship 4.7 with this still in if you're not going to pursue that
route.

Jan
George Dunlap Feb. 4, 2016, 10:49 a.m. UTC | #56
On Thu, Feb 4, 2016 at 8:51 AM, Yu, Zhang <yu.c.zhang@linux.intel.com> wrote:
>> Going forward, we probably will, at some point, need to implement a
>> parallel "p2t" structure to keep track of types -- and probably will
>> whether end up implementing 4 separate write_dm types or not (for the
>> reasons you describe).
>>
>
> Thank you, George. Could you please elaborate more about the idea of
> "p2t"?

So the p2m table is a partially-sparse structure designed to map pfns
to mfns.  At the bottom is a 64-bit entry that contains certain bits
specified by the hardware (mfn number, permissions, other features).
There are at the moment extra bits that aren't used, and in these bits
we store information about the pfn that Xen wants to know -- among
other things, the 'type' of the gpfn.

But as Andy pointed out, Intel are adding new features which take up
more of these bits; and at the same time, Xen has more features for
which using a p2m type is the most obvious solution.  So the idea
would be to have a separate structure, similar to the p2m table, but which
wouldn't be used by the hardware -- it would only be used by Xen to
map pfn to type (or whatever other information it wanted about the
gpfn).  This does mean duplicating all the non-leaf nodes, as well as
doing two sparse-tree-walks rather  than just one when looking up gpfn
information.
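
To make the idea concrete, here is a minimal, purely illustrative sketch
(none of these names exist in the tree today) of what the leaf of such a
p2t could look like -- one byte per gpfn is enough, since only the
software-defined information needs to live there:

/* Illustrative only -- no such structure exists in-tree. */
typedef uint8_t p2t_entry_t;             /* just the p2m type, no h/w bits */

#define P2T_ENTRIES_PER_PAGE (PAGE_SIZE / sizeof(p2t_entry_t))

struct p2t {
    p2t_entry_t **leaf;                  /* sparse array of leaf pages     */
    unsigned long nr_leaves;             /* number of leaf-page slots      */
};

static p2m_type_t p2t_get_type(const struct p2t *t, unsigned long gpfn)
{
    unsigned long i = gpfn / P2T_ENTRIES_PER_PAGE;

    /* An unpopulated slot means no special type was recorded. */
    if ( i >= t->nr_leaves || t->leaf[i] == NULL )
        return p2m_ram_rw;

    return t->leaf[i][gpfn % P2T_ENTRIES_PER_PAGE];
}

A real implementation would of course mirror the multi-level layout of the
p2m itself (hence the duplicated non-leaf nodes), but the leaf lookup would
have this shape.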

 -George
George Dunlap Feb. 4, 2016, 11:06 a.m. UTC | #57
On Thu, Feb 4, 2016 at 9:38 AM, Yu, Zhang <yu.c.zhang@linux.intel.com> wrote:
> On 2/4/2016 5:28 PM, Paul Durrant wrote:
>> I assume this means that the emulator can 'unshadow' GTTs (I guess on an
>> LRU basis) so that it can shadow new ones when the limit has been exhausted?
>> If so, how bad is performance likely to be if we live with a lower limit
>> and take the hit of unshadowing if the guest GTTs become heavily fragmented?
>>
> Thank you, Paul.
>
> Well, I was told the emulator have approaches to delay the shadowing of
> the GTT till future GPU commands are submitted. By now, I'm not sure
> about the performance penalties if the limit is set too low. Although
> we are confident 8K is a secure limit, it seems still too high to be
> accepted. We will perform more experiments with this new approach to
> find a balance between the lowest limit and the XenGT performance.

Just to check some of my assumptions:

I assume that unlike memory accesses, your GPU hardware cannot
'recover' from faults in the GTTs. That is, for memory, you can take a
page fault, fix up the pagetables, and then re-execute the original
instruction; but so far I haven't heard of any devices being able to
seamlessly re-execute a transaction after a fault.  Is my
understanding correct?

If that is the case, then for every top-level value (whatever the
equivalent of the CR3), you need to be able to shadow the entire GTT
tree below it, yes?  You can't use a trick that the memory shadow
pagetables can use, of unshadowing parts of the tree and reshadowing
them.

So as long as the currently-in-use GTT tree contains no more than
$LIMIT ranges, you can unshadow and reshadow; this will be slow, but
strictly speaking correct.

What do you do if the guest driver switches to a GTT such that the
entire tree takes up more than $LIMIT entries?

 -George
Ian Campbell Feb. 4, 2016, 11:08 a.m. UTC | #58
On Thu, 2016-02-04 at 10:49 +0000, George Dunlap wrote:
> On Thu, Feb 4, 2016 at 8:51 AM, Yu, Zhang <yu.c.zhang@linux.intel.com>
> wrote:
> > > Going forward, we probably will, at some point, need to implement a
> > > parallel "p2t" structure to keep track of types -- and probably will
> > > whether end up implementing 4 separate write_dm types or not (for the
> > > reasons you describe).
> > > 
> > 
> > Thank you, George. Could you please elaborate more about the idea of
> > "p2t"?
> 
> So the p2m table is a partially-sparse structure designed to map pfns
> to mfns.  At the bottom is a 64-bit entry that contains certain bits
> specified by the hardware (mfn number, permissions, other features).
> There are at the moment extra bits that aren't use, and in these bits
> we store information about the pfn that Xen wants to know -- among
> other things, the 'type' of the gpfn.
> 
> But as Andy pointed out, Intel are adding new features which take up
> more of these bits; and at the same time, Xen has more features for
> which using a p2m type is the most obvious solution.  So the idea
> would be to have a separate structure, similar to the p2m table, but
> wouldn't be used by the hardware -- it would only be used by Xen to
> map pfn to type (or whatever other information it wanted about the
> gpfn).  This does mean duplicating all the non-leaf nodes, as well as
> doing two sparse-tree-walks rather  than just one when looking up gpfn
> information.

FWIW ARM already has such a structure to support xenaccess, since ARM had
far fewer available bits to start with.

It is not quite the same as above since it is only populated with pages for
which xenaccess is in use rather than all pages (and is only consulted if
we have reason to believe the page might be subject to xenaccess). FWIW it
uses Xen's radix tree stuff.
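
From memory (so treat the exact calls as approximate) the shape of it is
roughly:

/* Roughly the shape of it -- exact API details from memory. */
static struct radix_tree_root access_settings;

static void access_settings_init(void)
{
    radix_tree_init(&access_settings);
}

static int access_settings_set(unsigned long gfn, int val)
{
    return radix_tree_insert(&access_settings, gfn,
                             radix_tree_int_to_ptr(val));
}

static int access_settings_get(unsigned long gfn, int default_val)
{
    void *ptr = radix_tree_lookup(&access_settings, gfn);

    return ptr ? radix_tree_ptr_to_int(ptr) : default_val;
}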

Ian.
Ian Campbell Feb. 4, 2016, 11:19 a.m. UTC | #59
On Thu, 2016-02-04 at 11:08 +0000, Ian Campbell wrote:
> On Thu, 2016-02-04 at 10:49 +0000, George Dunlap wrote:
> > On Thu, Feb 4, 2016 at 8:51 AM, Yu, Zhang <yu.c.zhang@linux.intel.com>
> > wrote:
> > > > Going forward, we probably will, at some point, need to implement a
> > > > parallel "p2t" structure to keep track of types -- and probably
> > > > will
> > > > whether end up implementing 4 separate write_dm types or not (for
> > > > the
> > > > reasons you describe).
> > > > 
> > > 
> > > Thank you, George. Could you please elaborate more about the idea of
> > > "p2t"?
> > 
> > So the p2m table is a partially-sparse structure designed to map pfns
> > to mfns.  At the bottom is a 64-bit entry that contains certain bits
> > specified by the hardware (mfn number, permissions, other features).
> > There are at the moment extra bits that aren't use, and in these bits
> > we store information about the pfn that Xen wants to know -- among
> > other things, the 'type' of the gpfn.
> > 
> > But as Andy pointed out, Intel are adding new features which take up
> > more of these bits; and at the same time, Xen has more features for
> > which using a p2m type is the most obvious solution.  So the idea
> > would be to have a separate structure, similar to the p2m table, but
> > wouldn't be used by the hardware -- it would only be used by Xen to
> > map pfn to type (or whatever other information it wanted about the
> > gpfn).  This does mean duplicating all the non-leaf nodes, as well as
> > doing two sparse-tree-walks rather  than just one when looking up gpfn
> > information.
> 
> FWIW ARM already has such a structure to support xenaccess, since ARM had
> far fewer available bits to start with.
> 
> It is not quite the same as above since it is only populated with pages for
> which xenaccess is in use rather than all pages (and is only consulted if
> we have reason to believe the page might be subject to xenaccess). FWIW it
> uses Xen's radix tree stuff.

A second FWIW in case it is useful, but Linux on 32-bit ARM (older ones
with 32-bit PTEs) allocates every PT page as a pair of pages, so it gets 32
software defined bits for every h/w visible pte at pte+4K.

If you wanted to avoid the order-1 allocations which that implies then I
suppose you could chain to the "pteext" page from the struct page of the
h/w visible page.

Ian.
Ian Jackson Feb. 4, 2016, 1:33 p.m. UTC | #60
Jan Beulich writes ("Re: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter max_wp_ram_ranges."):
> On 04.02.16 at 10:38, <yu.c.zhang@linux.intel.com> wrote:
> > So another question is, if value of this limit really matters, will a
> > lower one be more acceptable(the current 256 being not enough)?
> 
> If you've carefully read George's replies, [...]

Thanks to George for the very clear explanation, and also to him for
an illuminating in-person discussion.

It is disturbing that as a result of me as a tools maintainer asking
questions about what seems to me to be a troublesome user-visible
control setting in libxl, we are now apparently revisiting lower
layers of the hypervisor design, which have already been committed.

While I find George's line of argument convincing, neither I nor
George are maintainers of the relevant hypervisor code.  I am not
going to insist that anything in the hypervisor is done different and
am not trying to use my tools maintainer position to that end.

Clearly there has been a failure of our workflow to consider and
review everything properly together.  But given where we are now, I
think that this discussion about hypervisor internals is probably a
distraction.


Let me pose again some questions that I still don't have clear answers
to:

 * Is it possible for libxl to somehow tell from the rest of the
   configuration that this larger limit should be applied ?

   AFAICT there is nothing in libxl directly involving vgpu.  How can
   libxl be used to create a guest with vgpu enabled ?  I had thought
   that this was done merely with the existing PCI passthrough
   configuration, but it now seems that somehow a second device model
   would have to be started.  libxl doesn't have code to do that.

 * In the configurations where a larger number is needed, what larger
   limit is appropriate ?  How should it be calculated ?

   AFAICT from the discussion, 8192 is a reasonable bet.  Is everyone
   happy with it?

Ian.

PS: Earlier I asked:

 * How do we know that this does not itself give an opportunity for
   hypervisor resource exhaustion attacks by guests ?  (Note: if it
   _does_ give such an opportunity, this should be mentioned more
   clearly in the documentation.)

 * If we are talking about mmio ranges for ioreq servers, why do
   guests which do not use this feature have the ability to create
   them at all ?

I now understand that these mmio ranges are created by the device
model.  Of course the device model needs to be able to create mmio
ranges for the guest.  And since they consume hypervisor resources,
the number of these must be limited (device models not necessarily
being trusted).
Paul Durrant Feb. 4, 2016, 1:47 p.m. UTC | #61
> -----Original Message-----
> From: Ian Jackson [mailto:Ian.Jackson@eu.citrix.com]
> Sent: 04 February 2016 13:34
> To: Jan Beulich
> Cc: Zhang Yu; Andrew Cooper; George Dunlap; Ian Campbell; Paul Durrant;
> Stefano Stabellini; Wei Liu; Kevin Tian; zhiyuan.lv@intel.com; xen-
> devel@lists.xen.org; Keir (Xen.org)
> Subject: Re: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter
> max_wp_ram_ranges.
> 
> Jan Beulich writes ("Re: [Xen-devel] [PATCH v3 3/3] tools: introduce
> parameter max_wp_ram_ranges."):
> > On 04.02.16 at 10:38, <yu.c.zhang@linux.intel.com> wrote:
> > > So another question is, if value of this limit really matters, will a
> > > lower one be more acceptable(the current 256 being not enough)?
> >
> > If you've carefully read George's replies, [...]
> 
> Thanks to George for the very clear explanation, and also to him for
> an illuminating in-person discussion.
> 
> It is disturbing that as a result of me as a tools maintainer asking
> questions about what seems to me to be a troublesome a user-visible
> control setting in libxl, we are now apparently revisiting lower
> layers of the hypervisor design, which have already been committed.
> 
> While I find George's line of argument convincing, neither I nor
> George are maintainers of the relevant hypervisor code.  I am not
> going to insist that anything in the hypervisor is done different and
> am not trying to use my tools maintainer position to that end.
> 
> Clearly there has been a failure of our workflow to consider and
> review everything properly together.  But given where we are now, I
> think that this discussion about hypervisor internals is probably a
> distraction.
> 
> 
> Let pose again some questions that I still don't have clear answers
> to:
> 
>  * Is it possible for libxl to somehow tell from the rest of the
>    configuration that this larger limit should be applied ?
> 
>    AFAICT there is nothing in libxl directly involving vgpu.  How can
>    libxl be used to create a guest with vgpu enabled ?  I had thought
>    that this was done merely with the existing PCI passthrough
>    configuration, but it now seems that somehow a second device model
>    would have to be started.  libxl doesn't have code to do that.
> 

AIUI if the setting of the increased limit is tied to provisioning a gvt-g instance for a VM then I don't think there needs to be extra information in the VM config. That seems like the most sensible thing to do.

>  * In the configurations where a larger number is needed, what larger
>    limit is appropriate ?  How should it be calculated ?
> 
>    AFAICT from the discussion, 8192 is a reasonable bet.  Is everyone
>    happy with it.
> 
> Ian.
> 
> PS: Earlier I asked:
> 
>  * How do we know that this does not itself give an opportunity for
>    hypervisor resource exhaustion attacks by guests ?  (Note: if it
>    _does_ give such an opportunity, this should be mentioned more
>    clearly in the documentation.)
> 
>  * If we are talking about mmio ranges for ioreq servers, why do
>    guests which do not use this feature have the ability to create
>    them at all ?
> 
> I now understand that these mmio ranges are created by the device
> model.  Of course the device model needs to be able to create mmio
> ranges for the guest.  And since they consume hypervisor resources,
> the number of these must be limited (device models not necessarily
> being trusted).

...but I think there is still an open question as to whether the toolstack is allowed to set that limit for a VM or not. IMO the toolstack should be allowed to set that limit when creating a domain.

  Paul
Jan Beulich Feb. 4, 2016, 2:08 p.m. UTC | #62
>>> On 04.02.16 at 14:33, <Ian.Jackson@eu.citrix.com> wrote:
> Jan Beulich writes ("Re: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter 
> max_wp_ram_ranges."):
>> On 04.02.16 at 10:38, <yu.c.zhang@linux.intel.com> wrote:
>> > So another question is, if value of this limit really matters, will a
>> > lower one be more acceptable(the current 256 being not enough)?
>> 
>> If you've carefully read George's replies, [...]
> 
> Thanks to George for the very clear explanation, and also to him for
> an illuminating in-person discussion.
> 
> It is disturbing that as a result of me as a tools maintainer asking
> questions about what seems to me to be a troublesome a user-visible
> control setting in libxl, we are now apparently revisiting lower
> layers of the hypervisor design, which have already been committed.
> 
> While I find George's line of argument convincing, neither I nor
> George are maintainers of the relevant hypervisor code.  I am not
> going to insist that anything in the hypervisor is done different and
> am not trying to use my tools maintainer position to that end.
> 
> Clearly there has been a failure of our workflow to consider and
> review everything properly together.  But given where we are now, I
> think that this discussion about hypervisor internals is probably a
> distraction.

While I recall George having made that alternative suggestion,
both Yu and Paul having reservations against it made me not
insist on that alternative. Instead I've been trying to limit some
of the bad effects that the variant originally proposed brought
with it. Clearly, with the more detailed reply George has now
given (involving areas he is the maintainer for), I should
have been more demanding towards the exploration of that
alternative. That's clearly unfortunate, and I apologize for that,
but such things happen.

As to one of the patches already having been committed - I'm not
worried about that at all. We can always revert, that's why the
thing is called "unstable".

Everything below here I think was meant to be mainly directed
at Yu instead of me.

> Let pose again some questions that I still don't have clear answers
> to:
> 
>  * Is it possible for libxl to somehow tell from the rest of the
>    configuration that this larger limit should be applied ?
> 
>    AFAICT there is nothing in libxl directly involving vgpu.  How can
>    libxl be used to create a guest with vgpu enabled ?  I had thought
>    that this was done merely with the existing PCI passthrough
>    configuration, but it now seems that somehow a second device model
>    would have to be started.  libxl doesn't have code to do that.
> 
>  * In the configurations where a larger number is needed, what larger
>    limit is appropriate ?  How should it be calculated ?
> 
>    AFAICT from the discussion, 8192 is a reasonable bet.  Is everyone
>    happy with it.
> 
> Ian.
> 
> PS: Earlier I asked:
> 
>  * How do we know that this does not itself give an opportunity for
>    hypervisor resource exhaustion attacks by guests ?  (Note: if it
>    _does_ give such an opportunity, this should be mentioned more
>    clearly in the documentation.)
> 
>  * If we are talking about mmio ranges for ioreq servers, why do
>    guests which do not use this feature have the ability to create
>    them at all ?
> 
> I now understand that these mmio ranges are created by the device
> model.  Of course the device model needs to be able to create mmio
> ranges for the guest.  And since they consume hypervisor resources,
> the number of these must be limited (device models not necessarily
> being trusted).

Which is why I have been so hesitant to accept these patches.

Jan
Jan Beulich Feb. 4, 2016, 2:12 p.m. UTC | #63
>>> On 04.02.16 at 14:47, <Paul.Durrant@citrix.com> wrote:
>> From: Ian Jackson [mailto:Ian.Jackson@eu.citrix.com]
>> Sent: 04 February 2016 13:34
>>  * Is it possible for libxl to somehow tell from the rest of the
>>    configuration that this larger limit should be applied ?
>> 
>>    AFAICT there is nothing in libxl directly involving vgpu.  How can
>>    libxl be used to create a guest with vgpu enabled ?  I had thought
>>    that this was done merely with the existing PCI passthrough
>>    configuration, but it now seems that somehow a second device model
>>    would have to be started.  libxl doesn't have code to do that.
>> 
> 
> AIUI if the setting of the increased limit is tied to provisioning a gvt-g 
> instance for a VM then I don't there needs to be extra information in the VM 
> config. These seems like the most sensible thing to do.

I don't understand this: For one, it's still unclear to me on what basis
it would be known that a given VM is a "gvt-g instance". And even if
that's indeed derivable from something, the uncertainty about a
workable upper bound on the number of WP ranges would still seem
to demand the value to be specifiable separately...

>> I now understand that these mmio ranges are created by the device
>> model.  Of course the device model needs to be able to create mmio
>> ranges for the guest.  And since they consume hypervisor resources,
>> the number of these must be limited (device models not necessarily
>> being trusted).
> 
> ...but I think there is still an open question as to whether the toolstack 
> is allowed to set that limit for a VM or not. IMO the toolstack should be 
> allowed to set that limit when creating a domain.

... as you indicate here.

Jan
Paul Durrant Feb. 4, 2016, 2:25 p.m. UTC | #64
> -----Original Message-----
> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: 04 February 2016 14:13
> To: Paul Durrant
> Cc: Andrew Cooper; George Dunlap; Ian Campbell; Ian Jackson; Stefano
> Stabellini; Wei Liu; Kevin Tian; zhiyuan.lv@intel.com; Zhang Yu; xen-
> devel@lists.xen.org; Keir (Xen.org)
> Subject: RE: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter
> max_wp_ram_ranges.
> 
> >>> On 04.02.16 at 14:47, <Paul.Durrant@citrix.com> wrote:
> >> From: Ian Jackson [mailto:Ian.Jackson@eu.citrix.com]
> >> Sent: 04 February 2016 13:34
> >>  * Is it possible for libxl to somehow tell from the rest of the
> >>    configuration that this larger limit should be applied ?
> >>
> >>    AFAICT there is nothing in libxl directly involving vgpu.  How can
> >>    libxl be used to create a guest with vgpu enabled ?  I had thought
> >>    that this was done merely with the existing PCI passthrough
> >>    configuration, but it now seems that somehow a second device model
> >>    would have to be started.  libxl doesn't have code to do that.
> >>
> >
> > AIUI if the setting of the increased limit is tied to provisioning a gvt-g
> > instance for a VM then I don't there needs to be extra information in the
> VM
> > config. These seems like the most sensible thing to do.
> 
> I don't understand this: For one, it's still unclear to me on what basis
> it would be known that a given VM is a "gvt-g instance". And even if
> that's indeed derivable from something, the uncertainty about a
> workable upper bound on the number of WP ranges would still seem
> to demand the value to be specifiable separately...

There are patches in the XenGT xen repo which add extra parameters into the VM config to allow libxl to provision a gvt-g instance (of which there are a finite number per GPU) for a VM. The increased limit could be applied when doing so and it may be feasible to determine (maybe from the version of the GPU h/w) what a reasonable limit is.

  Paul

> 
> >> I now understand that these mmio ranges are created by the device
> >> model.  Of course the device model needs to be able to create mmio
> >> ranges for the guest.  And since they consume hypervisor resources,
> >> the number of these must be limited (device models not necessarily
> >> being trusted).
> >
> > ...but I think there is still an open question as to whether the toolstack
> > is allowed to set that limit for a VM or not. IMO the toolstack should be
> > allowed to set that limit when creating a domain.
> 
> ... as you indicate here.
> 
> Jan
Ian Jackson Feb. 4, 2016, 3:06 p.m. UTC | #65
Paul Durrant writes ("RE: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter max_wp_ram_ranges."):
> There are patches in the XenGT xen repo which add extra parameters
> into the VM config to allow libxl to provision a gvt-g instance (of
> which there are a finite number per GPU) for a VM. The increased
> limit could be applied when doing so and it may be feasible to
> determine (maybe from the version of the GPU h/w) what a reasonable
> limit is.

Right.  So in that case there would be no need for this user-visible
knob (along with its sadly not very helpful documentation) to be
exposed in the libxl stable API.

Ian.
Paul Durrant Feb. 4, 2016, 3:51 p.m. UTC | #66
> -----Original Message-----
> From: Ian Jackson [mailto:Ian.Jackson@eu.citrix.com]
> Sent: 04 February 2016 15:06
> To: Paul Durrant
> Cc: Jan Beulich; Andrew Cooper; George Dunlap; Ian Campbell; Stefano
> Stabellini; Wei Liu; Kevin Tian; zhiyuan.lv@intel.com; Zhang Yu; xen-
> devel@lists.xen.org; Keir (Xen.org)
> Subject: RE: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter
> max_wp_ram_ranges.
> 
> Paul Durrant writes ("RE: [Xen-devel] [PATCH v3 3/3] tools: introduce
> parameter max_wp_ram_ranges."):
> > There are patches in the XenGT xen repo which add extra parameters
> > into the VM config to allow libxl to provision a gvt-g instance (of
> > which there are a finite number per GPU) for a VM. The increased
> > limit could be applied when doing so and it may be feasible to
> > determine (maybe from the version of the GPU h/w) what a reasonable
> > limit is.
> 
> Right.  So in that case there would be no need for this user-visible
> knob (along with its sadly not very helpful documentation) to be
> exposed in the libxl stable API.
> 

No, I really don't think that should be necessary. Libxl merely needs to configure a limit in the hypervisor appropriate to the gvt-g instance that is provisioned.

  Paul

> Ian.
George Dunlap Feb. 4, 2016, 5:12 p.m. UTC | #67
On 04/02/16 14:08, Jan Beulich wrote:
>>>> On 04.02.16 at 14:33, <Ian.Jackson@eu.citrix.com> wrote:
>> Jan Beulich writes ("Re: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter 
>> max_wp_ram_ranges."):
>>> On 04.02.16 at 10:38, <yu.c.zhang@linux.intel.com> wrote:
>>>> So another question is, if value of this limit really matters, will a
>>>> lower one be more acceptable(the current 256 being not enough)?
>>>
>>> If you've carefully read George's replies, [...]
>>
>> Thanks to George for the very clear explanation, and also to him for
>> an illuminating in-person discussion.
>>
>> It is disturbing that as a result of me as a tools maintainer asking
>> questions about what seems to me to be a troublesome a user-visible
>> control setting in libxl, we are now apparently revisiting lower
>> layers of the hypervisor design, which have already been committed.
>>
>> While I find George's line of argument convincing, neither I nor
>> George are maintainers of the relevant hypervisor code.  I am not
>> going to insist that anything in the hypervisor is done different and
>> am not trying to use my tools maintainer position to that end.
>>
>> Clearly there has been a failure of our workflow to consider and
>> review everything properly together.  But given where we are now, I
>> think that this discussion about hypervisor internals is probably a
>> distraction.
> 
> While I recall George having made that alternative suggestion,
> both Yu and Paul having reservations against it made me not
> insist on that alternative. Instead I've been trying to limit some
> of the bad effects that the variant originally proposed brought
> with it. Clearly, with the more detailed reply George has now
> given (involving areas where he is the maintainer for), I should
> have been more demanding towards the exploration of that
> alternative. That's clearly unfortunate, and I apologize for that,
> but such things happen.
> 
> As to one of the patches already having for committed - I'm not
> worried about that at all. We can always revert, that's why the
> thing is called "unstable".

It looks like I should have been more careful to catch up on the current
state of things before I started arguing again -- please accept my
apologies.

I see that patch 2/3 addresses the gpfn/io question in the commit
message by saying, "Previously, a new hypercall or subop was suggested
to map write-protected pages into ioreq server. However, it turned out
handler of this new hypercall would be almost the same with the existing
pair - HVMOP_[un]map_io_range_to_ioreq_server, and there's already a
type parameter in this hypercall. So no new hypercall defined, only a
new type is introduced."

And I see that 2/3 internally separates the WP_RAM type into a separate
rangeset, whose size can be adjusted separately.

This addresses my complaint about the interface using gpfns rather than
MMIO ranges (somewhat anyway).  Sorry for not
acknowledging this at first.

The question of the internal implementation -- whether to use RB tree
rangesets, or radix trees (as apparently ARM memaccess does) or p2m
types -- is an internal implementation question.  I think p2m types is
long-term the best way to go, but it won't hurt to have the current
implementation checked in, as long as it doesn't have any impact on the
stable interface.

At the moment, as far as I can tell, there's no way for libxl to even
run a version of qemu with XenGT enabled, so there's no real need for
libxl to be involved.

The purpose of having the limit would putatively be to prevent a guest
from being able to trigger an exhaustion of hypervisor memory by inducing the
device model to mark an arbitrary number of ranges as mmio_dm.

Two angles on this.

First, assuming that limiting the number of ranges is what we want:  I'm
not really a fan of using HVM_PARAMs for this, but as long as it's not
considered a public interface (i.e., it could go away or disappear and
everything would Just Work), then I wouldn't object.

Although I would ask: would it instead be suitable for now to just set
the default limit for WP_RAM to 8192 in the hypervisor, since we do
expect it to be tracking gpfn ranges rather than IO regions?  And if we
determine in the future that more ranges are necessary, to then do the
work of moving it to using p2m types (or exposing a knob to adjust it)?

But (and this is the other angle): is simply imposing a numerical limit
sufficient to avoid memory exhaustion? Is there a danger that after
creating several guests, such that Xen was now running very low on
memory, that a guest would (purposely or not) cause memory to be
exhausted sometime further after boot, causing a system-wide DoS (or
just general lack of stability)?

In the shadow / hap memory case, the memory is pre-allocated up front,
which makes sure that nothing a guest does can cause Xen to run out of
memory once it's booted.  Without pre-allocating it, it's still possible
that the admin might start up enough VMs that exhaustion is *possible*,
but won't be triggered until later (when the guest starts using more GTTs).

Although in fact this really points to the need for a general overhaul
of how memory allocation on behalf of a domain is handled;
that's a bigger chunk of work.

But in any case, it seems to me that we can entirely avoid the question
of how many ranges might ever be necessary by starting with a fixed
limit in the hypervisor, and then moving to a p2m-type based
implementation if and when that becomes unsatisfactory.
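
To be concrete, by "a fixed limit in the hypervisor" I mean no more than
something along these lines in hvm_ioreq_server_alloc_rangesets() (a sketch
only; DEFAULT_NR_WP_RAM_RANGES is a made-up name), with no HVM_PARAM plumbed
through from the toolstack:

/* Sketch only -- DEFAULT_NR_WP_RAM_RANGES is a made-up name;       */
/* MAX_NR_IO_RANGES is the existing 256 default.                    */
#define DEFAULT_NR_WP_RAM_RANGES 8192

static unsigned int ioreq_range_limit(unsigned int type)
{
    return type == HVMOP_IO_RANGE_WP_MEM ? DEFAULT_NR_WP_RAM_RANGES
                                         : MAX_NR_IO_RANGES;
}

/* ... and in hvm_ioreq_server_alloc_rangesets():                   */
/*         rangeset_limit(s->range[i], ioreq_range_limit(i));       */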

 -George
Zhiyuan Lv Feb. 5, 2016, 2:01 a.m. UTC | #68
Hi George,

On Thu, Feb 04, 2016 at 11:06:33AM +0000, George Dunlap wrote:
> On Thu, Feb 4, 2016 at 9:38 AM, Yu, Zhang <yu.c.zhang@linux.intel.com> wrote:
> > On 2/4/2016 5:28 PM, Paul Durrant wrote:
> >> I assume this means that the emulator can 'unshadow' GTTs (I guess on an
> >> LRU basis) so that it can shadow new ones when the limit has been exhausted?
> >> If so, how bad is performance likely to be if we live with a lower limit
> >> and take the hit of unshadowing if the guest GTTs become heavily fragmented?
> >>
> > Thank you, Paul.
> >
> > Well, I was told the emulator have approaches to delay the shadowing of
> > the GTT till future GPU commands are submitted. By now, I'm not sure
> > about the performance penalties if the limit is set too low. Although
> > we are confident 8K is a secure limit, it seems still too high to be
> > accepted. We will perform more experiments with this new approach to
> > find a balance between the lowest limit and the XenGT performance.
> 
> Just to check some of my assumptions:
> 
> I assume that unlike memory accesses, your GPU hardware cannot
> 'recover' from faults in the GTTs. That is, for memory, you can take a
> page fault, fix up the pagetables, and then re-execute the original
> instruction; but so far I haven't heard of any devices being able to
> seamlessly re-execute a transaction after a fault.  Is my
> understanding correct?

That's correct. At least for now, we do not have GPU page faults.

> 
> If that is the case, then for every top-level value (whatever the
> equivalent of the CR3), you need to be able to shadow the entire GTT
> tree below it, yes?  You can't use a trick that the memory shadow

That's right also.

> pagetables can use, of unshadowing parts of the tree and reshadowing
> them.
> 
> So as long as the currently-in-use GTT tree contains no more than
> $LIMIT ranges, you can unshadow and reshadow; this will be slow, but
> strictly speaking correct.
> 
> What do you do if the guest driver switches to a GTT such that the
> entire tree takes up more than $LIMIT entries?

The GPU has some special properties, different from the CPU, which make
things easier. GPU page tables are constructed by the CPU and used by GPU
workloads; a GPU workload itself will not change the page tables.
Meanwhile, GPU workload submission, in a virtualized environment, is
controlled by our device model. So we can reshadow the whole table
every time before we submit a workload. That can reduce the total number
of ranges required for write protection, but with a performance impact,
because the GPU has to spend more time idle, waiting for the CPU. Hope
the info helps.
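
In pseudo-code the trade-off is roughly the below (purely illustrative --
none of these names exist in the real XenGT device model):

/* Purely illustrative -- not real XenGT device model code. */
static int submit_workload(struct vgpu *v, struct workload *wl)
{
    /*
     * The guest cannot modify its GTTs behind our back at this point:
     * workload submission is fully mediated by the device model.
     */
    if ( shadow_ranges_in_use(v) > wp_range_limit(v) )
        drop_all_shadows(v);       /* release stale write-protections     */

    reshadow_tree(v, wl->root);    /* shadow only what this workload uses */

    return hw_submit(v, wl);       /* the GPU idles a bit longer for this */
}
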
Thanks!

Regards,
-Zhiyuan

> 
>  -George
Tian, Kevin Feb. 5, 2016, 3:31 a.m. UTC | #69
> From: Ian Campbell [mailto:ian.campbell@citrix.com]
> Sent: Thursday, February 04, 2016 6:06 PM
> 
> On Wed, 2016-02-03 at 17:41 +0000, George Dunlap wrote:
> > But of course, since they they aren't actually ranges but just gpfns,
> > they're scattered randomly throughout the guest physical address
> > space.
> 
> (Possibly) stupid question:
> 
> Since, AIUI, the in-guest GPU driver is XenGT aware could it not allocate a
> contiguous range of pages at start of day to use as GPU PTs? Or even just N
> contiguous regions, e.g. i think the "8K" refers to pages, which is 16 2M
> allocations, which is a far more manageable number of ranges to track than
> 8096 individual pages.
> 

We add XenGT awareness to the guest driver at the minimum required level
(e.g. to handle address space ballooning due to graphics memory partitioning),
which only impacts hardware-specific initialization code (think of a vgpu as
a new SKU).

However, we're hesitant to touch other general driver code, such as the
allocation policy you mentioned earlier. Changing that part specifically
for one SKU needs a very strong reason to convince the driver maintainer.

And we cannot assume the above allocation policy can always be met.

Thanks
Kevin
Tian, Kevin Feb. 5, 2016, 3:35 a.m. UTC | #70
> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Thursday, February 04, 2016 10:13 PM
> 
> >>> On 04.02.16 at 14:47, <Paul.Durrant@citrix.com> wrote:
> >> From: Ian Jackson [mailto:Ian.Jackson@eu.citrix.com]
> >> Sent: 04 February 2016 13:34
> >>  * Is it possible for libxl to somehow tell from the rest of the
> >>    configuration that this larger limit should be applied ?
> >>
> >>    AFAICT there is nothing in libxl directly involving vgpu.  How can
> >>    libxl be used to create a guest with vgpu enabled ?  I had thought
> >>    that this was done merely with the existing PCI passthrough
> >>    configuration, but it now seems that somehow a second device model
> >>    would have to be started.  libxl doesn't have code to do that.
> >>
> >
> > AIUI if the setting of the increased limit is tied to provisioning a gvt-g
> > instance for a VM then I don't there needs to be extra information in the VM
> > config. These seems like the most sensible thing to do.
> 
> I don't understand this: For one, it's still unclear to me on what basis
> it would be known that a given VM is a "gvt-g instance". And even if
> that's indeed derivable from something, the uncertainty about a
> workable upper bound on the number of WP ranges would still seem
> to demand the value to be specifiable separately...
> 

We'll introduce a different parameter, e.g. gvt-g, separate from the existing
passthrough one. So, just for this question, the toolstack can know whether a
VM is provisioned with a vgpu based on that parameter in the config file.

Thanks
Kevin
Tian, Kevin Feb. 5, 2016, 3:44 a.m. UTC | #71
> From: Lv, Zhiyuan
> Sent: Friday, February 05, 2016 10:01 AM
> 
> Hi George,
> 
> On Thu, Feb 04, 2016 at 11:06:33AM +0000, George Dunlap wrote:
> > On Thu, Feb 4, 2016 at 9:38 AM, Yu, Zhang <yu.c.zhang@linux.intel.com> wrote:
> > > On 2/4/2016 5:28 PM, Paul Durrant wrote:
> > >> I assume this means that the emulator can 'unshadow' GTTs (I guess on an
> > >> LRU basis) so that it can shadow new ones when the limit has been exhausted?
> > >> If so, how bad is performance likely to be if we live with a lower limit
> > >> and take the hit of unshadowing if the guest GTTs become heavily fragmented?
> > >>
> > > Thank you, Paul.
> > >
> > > Well, I was told the emulator have approaches to delay the shadowing of
> > > the GTT till future GPU commands are submitted. By now, I'm not sure
> > > about the performance penalties if the limit is set too low. Although
> > > we are confident 8K is a secure limit, it seems still too high to be
> > > accepted. We will perform more experiments with this new approach to
> > > find a balance between the lowest limit and the XenGT performance.
> >
> > Just to check some of my assumptions:
> >
> > I assume that unlike memory accesses, your GPU hardware cannot
> > 'recover' from faults in the GTTs. That is, for memory, you can take a
> > page fault, fix up the pagetables, and then re-execute the original
> > instruction; but so far I haven't heard of any devices being able to
> > seamlessly re-execute a transaction after a fault.  Is my
> > understanding correct?
> 
> That's correct. At least for now, we do not have GPU page fault.
> 
> >
> > If that is the case, then for every top-level value (whatever the
> > equivalent of the CR3), you need to be able to shadow the entire GTT
> > tree below it, yes?  You can't use a trick that the memory shadow
> 
> That's right also.
> 
> > pagetables can use, of unshadowing parts of the tree and reshadowing
> > them.
> >
> > So as long as the currently-in-use GTT tree contains no more than
> > $LIMIT ranges, you can unshadow and reshadow; this will be slow, but
> > strictly speaking correct.
> >
> > What do you do if the guest driver switches to a GTT such that the
> > entire tree takes up more than $LIMIT entries?
> 
> GPU has some special properties different from CPU, which make things
> easier. The GPU page table is constructed by CPU and used by GPU
> workloads. GPU workload itself will not change the page table.
> Meanwhile, GPU workload submission, in virtualized environment, is
> controled by our device model. Then we can reshadow the whole table
> every time before we submit workload. That can reduce the total number
> required for write protection but with performance impact, because GPU
> has to have more idle time waiting for CPU. Hope the info helps.
> Thanks!
> 

Putting it another way, when a GPU page table (GTT) will be referenced by
the GPU is fully under our mediation, so there's plenty of room to optimize
the existing shadowing policy (which always shadows all recognized GPU page
tables), e.g. by shadowing only the active one when a VM is scheduled in.
It's a performance matter, not a correctness issue.

This is why Yu asked earlier whether we can just set a default
limit which is good for the majority of use cases, while extending our
device model to drop/recreate some shadow tables when the limit
is hit. I think this matches how today's CPU shadow page tables are
implemented, which also have a limit on how many shadow pages
are allowed per-VM.

Thanks
Kevin
Tian, Kevin Feb. 5, 2016, 3:47 a.m. UTC | #72
> From: Paul Durrant [mailto:Paul.Durrant@citrix.com]
> Sent: Thursday, February 04, 2016 11:51 PM
> 
> > -----Original Message-----
> > From: Ian Jackson [mailto:Ian.Jackson@eu.citrix.com]
> > Sent: 04 February 2016 15:06
> > To: Paul Durrant
> > Cc: Jan Beulich; Andrew Cooper; George Dunlap; Ian Campbell; Stefano
> > Stabellini; Wei Liu; Kevin Tian; zhiyuan.lv@intel.com; Zhang Yu; xen-
> > devel@lists.xen.org; Keir (Xen.org)
> > Subject: RE: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter
> > max_wp_ram_ranges.
> >
> > Paul Durrant writes ("RE: [Xen-devel] [PATCH v3 3/3] tools: introduce
> > parameter max_wp_ram_ranges."):
> > > There are patches in the XenGT xen repo which add extra parameters
> > > into the VM config to allow libxl to provision a gvt-g instance (of
> > > which there are a finite number per GPU) for a VM. The increased
> > > limit could be applied when doing so and it may be feasible to
> > > determine (maybe from the version of the GPU h/w) what a reasonable
> > > limit is.
> >
> > Right.  So in that case there would be no need for this user-visible
> > knob (along with its sadly not very helpful documentation) to be
> > exposed in the libxl stable API.
> >
> 
> No, I really don't think that should be necessary. Libxl merely needs to configure a limit in
> the hypervisor appropriate to the gvt-g instance that is provisioned.
> 

Agree.

Thanks
Kevin
Tian, Kevin Feb. 5, 2016, 4:18 a.m. UTC | #73
> From: George Dunlap [mailto:george.dunlap@citrix.com]
> Sent: Friday, February 05, 2016 1:12 AM
> 
> On 04/02/16 14:08, Jan Beulich wrote:
> >>>> On 04.02.16 at 14:33, <Ian.Jackson@eu.citrix.com> wrote:
> >> Jan Beulich writes ("Re: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter
> >> max_wp_ram_ranges."):
> >>> On 04.02.16 at 10:38, <yu.c.zhang@linux.intel.com> wrote:
> >>>> So another question is, if value of this limit really matters, will a
> >>>> lower one be more acceptable(the current 256 being not enough)?
> >>>
> >>> If you've carefully read George's replies, [...]
> >>
> >> Thanks to George for the very clear explanation, and also to him for
> >> an illuminating in-person discussion.
> >>
> >> It is disturbing that as a result of me as a tools maintainer asking
> >> questions about what seems to me to be a troublesome a user-visible
> >> control setting in libxl, we are now apparently revisiting lower
> >> layers of the hypervisor design, which have already been committed.
> >>
> >> While I find George's line of argument convincing, neither I nor
> >> George are maintainers of the relevant hypervisor code.  I am not
> >> going to insist that anything in the hypervisor is done different and
> >> am not trying to use my tools maintainer position to that end.
> >>
> >> Clearly there has been a failure of our workflow to consider and
> >> review everything properly together.  But given where we are now, I
> >> think that this discussion about hypervisor internals is probably a
> >> distraction.
> >
> > While I recall George having made that alternative suggestion,
> > both Yu and Paul having reservations against it made me not
> > insist on that alternative. Instead I've been trying to limit some
> > of the bad effects that the variant originally proposed brought
> > with it. Clearly, with the more detailed reply George has now
> > given (involving areas where he is the maintainer for), I should
> > have been more demanding towards the exploration of that
> > alternative. That's clearly unfortunate, and I apologize for that,
> > but such things happen.
> >
> > As to one of the patches already having for committed - I'm not
> > worried about that at all. We can always revert, that's why the
> > thing is called "unstable".
> 
> It looks like I should have been more careful to catch up on the current
> state of things before I started arguing again -- please accept my
> apologies.

Thanks George for your careful thinking.

> 
> I see that patch 2/3 addresses the gpfn/io question in the commit
> message by saying, "Previously, a new hypercall or subop was suggested
> to map write-protected pages into ioreq server. However, it turned out
> handler of this new hypercall would be almost the same with the existing
> pair - HVMOP_[un]map_io_range_to_ioreq_server, and there's already a
> type parameter in this hypercall. So no new hypercall defined, only a
> new type is introduced."
> 
> And I see that 2/3 internally separates the WP_RAM type into a separate
> rangeset, whose size can be adjusted separately.
> 
> This addresses my complaint about the interface using gpfns rather than
> MMIO ranges as an interface (somewhat anyway).  Sorry for not
> acknowledging this at first.
> 
> The question of the internal implementation -- whether to use RB tree
> rangesets, or radix trees (as apparently ARM memaccess does) or p2m
> types -- is an internal implementation question.  I think p2m types is
> long-term the best way to go, but it won't hurt to have the current
> implementation checked in, as long as it doesn't have any impacts on the
> stable interface.

I'm still trying to understand your suggestion vs. this one. Today we
already have a p2m_mmio_write_dm type, and any write fault hitting that
type will be delivered to an ioreq server. The next open question is how
an ioreq server could know whether it should handle such a request or
not, which is why some tracking structure (either RB tree or radix tree)
is created to maintain that specific information. This is under the
assumption that multiple ioreq servers co-exist, so a check looping over
all ioreq servers is required to identify the right target. And multiple
ioreq servers are a real case in XenGT, because our vGPU device model is
in the kernel, as part of the Intel i915 graphics driver. So at least two
ioreq servers already exist, with one routing to XenGT in Dom0 kernel
space and the other to the default Qemu in Dom0 user space.
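
I.e. the selection today looks roughly like the below (simplified from
hvm_select_ioreq_server(); the other range types and locking are omitted):

/* Simplified -- see hvm_select_ioreq_server() for the real thing. */
static struct hvm_ioreq_server *select_server(struct domain *d,
                                              unsigned int type,
                                              unsigned long addr,
                                              unsigned long end)
{
    struct hvm_ioreq_server *s;

    list_for_each_entry ( s, &d->arch.hvm_domain.ioreq_server.list,
                          list_entry )
    {
        if ( s == d->arch.hvm_domain.default_ioreq_server )
            continue;

        if ( rangeset_contains_range(s->range[type], addr, end) )
            return s;          /* this server registered the gfn range */
    }

    return d->arch.hvm_domain.default_ioreq_server;
}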

In your long-term approach with p2m types, it looks like you are proposing
encoding the ioreq server ID in the p2m type directly (e.g. 4 bits), which
then eliminates the need for tracking on the ioreq server side, so the whole
security concern is gone. And there is no limitation at all. But because the
available p2m bits are limited, as Andrew pointed out, it might only be
reasonable to implement this approach once a new p2t structure is added,
which is why we consider it a long-term approach.

Please correct me if the above understanding is wrong.

> 
> At the moment, as far as I can tell, there's no way for libxl to even
> run a version of qemu with XenGT enabled, so there's no real need for
> libxl to be involved.

No way yet, because we haven't upstreamed all the toolstack changes, but
we should still discuss the requirement as we've been doing in this
thread, right? Or do you mean something different?

> 
> The purpose of having the limit would putatively be to prevent a guest
> being able to trigger an exhaustion of hypervisor memory by inducing the
> device model to mark an arbitrary number of ranges as mmio_dm.
> 
> Two angles on this.
> 
> First, assuming that limiting the number of ranges is what we want:  I'm
> not really a fan of using HVM_PARAMs for this, but as long as it's not
> considered a public interface (i.e., it could go away or disappear and
> everything would Just Work), then I wouldn't object.
> 
> Although I would ask: would it instead be suitable for now to just set
> the default limit for WP_RAM to 8196 in the hypervisor, since we do
> expect it to be tracking gpfn ranges rather than IO regions?  And if we
> determine in the future that more ranges are necessary, to then do the
> work of moving it to using p2m types (or exposing a knob to adjust it)?
> 
> But (and this the other angle): is simply marking a numerical limit
> sufficient to avoid memory exhaustion? Is there a danger that after
> creating several guests, such that Xen was now running very low on
> memory, that a guest would (purposely or not) cause memory to be
> exhausted sometime further after boot, causing a system-wide DoS (or
> just general lack of stability)?
> 
> In the shadow / hap memory case, the memory is pre-allocated up front,
> which makes sure that nothing a guest does can cause Xen to run out of
> memory once it's booted.  Without pre-allocating it, it's still possible
> that the admin might start up enough VMs that exhaustion is *possible*,
> but won't be triggered until later (when the guest starts using more GTTs).

That's a valid concern, though I believe it's not cleanly addressed in many places. :-)

> 
> Although in fact this really points to the need for a general overhaul
> in how memory allocation on behalf of a domain is handled in general;
> that's a bigger chunk of work.
> 
> But in any case, it seems to me that we can entirely avoid the question
> of how many ranges might ever be necessary by starting with a fixed
> limit in the hypervisor, and then moving to a p2m-type based
> implementation if and when that becomes unsatisfactory.
> 

I agree with this approach. Let's see whether we can get consensus from
the others to reach a conclusion.

Actually, another quick thought. Please help check whether it makes sense
or is just more tricky.

Could we have an intermediate option toward the p2m-type based option, by
assuming only one ioreq server can handle write_dm related faults? If that's
the case:

- we don't need to add more bits to the p2m type;
- XenGT can register as the default ioreq server for write_dm;
- when a write_dm fault is triggered, we loop over all ioreq servers to find
the default one and then deliver through that path without further checks
(a rough sketch follows below);
- we may not need to change the ioreq server interface at all. Assume the
ioreq server which receives the 1st write-protection request from the device
model is the default one.

It is not very clean, but the code change is minimal. Also it's easily
extensible to support multiple ioreq servers with write_dm requirements in
the future.
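
A rough sketch of that dispatch (the wants_write_dm flag is a made-up name,
and locking is omitted):

/* Rough sketch only -- the wants_write_dm flag is a made-up name. */
static struct hvm_ioreq_server *select_write_dm_server(struct domain *d)
{
    struct hvm_ioreq_server *s;

    /*
     * At most one server has asked to handle p2m_mmio_write_dm faults,
     * e.g. the first one to register a write-protected page.
     */
    list_for_each_entry ( s, &d->arch.hvm_domain.ioreq_server.list,
                          list_entry )
        if ( s->wants_write_dm )
            return s;          /* deliver without any per-gfn range check */

    return d->arch.hvm_domain.default_ioreq_server;
}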

But I haven't thought through all the corner cases carefully. Comments welcome. :-)

Thanks
Kevin
Jan Beulich Feb. 5, 2016, 8:32 a.m. UTC | #74
>>> On 04.02.16 at 18:12, <george.dunlap@citrix.com> wrote:
> Two angles on this.
> 
> First, assuming that limiting the number of ranges is what we want:  I'm
> not really a fan of using HVM_PARAMs for this, but as long as it's not
> considered a public interface (i.e., it could go away or disappear and
> everything would Just Work), then I wouldn't object.

The parameter setting interface generally is guest exposed; it's just
that many parameters can't be set by guests for themselves. Of
course the new #define could be put inside a Xen/tools conditional.

> Although I would ask: would it instead be suitable for now to just set
> the default limit for WP_RAM to 8196 in the hypervisor, since we do
> expect it to be tracking gpfn ranges rather than IO regions?  And if we
> determine in the future that more ranges are necessary, to then do the
> work of moving it to using p2m types (or exposing a knob to adjust it)?

That's what collides with disaggregation (as pointed out before):
Any tool stack component could then create wp-ram pages in
guests it controls, no matter whether those guests actually have
a need for such. I continue to think that if we indeed go this
route, then the host admin should be required to explicitly state
that the risks are acceptable (by allowing only certain guests to
actually create [many] such pages).

> But (and this the other angle): is simply marking a numerical limit
> sufficient to avoid memory exhaustion? Is there a danger that after
> creating several guests, such that Xen was now running very low on
> memory, that a guest would (purposely or not) cause memory to be
> exhausted sometime further after boot, causing a system-wide DoS (or
> just general lack of stability)?

The guest itself can't, but tool stack components other than fully
privileged ones could, and that's still something we need to avoid.

> In the shadow / hap memory case, the memory is pre-allocated up front,
> which makes sure that nothing a guest does can cause Xen to run out of
> memory once it's booted.  Without pre-allocating it, it's still possible
> that the admin might start up enough VMs that exhaustion is *possible*,
> but won't be triggered until later (when the guest starts using more GTTs).
> 
> Although in fact this really points to the need for a general overhaul
> in how memory allocation on behalf of a domain is handled in general;
> that's a bigger chunk of work.

Well, right now there's very little allocation happening at run time
afaict, i.e. having a couple of pages available will generally keep
things going smoothly. (Once upon a time there were - iirc - no
active runtime allocations at all, i.e. such allocations occurred only
in the context of things like domctls, where failure could be dealt
with by ballooning a few pages out of some domain and retrying.
I'm afraid that's not the case anymore nowadays, and even if it
were, it would collide with disaggregation, as a tool stack component
legitimately issuing a domctl may not have the privileges to
balloon anything other than the domain it controls, in particular not Dom0.)

> But in any case, it seems to me that we can entirely avoid the question
> of how many ranges might ever be necessary by starting with a fixed
> limit in the hypervisor, and then moving to a p2m-type based
> implementation if and when that becomes unsatisfactory.

With all of the above I think we should rather explore the p2m-type
based approach, in particular the suggestion Kevin has made to
direct all p2m_mmio_write_dm (a name which appears to have been
badly chosen, considering that we're talking about RAM pages
here) write accesses to the default ioreq server (utilizing that
upstream qemu doesn't itself register as such).

Jan
Jan Beulich Feb. 5, 2016, 8:38 a.m. UTC | #75
>>> On 05.02.16 at 04:44, <kevin.tian@intel.com> wrote:
> This is why Yu mentioned earlier whether we can just set a default
> limit which is good for majority of use cases, while extending our
> device mode to drop/recreate some shadow tables upon the limitation
> is hit. I think this matches how today's CPU shadow page table is
> implemented, which also has a limitation of how many shadow pages
> are allowed per-VM.

Except that un-shadowing some victim page in order to shadow another
one and make forward progress can be done there in the page fault
handler, i.e. we don't need to determine up front the set of pages to
be shadowed for a certain operation to complete.

Jan
Yu Zhang Feb. 5, 2016, 8:40 a.m. UTC | #76
On 2/4/2016 7:06 PM, George Dunlap wrote:
> On Thu, Feb 4, 2016 at 9:38 AM, Yu, Zhang <yu.c.zhang@linux.intel.com> wrote:
>> On 2/4/2016 5:28 PM, Paul Durrant wrote:
>>> I assume this means that the emulator can 'unshadow' GTTs (I guess on an
>>> LRU basis) so that it can shadow new ones when the limit has been exhausted?
>>> If so, how bad is performance likely to be if we live with a lower limit
>>> and take the hit of unshadowing if the guest GTTs become heavily fragmented?
>>>
>> Thank you, Paul.
>>
>> Well, I was told the emulator have approaches to delay the shadowing of
>> the GTT till future GPU commands are submitted. By now, I'm not sure
>> about the performance penalties if the limit is set too low. Although
>> we are confident 8K is a secure limit, it seems still too high to be
>> accepted. We will perform more experiments with this new approach to
>> find a balance between the lowest limit and the XenGT performance.
>
> Just to check some of my assumptions:
>
> I assume that unlike memory accesses, your GPU hardware cannot
> 'recover' from faults in the GTTs. That is, for memory, you can take a
> page fault, fix up the pagetables, and then re-execute the original
> instruction; but so far I haven't heard of any devices being able to
> seamlessly re-execute a transaction after a fault.  Is my
> understanding correct?
>

Yes

> If that is the case, then for every top-level value (whatever the
> equivalent of the CR3), you need to be able to shadow the entire GTT
> tree below it, yes?  You can't use a trick that the memory shadow
> pagetables can use, of unshadowing parts of the tree and reshadowing
> them.
>
> So as long as the currently-in-use GTT tree contains no more than
> $LIMIT ranges, you can unshadow and reshadow; this will be slow, but
> strictly speaking correct.
>
> What do you do if the guest driver switches to a GTT such that the
> entire tree takes up more than $LIMIT entries?
>

Good question. IIUC, in memory virtualization, besides write-protecting
the guest page tables, we can also track updates to them when cr3 is
written or when a TLB flush occurs. We can consider optimizing our GPU
device model to achieve a similar goal, e.g. tracking when a root
pointer (like cr3) to the page table is written and when a set of
commands is submitted (both situations are triggered by MMIO
operations). But considering performance, we will probably still need
to write-protect all the page tables when they are first created. It
requires a lot of optimization work on the device model side to find a
balance between a minimal set of wp-ed gpfns and reasonable
performance. We'd like to have a try. :)

Yu
Yu Zhang Feb. 5, 2016, 8:41 a.m. UTC | #77
On 2/5/2016 1:12 AM, George Dunlap wrote:
> On 04/02/16 14:08, Jan Beulich wrote:
>>>>> On 04.02.16 at 14:33, <Ian.Jackson@eu.citrix.com> wrote:
>>> Jan Beulich writes ("Re: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter
>>> max_wp_ram_ranges."):
>>>> On 04.02.16 at 10:38, <yu.c.zhang@linux.intel.com> wrote:
>>>>> So another question is, if value of this limit really matters, will a
>>>>> lower one be more acceptable(the current 256 being not enough)?
>>>>
>>>> If you've carefully read George's replies, [...]
>>>
>>> Thanks to George for the very clear explanation, and also to him for
>>> an illuminating in-person discussion.
>>>
>>> It is disturbing that as a result of me as a tools maintainer asking
>>> questions about what seems to me to be a troublesome a user-visible
>>> control setting in libxl, we are now apparently revisiting lower
>>> layers of the hypervisor design, which have already been committed.
>>>
>>> While I find George's line of argument convincing, neither I nor
>>> George are maintainers of the relevant hypervisor code.  I am not
>>> going to insist that anything in the hypervisor is done different and
>>> am not trying to use my tools maintainer position to that end.
>>>
>>> Clearly there has been a failure of our workflow to consider and
>>> review everything properly together.  But given where we are now, I
>>> think that this discussion about hypervisor internals is probably a
>>> distraction.
>>
>> While I recall George having made that alternative suggestion,
>> both Yu and Paul having reservations against it made me not
>> insist on that alternative. Instead I've been trying to limit some
>> of the bad effects that the variant originally proposed brought
>> with it. Clearly, with the more detailed reply George has now
>> given (involving areas where he is the maintainer for), I should
>> have been more demanding towards the exploration of that
>> alternative. That's clearly unfortunate, and I apologize for that,
>> but such things happen.
>>
>> As to one of the patches already having been committed - I'm not
>> worried about that at all. We can always revert, that's why the
>> thing is called "unstable".
>
> It looks like I should have been more careful to catch up on the current
> state of things before I started arguing again -- please accept my
> apologies.
>

In fact, I need to say thank you all for your patience and suggestions.
I'm thrilled to see XenGT is receiving so much attention. :)

> I see that patch 2/3 addresses the gpfn/io question in the commit
> message by saying, "Previously, a new hypercall or subop was suggested
> to map write-protected pages into ioreq server. However, it turned out
> handler of this new hypercall would be almost the same with the existing
> pair - HVMOP_[un]map_io_range_to_ioreq_server, and there's already a
> type parameter in this hypercall. So no new hypercall defined, only a
> new type is introduced."
>
> And I see that 2/3 internally separates the WP_RAM type into a separate
> rangeset, whose size can be adjusted separately.
>
> This addresses my complaint about the interface using gpfns rather than
> MMIO ranges as an interface (somewhat anyway).  Sorry for not
> acknowledging this at first.
>
> The question of the internal implementation -- whether to use RB tree
> rangesets, or radix trees (as apparently ARM memaccess does) or p2m
> types -- is an internal implementation question.  I think p2m types is
> long-term the best way to go, but it won't hurt to have the current
> implementation checked in, as long as it doesn't have any impacts on the
> stable interface.
>
> At the moment, as far as I can tell, there's no way for libxl to even
> run a version of qemu with XenGT enabled, so there's no real need for
> libxl to be involved.
>

I agree.

> The purpose of having the limit would putatively be to prevent a guest
> being able to trigger an exhaustion of hypervisor memory by inducing the
> device model to mark an arbitrary number of ranges as mmio_dm.
>
> Two angles on this.
>
> First, assuming that limiting the number of ranges is what we want:  I'm
> not really a fan of using HVM_PARAMs for this, but as long as it's not
> considered a public interface (i.e., it could go away or disappear and
> everything would Just Work), then I wouldn't object.
>
> Although I would ask: would it instead be suitable for now to just set
> the default limit for WP_RAM to 8196 in the hypervisor, since we do
> expect it to be tracking gpfn ranges rather than IO regions?  And if we

That is what we suggested in v9. But Jan proposed that we leave this
option to the admin, and to some extent I can understand his concern.

> determine in the future that more ranges are necessary, to then do the
> work of moving it to using p2m types (or exposing a knob to adjust it)?
>
> But (and this the other angle): is simply marking a numerical limit
> sufficient to avoid memory exhaustion? Is there a danger that after
> creating several guests, such that Xen was now running very low on
> memory, that a guest would (purposely or not) cause memory to be
> exhausted sometime further after boot, causing a system-wide DoS (or
> just general lack of stability)?
>

This worry sounds reasonable. So from this point of view, I guess the
value of this limit does matter. Setting it to 256 will not cause any
trouble, but setting it to a huge value would make the limitation
useless. Previously, I had thought 8K would be acceptable both for XenGT
to run smoothly and for the Xen heap not to over-consume memory, but it
seems this value is not small enough to be acceptable. :)

Yu
Yu Zhang Feb. 5, 2016, 8:41 a.m. UTC | #78
On 2/5/2016 12:18 PM, Tian, Kevin wrote:
>> From: George Dunlap [mailto:george.dunlap@citrix.com]
>> Sent: Friday, February 05, 2016 1:12 AM
>>
>> On 04/02/16 14:08, Jan Beulich wrote:
>>>>>> On 04.02.16 at 14:33, <Ian.Jackson@eu.citrix.com> wrote:
>>>> Jan Beulich writes ("Re: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter
>>>> max_wp_ram_ranges."):
>>>>> On 04.02.16 at 10:38, <yu.c.zhang@linux.intel.com> wrote:
>>>>>> So another question is, if value of this limit really matters, will a
>>>>>> lower one be more acceptable(the current 256 being not enough)?
>>>>>
>>>>> If you've carefully read George's replies, [...]
>>>>
>>>> Thanks to George for the very clear explanation, and also to him for
>>>> an illuminating in-person discussion.
>>>>
>>>> It is disturbing that as a result of me as a tools maintainer asking
>>>> questions about what seems to me to be a troublesome a user-visible
>>>> control setting in libxl, we are now apparently revisiting lower
>>>> layers of the hypervisor design, which have already been committed.
>>>>
>>>> While I find George's line of argument convincing, neither I nor
>>>> George are maintainers of the relevant hypervisor code.  I am not
>>>> going to insist that anything in the hypervisor is done different and
>>>> am not trying to use my tools maintainer position to that end.
>>>>
>>>> Clearly there has been a failure of our workflow to consider and
>>>> review everything properly together.  But given where we are now, I
>>>> think that this discussion about hypervisor internals is probably a
>>>> distraction.
>>>
>>> While I recall George having made that alternative suggestion,
>>> both Yu and Paul having reservations against it made me not
>>> insist on that alternative. Instead I've been trying to limit some
>>> of the bad effects that the variant originally proposed brought
>>> with it. Clearly, with the more detailed reply George has now
>>> given (involving areas where he is the maintainer for), I should
>>> have been more demanding towards the exploration of that
>>> alternative. That's clearly unfortunate, and I apologize for that,
>>> but such things happen.
>>>
>>> As to one of the patches already having been committed - I'm not
>>> worried about that at all. We can always revert, that's why the
>>> thing is called "unstable".
>>
>> It looks like I should have been more careful to catch up on the current
>> state of things before I started arguing again -- please accept my
>> apologies.
>
> Thanks George for your careful thinking.
>
>>
>> I see that patch 2/3 addresses the gpfn/io question in the commit
>> message by saying, "Previously, a new hypercall or subop was suggested
>> to map write-protected pages into ioreq server. However, it turned out
>> handler of this new hypercall would be almost the same with the existing
>> pair - HVMOP_[un]map_io_range_to_ioreq_server, and there's already a
>> type parameter in this hypercall. So no new hypercall defined, only a
>> new type is introduced."
>>
>> And I see that 2/3 internally separates the WP_RAM type into a separate
>> rangeset, whose size can be adjusted separately.
>>
>> This addresses my complaint about the interface using gpfns rather than
>> MMIO ranges as an interface (somewhat anyway).  Sorry for not
>> acknowledging this at first.
>>
>> The question of the internal implementation -- whether to use RB tree
>> rangesets, or radix trees (as apparently ARM memaccess does) or p2m
>> types -- is an internal implementation question.  I think p2m types is
>> long-term the best way to go, but it won't hurt to have the current
>> implementation checked in, as long as it doesn't have any impacts on the
>> stable interface.
>
> I'm still trying to understand your suggestion vs. this one. Today we
> already have a p2m_mmio_write_dm type. It's there already, and any
> write fault hitting that type will be delivered to ioreq server. Then next
> open is how a ioreq server could know whether it should handle this
> request or not, which is why some tracking structures (either RB/radix)
> are created to maintain that specific information. It's under the assumption
> that multiple ioreq servers co-exist, so a loop check on all ioreq servers
> is required to identify the right target. And multiple ioreq servers are
> real case in XenGT, because our vGPU device model is in kernel, as
> part of Intel i915 graphics driver. So at least two ioreq servers already
> exist, with one routing to XenGT in Dom0 kernel space and the other
> to the default Qemu in Dom0 user.
>
> In your long-term approach with p2m types, looks you are proposing
> encoding ioreq server ID in p2m type directly (e.g. 4bits), which then
> eliminates the need of tracking in ioreq server side so the whole
> security concern is gone. And no limitation at all. Because available
> p2m bits are limited, as Andrew pointed out, so it might be reasonable
> to implement this approach when a new p2t structure is added, which
> is why we consider it as a long-term approach.
>
> Please correct me if above understanding is correct?
>
>>
>> At the moment, as far as I can tell, there's no way for libxl to even
>> run a version of qemu with XenGT enabled, so there's no real need for
>> libxl to be involved.
>
> no way because we haven't upstreamed all the toolstack changes yet, but
> we should still discuss the requirement as we've been doing in this
> thread, right? or do you mean something different?
>
>>
>> The purpose of having the limit would putatively be to prevent a guest
>> being able to trigger an exhaustion of hypervisor memory by inducing the
>> device model to mark an arbitrary number of ranges as mmio_dm.
>>
>> Two angles on this.
>>
>> First, assuming that limiting the number of ranges is what we want:  I'm
>> not really a fan of using HVM_PARAMs for this, but as long as it's not
>> considered a public interface (i.e., it could go away or disappear and
>> everything would Just Work), then I wouldn't object.
>>
>> Although I would ask: would it instead be suitable for now to just set
>> the default limit for WP_RAM to 8196 in the hypervisor, since we do
>> expect it to be tracking gpfn ranges rather than IO regions?  And if we
>> determine in the future that more ranges are necessary, to then do the
>> work of moving it to using p2m types (or exposing a knob to adjust it)?
>>
>> But (and this the other angle): is simply marking a numerical limit
>> sufficient to avoid memory exhaustion? Is there a danger that after
>> creating several guests, such that Xen was now running very low on
>> memory, that a guest would (purposely or not) cause memory to be
>> exhausted sometime further after boot, causing a system-wide DoS (or
>> just general lack of stability)?
>>
>> In the shadow / hap memory case, the memory is pre-allocated up front,
>> which makes sure that nothing a guest does can cause Xen to run out of
>> memory once it's booted.  Without pre-allocating it, it's still possible
>> that the admin might start up enough VMs that exhaustion is *possible*,
>> but won't be triggered until later (when the guest starts using more GTTs).
>
> that's a valid concern though I believe not cleanly addressed in many places. :-)
>
>>
>> Although in fact this really points to the need for a general overhaul
>> in how memory allocation on behalf of a domain is handled in general;
>> that's a bigger chunk of work.
>>
>> But in any case, it seems to me that we can entirely avoid the question
>> of how many ranges might ever be necessary by starting with a fixed
>> limit in the hypervisor, and then moving to a p2m-type based
>> implementation if and when that becomes unsatisfactory.
>>
>
> I agree with this approach. Let's see whether we can get consensus from
> others to make a conclusion.
>
> Actually another quick thought. Please help check whether it makes sense
> or just be more tricky.
>
> Could we have an intermediate option toward p2m-type based option, by
> assuming only one ioreq server can handle write_dm related faults? If that's
> the case:
>
> - we don't need to add more bits in p2m type;
> - XenGT can register as the default ioreq server for write_dm;
> - when a write-dm fault is triggered, we'll loop all ioreq servers to find the
> default one and then deliver through that path w/o further check;
> - we may not change the ioreq server interface at all. Assume the ioreq server
> which receives the 1st write-protection request from the device model is the
> default one;

Thank you, Kevin. With this suggestion, we do not need to worry about
the limits in the rangeset, and therefore there are no performance
worries about traversing the ranges. One disadvantage is that we need
to add a field like wp_ioreq_server in struct hvm_domain, yet there's
already a default_ioreq_server for qemu. So I'd like to hear Paul's
opinion about this.

By now, we all agree that this new max_wp_ram_ranges option in the tool
stack is not suitable. To give a summary of the choices we have:

1> George's suggestion: "starting with a fixed limit in the hypervisor,
and then moving to a p2m-type based implementation if and when that
becomes unsatisfactory". My understanding is this looks like the v9
implementation.

2> If hard-coding the limit as 8K is acceptable, we can use a xengt flag
in the hvm configuration, which leaves this choice to the admin without
exposing too many hypervisor implementation details.

3> Kevin's suggestion: leverage the p2m type, which no longer needs the
underlying rangeset.

4> I'll continue to do some path-finding on the device model side, to
see if, after optimization, this limit can be set to a lower value.

Anyway, thank you all for your attention and suggestions.
This weekend is the Chinese Spring Festival, so my mail replies will be
slow over the next 2 weeks. Wish you all good luck in the new year! :)

Yu
Paul Durrant Feb. 5, 2016, 9:24 a.m. UTC | #79
> -----Original Message-----
> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: 05 February 2016 08:33
> To: George Dunlap
> Cc: Andrew Cooper; Ian Campbell; Paul Durrant; Stefano Stabellini; Wei Liu;
> Ian Jackson; Kevin Tian; zhiyuan.lv@intel.com; Zhang Yu; xen-
> devel@lists.xen.org; Keir (Xen.org)
> Subject: Re: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter
> max_wp_ram_ranges.
> 
> >>> On 04.02.16 at 18:12, <george.dunlap@citrix.com> wrote:
> > Two angles on this.
> >
> > First, assuming that limiting the number of ranges is what we want:  I'm
> > not really a fan of using HVM_PARAMs for this, but as long as it's not
> > considered a public interface (i.e., it could go away or disappear and
> > everything would Just Work), then I wouldn't object.
> 
> The parameter setting interface generally is guest exposed, it's just
> that many parameters can't be set by guests for themselves. Of
> course the new #define could be put inside a Xen/tools conditional.
> 
> > Although I would ask: would it instead be suitable for now to just set
> > the default limit for WP_RAM to 8196 in the hypervisor, since we do
> > expect it to be tracking gpfn ranges rather than IO regions?  And if we
> > determine in the future that more ranges are necessary, to then do the
> > work of moving it to using p2m types (or exposing a knob to adjust it)?
> 
> That's what collides with disaggregation (as pointed out before):
> Any tool stack component could then create wp-ram pages in
> guests it controls, no matter whether those guests actually have
> a need for such. I continue to think that if we indeed got this
> route, then the host admin should be required to explicitly state
> that the risks are acceptable (by allowing only certain guests to
> actually create [many] of such pages).
> 
> > But (and this the other angle): is simply marking a numerical limit
> > sufficient to avoid memory exhaustion? Is there a danger that after
> > creating several guests, such that Xen was now running very low on
> > memory, that a guest would (purposely or not) cause memory to be
> > exhausted sometime further after boot, causing a system-wide DoS (or
> > just general lack of stability)?
> 
> The guest itself can't, but other than fully privileged tool stack
> components could, and that's still something we need to avoid.
> 
> > In the shadow / hap memory case, the memory is pre-allocated up front,
> > which makes sure that nothing a guest does can cause Xen to run out of
> > memory once it's booted.  Without pre-allocating it, it's still possible
> > that the admin might start up enough VMs that exhaustion is *possible*,
> > but won't be triggered until later (when the guest starts using more GTTs).
> >
> > Although in fact this really points to the need for a general overhaul
> > in how memory allocation on behalf of a domain is handled in general;
> > that's a bigger chunk of work.
> 
> Well, right now there's pretty little allocation happening at run time
> afaict, i.e. having a couple of pages available will generally keep
> things going smoothly. (Once upon a time it was that there were -
> iirc - no active runtime allocations at all, i.e. such had occurred only
> in the context of things like domctls, where failure could be dealt
> with by ballooning a few pages out of some domain and retrying.
> I'm afraid that's not the case anymore nowadays, and even if it
> was would collide with disaggregation, as a tool stack component
> legitimately issuing a domctl may not have the privileges to
> balloon other than the domain it controls, in particular not Dom0.)
> 
> > But in any case, it seems to me that we can entirely avoid the question
> > of how many ranges might ever be necessary by starting with a fixed
> > limit in the hypervisor, and then moving to a p2m-type based
> > implementation if and when that becomes unsatisfactory.
> 
> With all of the above I think we should rather explore the p2m-type
> based approach, in particular the suggestion Kevin has made to
> direct all p2m_mmio_write_dm (a name which appears to have been
> badly chosen, considering that we're talking about RAM pages
> here) write accesses to the default ioreq server (utilizing that
> upstream qemu doesn't itself register as such).
> 

Utilizing the default server is a backwards step. GVT-g would have to use the old HVM_PARAM mechanism to cause its emulator to become default. I think a more appropriate mechanism would be for p2m_mmio_write_dm to become something like 'p2m_ioreq_server_write' and then have a hypercall to allow it to be mapped to a particular ioreq server.
Obviously only one could claim it but, with a p2t, the bit could be re-purposed to simply mean 'go look in the p2t' for more information and then the p2t could be structured to allow emulations to be steered to one of many ioreq servers (for read and/or write emulation).
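
As a purely illustrative sketch of what such a p2t entry might carry (the
field names and widths here are invented, not a design):

    /* One p2t entry per page whose p2m type says "consult the p2t": */
    struct p2t_entry {
        unsigned int ioreq_server : 4;   /* which of up to 16 ioreq servers */
        unsigned int emulate_read  : 1;  /* steer read accesses to it? */
        unsigned int emulate_write : 1;  /* steer write accesses to it? */
    };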

  Paul

> Jan
Jan Beulich Feb. 5, 2016, 10:41 a.m. UTC | #80
>>> On 05.02.16 at 10:24, <Paul.Durrant@citrix.com> wrote:
> Utilizing the default server is a backwards step. GVT-g would have to use the 
> old HVM_PARAM mechanism to cause it's emulator to become default. I think a 
> more appropriate mechanism would be p2m_mmio_write_dm to become something 
> like 'p2m_ioreq_server_write' and then have a hypercall to allow it to be 
> mapped to a particular ioreq server.
> Obviously only one could claim it but, with a p2t, the bit could be 
> re-purposed to simply mean 'go look in the p2t' for more information and 
> then the p2t could be structured to allow emulations to be steered to one of 
> many ioreq servers (for read and/or write emulation).

Sounds reasonable.

Jan
George Dunlap Feb. 5, 2016, 11:05 a.m. UTC | #81
On Fri, Feb 5, 2016 at 3:44 AM, Tian, Kevin <kevin.tian@intel.com> wrote:
>> > So as long as the currently-in-use GTT tree contains no more than
>> > $LIMIT ranges, you can unshadow and reshadow; this will be slow, but
>> > strictly speaking correct.
>> >
>> > What do you do if the guest driver switches to a GTT such that the
>> > entire tree takes up more than $LIMIT entries?
>>
>> GPU has some special properties different from CPU, which make things
>> easier. The GPU page table is constructed by CPU and used by GPU
>> workloads. GPU workload itself will not change the page table.
>> Meanwhile, GPU workload submission, in virtualized environment, is
>> controled by our device model. Then we can reshadow the whole table
>> every time before we submit workload. That can reduce the total number
>> required for write protection but with performance impact, because GPU
>> has to have more idle time waiting for CPU. Hope the info helps.
>> Thanks!
>>
>
> Putting in another way, it's fully under mediation when a GPU page table
> (GTT) will be referenced by the GPU, so there're plenty of room to
> optimize existing shadowing (always shadowing all recognized GPU page
> tables), e.g. shadowing only active one when a VM is scheduled in. It's
> a performance matter but no correctness issue.
>
> This is why Yu mentioned earlier whether we can just set a default
> limit which is good for majority of use cases, while extending our
> device mode to drop/recreate some shadow tables upon the limitation
> is hit. I think this matches how today's CPU shadow page table is
> implemented, which also has a limitation of how many shadow pages
> are allowed per-VM.

I don't think you've understood my question (or maybe I still don't
understand the situation properly).

So in memory pagetables, there's a "tree" that contains a single
top-level page, which points to other pages, which defines one address
space (usually corresponding to one process or thread).   (This is
often just referred to as 'cr3', since it's the value you write into
the cr3 register on x86 processors.) I'm assuming that the structure
is similar for your GPU translation tables -- that a single GTT is
effectively a "tree" sort of like a process address space for an OS.

And it sounds like what you're saying is: suppose we have 10 different
GTTs (i.e., an entire tree / gpu thread), and each one requires 1024
ranges to shadow.  In that case, a limit of 8192 ranges means we can
only keep 8 of the ten actually shadowed at any one time.  This is not
optimal, since it will occasionally mean unshadowing an entire GTT and
re-shadowing another one, but it will work, because we can always make
sure that the currently-active GTT is shadowed.

My question is, suppose a single GTT / gpu thread / tree has 9000
ranges.  It would be trivial for an attacker to break into the
operating system and *construct* such a tree, but it's entirely
possible that due to a combination of memory fragmentation and very
large usage, the normal driver might accidentally create such a GTT.
In that case, the device model will not be able to write-protect all
the pages in the single GTT, and thus will not be able to correctly
track changes to the currently-active GTT.  What does your device
model do in that case?

 -George
George Dunlap Feb. 5, 2016, 11:14 a.m. UTC | #82
On Fri, Feb 5, 2016 at 9:24 AM, Paul Durrant <Paul.Durrant@citrix.com> wrote:
> Utilizing the default server is a backwards step. GVT-g would have to use the old HVM_PARAM mechanism to cause it's emulator to become default. I think a more appropriate mechanism would be p2m_mmio_write_dm to become something like 'p2m_ioreq_server_write' and then have a hypercall to allow it to be mapped to a particular ioreq server.
> Obviously only one could claim it but, with a p2t, the bit could be re-purposed to simply mean 'go look in the p2t' for more information and then the p2t could be structured to allow emulations to be steered to one of many ioreq servers (for read and/or write emulation).

Right; I had in mind that Xen would allow at any given time a max of N
ioreq servers to register for mmio_write_dm ranges, first-come
first-served; with 'N' being '1' to begin with.  If a second ioreq
server requested mmio_write_dm functionality, it would get -EBUSY.
This would allow their current setup (one qemu dm which doesn't do
mmio_write_dm, one xengt dm which does) to work without needing to
worry any more about how many pages might need to be tracked (either
for efficiency or correctness).

We could then extend this to some larger number (4 seems pretty
reasonable to me) either by adding an extra 3 types, or by some other
method (such as the one Paul suggests).
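
A minimal sketch of that first-come-first-served claim, assuming a
hypothetical per-domain owner field and 'N' still being 1:

    static int claim_mmio_write_dm(struct domain *d,
                                   struct hvm_ioreq_server *s)
    {
        /* hypothetical field naming the server that claimed write_dm */
        struct hvm_ioreq_server **owner =
            &d->arch.hvm_domain.wp_ioreq_server;

        if ( *owner != NULL && *owner != s )
            return -EBUSY;   /* another ioreq server got here first */

        *owner = s;
        return 0;
    }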

 -George
Paul Durrant Feb. 5, 2016, 11:24 a.m. UTC | #83
> -----Original Message-----
> From: dunlapg@gmail.com [mailto:dunlapg@gmail.com] On Behalf Of
> George Dunlap
> Sent: 05 February 2016 11:14
> To: Paul Durrant
> Cc: Jan Beulich; George Dunlap; Kevin Tian; Wei Liu; Ian Campbell; Andrew
> Cooper; Zhang Yu; xen-devel@lists.xen.org; Stefano Stabellini;
> zhiyuan.lv@intel.com; Ian Jackson; Keir (Xen.org)
> Subject: Re: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter
> max_wp_ram_ranges.
> 
> On Fri, Feb 5, 2016 at 9:24 AM, Paul Durrant <Paul.Durrant@citrix.com>
> wrote:
> > Utilizing the default server is a backwards step. GVT-g would have to use
> the old HVM_PARAM mechanism to cause it's emulator to become default. I
> think a more appropriate mechanism would be p2m_mmio_write_dm to
> become something like 'p2m_ioreq_server_write' and then have a hypercall
> to allow it to be mapped to a particular ioreq server.
> > Obviously only one could claim it but, with a p2t, the bit could be re-
> purposed to simply mean 'go look in the p2t' for more information and then
> the p2t could be structured to allow emulations to be steered to one of many
> ioreq servers (for read and/or write emulation).
> 
> Right; I had in mind that Xen would allow at any given time a max of N
> ioreq servers to register for mmio_write_dm ranges, first-come
> first-served; with 'N' being '1' to begin with.  If a second ioreq
> server requested mmio_write_dm functionality, it would get -EBUSY.
> This would allow their current setup (one qemu dm which doesn't do
> mmio_write_dm, one xengt dm which does) to work without needing to
> worry any more about how many pages might need to be tracked (either
> for efficiency or correctness).
> 
> We could then extend this to some larger number (4 seems pretty
> reasonable to me) either by adding an extra 3 types, or by some other
> method (such as the one Paul suggests).

I think it would be best to do away with the 'write dm' name though. I would like to see it be possible to steer reads+writes, as well as writes (and maybe just reads?) to a particular ioreq server based on type information. So maybe we just call the existing type 'p2m_ioreq_server' and then, in the absence of a p2t, hardcode this to go to whichever emulator makes the new TBD hypercall.
I think we need a proper design at this point. Given that it's Chinese New Year maybe I'll have a stab in Yu's absence.

  Paul

> 
>  -George
Zhiyuan Lv Feb. 5, 2016, 3:13 p.m. UTC | #84
Hi George,

On Fri, Feb 05, 2016 at 11:05:39AM +0000, George Dunlap wrote:
> On Fri, Feb 5, 2016 at 3:44 AM, Tian, Kevin <kevin.tian@intel.com> wrote:
> >> > So as long as the currently-in-use GTT tree contains no more than
> >> > $LIMIT ranges, you can unshadow and reshadow; this will be slow, but
> >> > strictly speaking correct.
> >> >
> >> > What do you do if the guest driver switches to a GTT such that the
> >> > entire tree takes up more than $LIMIT entries?
> >>
> >> GPU has some special properties different from CPU, which make things
> >> easier. The GPU page table is constructed by CPU and used by GPU
> >> workloads. GPU workload itself will not change the page table.
> >> Meanwhile, GPU workload submission, in virtualized environment, is
> >> controled by our device model. Then we can reshadow the whole table
> >> every time before we submit workload. That can reduce the total number
> >> required for write protection but with performance impact, because GPU
> >> has to have more idle time waiting for CPU. Hope the info helps.
> >> Thanks!
> >>
> >
> > Putting in another way, it's fully under mediation when a GPU page table
> > (GTT) will be referenced by the GPU, so there're plenty of room to
> > optimize existing shadowing (always shadowing all recognized GPU page
> > tables), e.g. shadowing only active one when a VM is scheduled in. It's
> > a performance matter but no correctness issue.
> >
> > This is why Yu mentioned earlier whether we can just set a default
> > limit which is good for majority of use cases, while extending our
> > device mode to drop/recreate some shadow tables upon the limitation
> > is hit. I think this matches how today's CPU shadow page table is
> > implemented, which also has a limitation of how many shadow pages
> > are allowed per-VM.
> 
> I don't think you've understood my question (or maybe I still don't
> understood the situation properly).
> 
> So in memory pagetables, there's a "tree" that contains a single
> top-level page, which points to other pages, which defines one address
> space (usually corresponding to one process or thread).   (This is
> often just refered to as 'cr3', since it's the value you write into
> the cr3 register on x86 processors.) I'm assuming that the structure
> is similar for your GPU translation tables -- that a single GTT is
> effectively a "tree" sort of like a process address space for an OS.
> 
> And it sounds like what you're saying is: suppose we have 10 different
> GTTs (i.e., an entire tree / gpu thread), and each one require 1024
> ranges to shadow.  In that case, a limit of 8192 ranges means we can
> only keep 8 of the ten actually shadowed at any one time.  This is not
> optimal, since it will occasionally mean unshadowing an entire GTT and
> re-shadowing another one, but it will work, because we can always make
> sure that the currently-active GTT is shadowed.
> 
> My question is, suppose a single GTT / gpu thread / tree has 9000
> ranges.  It would be trivial for an attacker to break into the
> operating system and *construct* such a tree, but it's entirely
> possible that due to a combination of memory fragmentation and very
> large usage, the normal driver might accidentally create such a GTT.
> In that case, the device model will not be able to write-protect all
> the pages in the single GTT, and thus will not be able to correctly
> track changes to the currently-active GTT.  What does your device
> model do in that case?

We can live with a partially write-protected tree. That is because the
GPU's workload execution is controlled by the device model. We still
have a chance to update the shadow page table before we submit the
workload to the GPU. The impact is on performance, not correctness. Thanks!

-Zhiyuan

> 
>  -George
George Dunlap Feb. 5, 2016, 8:14 p.m. UTC | #85
On Fri, Feb 5, 2016 at 3:13 PM, Zhiyuan Lv <zhiyuan.lv@intel.com> wrote:
>> My question is, suppose a single GTT / gpu thread / tree has 9000
>> ranges.  It would be trivial for an attacker to break into the
>> operating system and *construct* such a tree, but it's entirely
>> possible that due to a combination of memory fragmentation and very
>> large usage, the normal driver might accidentally create such a GTT.
>> In that case, the device model will not be able to write-protect all
>> the pages in the single GTT, and thus will not be able to correctly
>> track changes to the currently-active GTT.  What does your device
>> model do in that case?
>
> We can live with the partially write protected tree. That is because
> GPU's workload execution is controlled by the device model. We still
> have chance to update the shadow page table before we submit workload
> to GPU. The impact is performance not correctness. Thanks!

Right -- so it's actually never a hard limit.  That's good to know.

 -George
Tian, Kevin Feb. 16, 2016, 7:22 a.m. UTC | #86
> From: Paul Durrant [mailto:Paul.Durrant@citrix.com]
> Sent: Friday, February 05, 2016 7:24 PM
> 
> > -----Original Message-----
> > From: dunlapg@gmail.com [mailto:dunlapg@gmail.com] On Behalf Of
> > George Dunlap
> > Sent: 05 February 2016 11:14
> > To: Paul Durrant
> > Cc: Jan Beulich; George Dunlap; Kevin Tian; Wei Liu; Ian Campbell; Andrew
> > Cooper; Zhang Yu; xen-devel@lists.xen.org; Stefano Stabellini;
> > zhiyuan.lv@intel.com; Ian Jackson; Keir (Xen.org)
> > Subject: Re: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter
> > max_wp_ram_ranges.
> >
> > On Fri, Feb 5, 2016 at 9:24 AM, Paul Durrant <Paul.Durrant@citrix.com>
> > wrote:
> > > Utilizing the default server is a backwards step. GVT-g would have to use
> > the old HVM_PARAM mechanism to cause it's emulator to become default. I
> > think a more appropriate mechanism would be p2m_mmio_write_dm to
> > become something like 'p2m_ioreq_server_write' and then have a hypercall
> > to allow it to be mapped to a particular ioreq server.
> > > Obviously only one could claim it but, with a p2t, the bit could be re-
> > purposed to simply mean 'go look in the p2t' for more information and then
> > the p2t could be structured to allow emulations to be steered to one of many
> > ioreq servers (for read and/or write emulation).
> >
> > Right; I had in mind that Xen would allow at any given time a max of N
> > ioreq servers to register for mmio_write_dm ranges, first-come
> > first-served; with 'N' being '1' to begin with.  If a second ioreq
> > server requested mmio_write_dm functionality, it would get -EBUSY.
> > This would allow their current setup (one qemu dm which doesn't do
> > mmio_write_dm, one xengt dm which does) to work without needing to
> > worry any more about how many pages might need to be tracked (either
> > for efficiency or correctness).
> >
> > We could then extend this to some larger number (4 seems pretty
> > reasonable to me) either by adding an extra 3 types, or by some other
> > method (such as the one Paul suggests).
> 
> I think it would be best to do away with the 'write dm' name though. I would like to see it
> be possible to steer reads+writes, as well as writes (and maybe just reads?) to a particular
> ioreq server based on type information. So maybe we just call the existing type
> 'p2m_ioreq_server' and then, in the absence of a p2t, hardcode this to go to whichever
> emulator makes the new TBD hypercall.
> I think we need a proper design at this point. Given that it's Chinese New Year maybe I'll
> have a stab in Yu's absence.
> 

Hi Paul, how is your progress on this?

My feeling is that we do not need a new hypercall to explicitly claim
whether an ioreq server wants to handle write requests. It can be
marked implicitly when a specific page is requested for
write-protection through a specific ioreq channel, and then that
ioreq server claims the attribute automatically.
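
For illustration only, such an implicit claim could sit in the existing
handler for HVMOP_map_io_range_to_ioreq_server, roughly like this (the
range-type constant and field name are placeholders):

    /* First write-protected-RAM range mapped for this domain: remember
     * the calling ioreq server as the implicit owner of write_dm faults. */
    if ( type == WP_RAM_RANGE_TYPE &&
         d->arch.hvm_domain.wp_ioreq_server == NULL )
        d->arch.hvm_domain.wp_ioreq_server = s;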

Thanks
Kevin
Paul Durrant Feb. 16, 2016, 8:50 a.m. UTC | #87
> -----Original Message-----
> From: Tian, Kevin [mailto:kevin.tian@intel.com]
> Sent: 16 February 2016 07:23
> To: Paul Durrant; George Dunlap
> Cc: Jan Beulich; George Dunlap; Wei Liu; Ian Campbell; Andrew Cooper;
> Zhang Yu; xen-devel@lists.xen.org; Stefano Stabellini; Lv, Zhiyuan; Ian
> Jackson; Keir (Xen.org)
> Subject: RE: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter
> max_wp_ram_ranges.
> 
> > From: Paul Durrant [mailto:Paul.Durrant@citrix.com]
> > Sent: Friday, February 05, 2016 7:24 PM
> >
> > > -----Original Message-----
> > > From: dunlapg@gmail.com [mailto:dunlapg@gmail.com] On Behalf Of
> > > George Dunlap
> > > Sent: 05 February 2016 11:14
> > > To: Paul Durrant
> > > Cc: Jan Beulich; George Dunlap; Kevin Tian; Wei Liu; Ian Campbell;
> Andrew
> > > Cooper; Zhang Yu; xen-devel@lists.xen.org; Stefano Stabellini;
> > > zhiyuan.lv@intel.com; Ian Jackson; Keir (Xen.org)
> > > Subject: Re: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter
> > > max_wp_ram_ranges.
> > >
> > > On Fri, Feb 5, 2016 at 9:24 AM, Paul Durrant <Paul.Durrant@citrix.com>
> > > wrote:
> > > > Utilizing the default server is a backwards step. GVT-g would have to
> use
> > > the old HVM_PARAM mechanism to cause it's emulator to become
> default. I
> > > think a more appropriate mechanism would be p2m_mmio_write_dm to
> > > become something like 'p2m_ioreq_server_write' and then have a
> hypercall
> > > to allow it to be mapped to a particular ioreq server.
> > > > Obviously only one could claim it but, with a p2t, the bit could be re-
> > > purposed to simply mean 'go look in the p2t' for more information and
> then
> > > the p2t could be structured to allow emulations to be steered to one of
> many
> > > ioreq servers (for read and/or write emulation).
> > >
> > > Right; I had in mind that Xen would allow at any given time a max of N
> > > ioreq servers to register for mmio_write_dm ranges, first-come
> > > first-served; with 'N' being '1' to begin with.  If a second ioreq
> > > server requested mmio_write_dm functionality, it would get -EBUSY.
> > > This would allow their current setup (one qemu dm which doesn't do
> > > mmio_write_dm, one xengt dm which does) to work without needing to
> > > worry any more about how many pages might need to be tracked (either
> > > for efficiency or correctness).
> > >
> > > We could then extend this to some larger number (4 seems pretty
> > > reasonable to me) either by adding an extra 3 types, or by some other
> > > method (such as the one Paul suggests).
> >
> > I think it would be best to do away with the 'write dm' name though. I
> would like to see it
> > be possible to steer reads+writes, as well as writes (and maybe just reads?)
> to a particular
> > ioreq server based on type information. So maybe we just call the existing
> type
> > 'p2m_ioreq_server' and then, in the absence of a p2t, hardcode this to go
> to whichever
> > emulator makes the new TBD hypercall.
> > I think we need a proper design at this point. Given that it's Chinese New
> Year maybe I'll
> > have a stab in Yu's absence.
> >
> 
> Hi, Paul, what about your progress on this?
> 
> My feeling is that we do not need a new hypercall to explicitly claim
> whether a ioreq server wants to handle write requests. It can be
> implicitly marked upon whether a specific page is requested for
> write-protection through a specific ioreq channel, and then that
> ioreq server will claim the attribute automatically.

Hi Kevin,

Is there a hypercall to do that? Maybe I'm missing something but I was under the impression that the only way to set write protection was via an HVMOP_set_mem_type and that does not carry an ioreq server id.

I'm afraid I have made little progress due to the distraction of trying to get some patches into Linux, but my thoughts are around replacing HVM_mmio_write_dm with something like HVM_emulate_0 (i.e. the zero-th example of a type that requires emulation, to be followed by others in future) and then adding a hypercall along the lines of HVMOP_map_mem_type_to_ioreq_server which will take an ioreq server id, a type and flags saying whether it wishes to handle reads and/or writes to that type.
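
Just to sketch the shape such a hypercall argument might take (this is a
guess at the interface described above, not a committed ABI):

    struct xen_hvm_map_mem_type_to_ioreq_server {
        domid_t    domid;   /* IN - domain to be serviced */
        ioservid_t id;      /* IN - ioreq server id */
        uint16_t   type;    /* IN - memory type, e.g. the HVM_emulate_0 above */
        uint32_t   flags;   /* IN - bitmask: forward reads and/or writes */
    };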

Thoughts (anyone)?

  Paul

> 
> Thanks
> Kevin
Jan Beulich Feb. 16, 2016, 10:33 a.m. UTC | #88
>>> On 16.02.16 at 09:50, <Paul.Durrant@citrix.com> wrote:
>>  -----Original Message-----
>> From: Tian, Kevin [mailto:kevin.tian@intel.com]
>> Sent: 16 February 2016 07:23
>> To: Paul Durrant; George Dunlap
>> Cc: Jan Beulich; George Dunlap; Wei Liu; Ian Campbell; Andrew Cooper;
>> Zhang Yu; xen-devel@lists.xen.org; Stefano Stabellini; Lv, Zhiyuan; Ian
>> Jackson; Keir (Xen.org)
>> Subject: RE: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter
>> max_wp_ram_ranges.
>> 
>> > From: Paul Durrant [mailto:Paul.Durrant@citrix.com]
>> > Sent: Friday, February 05, 2016 7:24 PM
>> >
>> > > -----Original Message-----
>> > > From: dunlapg@gmail.com [mailto:dunlapg@gmail.com] On Behalf Of
>> > > George Dunlap
>> > > Sent: 05 February 2016 11:14
>> > > To: Paul Durrant
>> > > Cc: Jan Beulich; George Dunlap; Kevin Tian; Wei Liu; Ian Campbell;
>> Andrew
>> > > Cooper; Zhang Yu; xen-devel@lists.xen.org; Stefano Stabellini;
>> > > zhiyuan.lv@intel.com; Ian Jackson; Keir (Xen.org)
>> > > Subject: Re: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter
>> > > max_wp_ram_ranges.
>> > >
>> > > On Fri, Feb 5, 2016 at 9:24 AM, Paul Durrant <Paul.Durrant@citrix.com>
>> > > wrote:
>> > > > Utilizing the default server is a backwards step. GVT-g would have to
>> use
>> > > the old HVM_PARAM mechanism to cause it's emulator to become
>> default. I
>> > > think a more appropriate mechanism would be p2m_mmio_write_dm to
>> > > become something like 'p2m_ioreq_server_write' and then have a
>> hypercall
>> > > to allow it to be mapped to a particular ioreq server.
>> > > > Obviously only one could claim it but, with a p2t, the bit could be re-
>> > > purposed to simply mean 'go look in the p2t' for more information and
>> then
>> > > the p2t could be structured to allow emulations to be steered to one of
>> many
>> > > ioreq servers (for read and/or write emulation).
>> > >
>> > > Right; I had in mind that Xen would allow at any given time a max of N
>> > > ioreq servers to register for mmio_write_dm ranges, first-come
>> > > first-served; with 'N' being '1' to begin with.  If a second ioreq
>> > > server requested mmio_write_dm functionality, it would get -EBUSY.
>> > > This would allow their current setup (one qemu dm which doesn't do
>> > > mmio_write_dm, one xengt dm which does) to work without needing to
>> > > worry any more about how many pages might need to be tracked (either
>> > > for efficiency or correctness).
>> > >
>> > > We could then extend this to some larger number (4 seems pretty
>> > > reasonable to me) either by adding an extra 3 types, or by some other
>> > > method (such as the one Paul suggests).
>> >
>> > I think it would be best to do away with the 'write dm' name though. I
>> would like to see it
>> > be possible to steer reads+writes, as well as writes (and maybe just 
> reads?)
>> to a particular
>> > ioreq server based on type information. So maybe we just call the existing
>> type
>> > 'p2m_ioreq_server' and then, in the absence of a p2t, hardcode this to go
>> to whichever
>> > emulator makes the new TBD hypercall.
>> > I think we need a proper design at this point. Given that it's Chinese New
>> Year maybe I'll
>> > have a stab in Yu's absence.
>> >
>> 
>> Hi, Paul, what about your progress on this?
>> 
>> My feeling is that we do not need a new hypercall to explicitly claim
>> whether a ioreq server wants to handle write requests. It can be
>> implicitly marked upon whether a specific page is requested for
>> write-protection through a specific ioreq channel, and then that
>> ioreq server will claim the attribute automatically.
> 
> Hi Kevin,
> 
> Is there a hypercall to do that? Maybe I'm missing something but I was under 
> the impression that the only way to set write protection was via an 
> HVMOP_set_mem_type and that does not carry an ioreq server id.
> 
> I'm afraid I have made little progress due to the distractions of trying get 
> some patches into Linux but my thoughts are around replacing the 
> HVM_mmio_write_dm with something like HVM_emulate_0 (i.e. the zero-th example 
> of a type that requires emulation, to be followed by others in future) and 
> then add a hypercall along the lines of HVMOP_map_mem_type_to_ioreq_server 
> which will take an ioerq server id, a type and flags saying whether it wishes 
> to handle reads and/or writes to that type.
> 
> Thoughts (anyone)?

I think, as a general idea, also allowing reads to be intercepted is
nice, but it would incur quite a few changes which we don't currently
have a user for. Hence I'd suggest making the public interface
ready for that without actually implementing hypervisor support.

Jan
Paul Durrant Feb. 16, 2016, 11:11 a.m. UTC | #89
> -----Original Message-----
> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: 16 February 2016 10:34
> To: Paul Durrant
> Cc: Andrew Cooper; George Dunlap; Ian Campbell; Ian Jackson; Stefano
> Stabellini; Wei Liu; Kevin Tian; Zhiyuan Lv; Zhang Yu; xen-devel@lists.xen.org;
> George Dunlap; Keir (Xen.org)
> Subject: RE: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter
> max_wp_ram_ranges.
> 
> >>> On 16.02.16 at 09:50, <Paul.Durrant@citrix.com> wrote:
> >>  -----Original Message-----
> >> From: Tian, Kevin [mailto:kevin.tian@intel.com]
> >> Sent: 16 February 2016 07:23
> >> To: Paul Durrant; George Dunlap
> >> Cc: Jan Beulich; George Dunlap; Wei Liu; Ian Campbell; Andrew Cooper;
> >> Zhang Yu; xen-devel@lists.xen.org; Stefano Stabellini; Lv, Zhiyuan; Ian
> >> Jackson; Keir (Xen.org)
> >> Subject: RE: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter
> >> max_wp_ram_ranges.
> >>
> >> > From: Paul Durrant [mailto:Paul.Durrant@citrix.com]
> >> > Sent: Friday, February 05, 2016 7:24 PM
> >> >
> >> > > -----Original Message-----
> >> > > From: dunlapg@gmail.com [mailto:dunlapg@gmail.com] On Behalf Of
> >> > > George Dunlap
> >> > > Sent: 05 February 2016 11:14
> >> > > To: Paul Durrant
> >> > > Cc: Jan Beulich; George Dunlap; Kevin Tian; Wei Liu; Ian Campbell;
> >> Andrew
> >> > > Cooper; Zhang Yu; xen-devel@lists.xen.org; Stefano Stabellini;
> >> > > zhiyuan.lv@intel.com; Ian Jackson; Keir (Xen.org)
> >> > > Subject: Re: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter
> >> > > max_wp_ram_ranges.
> >> > >
> >> > > On Fri, Feb 5, 2016 at 9:24 AM, Paul Durrant
> <Paul.Durrant@citrix.com>
> >> > > wrote:
> >> > > > Utilizing the default server is a backwards step. GVT-g would have to
> >> use
> >> > > the old HVM_PARAM mechanism to cause it's emulator to become
> >> default. I
> >> > > think a more appropriate mechanism would be p2m_mmio_write_dm
> to
> >> > > become something like 'p2m_ioreq_server_write' and then have a
> >> hypercall
> >> > > to allow it to be mapped to a particular ioreq server.
> >> > > > Obviously only one could claim it but, with a p2t, the bit could be re-
> >> > > purposed to simply mean 'go look in the p2t' for more information and
> >> then
> >> > > the p2t could be structured to allow emulations to be steered to one
> of
> >> many
> >> > > ioreq servers (for read and/or write emulation).
> >> > >
> >> > > Right; I had in mind that Xen would allow at any given time a max of N
> >> > > ioreq servers to register for mmio_write_dm ranges, first-come
> >> > > first-served; with 'N' being '1' to begin with.  If a second ioreq
> >> > > server requested mmio_write_dm functionality, it would get -EBUSY.
> >> > > This would allow their current setup (one qemu dm which doesn't do
> >> > > mmio_write_dm, one xengt dm which does) to work without needing
> to
> >> > > worry any more about how many pages might need to be tracked
> (either
> >> > > for efficiency or correctness).
> >> > >
> >> > > We could then extend this to some larger number (4 seems pretty
> >> > > reasonable to me) either by adding an extra 3 types, or by some other
> >> > > method (such as the one Paul suggests).
> >> >
> >> > I think it would be best to do away with the 'write dm' name though. I
> >> would like to see it
> >> > be possible to steer reads+writes, as well as writes (and maybe just
> > reads?)
> >> to a particular
> >> > ioreq server based on type information. So maybe we just call the
> existing
> >> type
> >> > 'p2m_ioreq_server' and then, in the absence of a p2t, hardcode this to
> go
> >> to whichever
> >> > emulator makes the new TBD hypercall.
> >> > I think we need a proper design at this point. Given that it's Chinese
> New
> >> Year maybe I'll
> >> > have a stab in Yu's absence.
> >> >
> >>
> >> Hi, Paul, what about your progress on this?
> >>
> >> My feeling is that we do not need a new hypercall to explicitly claim
> >> whether a ioreq server wants to handle write requests. It can be
> >> implicitly marked upon whether a specific page is requested for
> >> write-protection through a specific ioreq channel, and then that
> >> ioreq server will claim the attribute automatically.
> >
> > Hi Kevin,
> >
> > Is there a hypercall to do that? Maybe I'm missing something but I was
> under
> > the impression that the only way to set write protection was via an
> > HVMOP_set_mem_type and that does not carry an ioreq server id.
> >
> > I'm afraid I have made little progress due to the distractions of trying get
> > some patches into Linux but my thoughts are around replacing the
> > HVM_mmio_write_dm with something like HVM_emulate_0 (i.e. the zero-
> th example
> > of a type that requires emulation, to be followed by others in future) and
> > then add a hypercall along the lines of
> HVMOP_map_mem_type_to_ioreq_server
> > which will take an ioerq server id, a type and flags saying whether it wishes
> > to handle reads and/or writes to that type.
> >
> > Thoughts (anyone)?
> 
> I think as a general idea also allowing reads to be intercepted is
> nice, but would incur quite a few changes which we don't currently
> have a user for. Hence I'd suggest making the public interface
> ready for that without actually implementing hypervisor support.
> 

Well, we need some form of hypervisor support to replace what's already there.

I'd envisaged that setting the HVM_emulate_0 type on a page would mean nothing until an ioreq server claims it (i.e. it stays as r/w RAM), but when the ioreq server makes the claim the EPT is changed according to whether reads and/or writes are wanted, and then the fault handler steers transactions to the (single, at the moment) ioreq server. I'll need to code up a PoC to make sure I'm not barking up the wrong tree though.
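
A rough sketch of that steering decision (all identifiers below are
placeholders rather than existing Xen symbols):

    static bool forward_to_claimant(struct domain *d, bool is_write)
    {
        /* hypothetical field recording which ioreq server claimed the type */
        struct hvm_ioreq_server *s = d->arch.hvm_domain.emulate_0_owner;

        if ( s == NULL )
            return false;   /* unclaimed: the page behaves as plain r/w RAM */

        /* claimed: honour the read/write flags the server registered */
        return is_write ? s->handle_writes : s->handle_reads;
    }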

  Paul

> Jan
Tian, Kevin Feb. 17, 2016, 3:18 a.m. UTC | #90
> From: Paul Durrant [mailto:Paul.Durrant@citrix.com]
> Sent: Tuesday, February 16, 2016 7:11 PM
> 
> > -----Original Message-----
> > From: Jan Beulich [mailto:JBeulich@suse.com]
> > Sent: 16 February 2016 10:34
> > To: Paul Durrant
> > Cc: Andrew Cooper; George Dunlap; Ian Campbell; Ian Jackson; Stefano
> > Stabellini; Wei Liu; Kevin Tian; Zhiyuan Lv; Zhang Yu; xen-devel@lists.xen.org;
> > George Dunlap; Keir (Xen.org)
> > Subject: RE: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter
> > max_wp_ram_ranges.
> >
> > >>> On 16.02.16 at 09:50, <Paul.Durrant@citrix.com> wrote:
> > >>  -----Original Message-----
> > >> From: Tian, Kevin [mailto:kevin.tian@intel.com]
> > >> Sent: 16 February 2016 07:23
> > >> To: Paul Durrant; George Dunlap
> > >> Cc: Jan Beulich; George Dunlap; Wei Liu; Ian Campbell; Andrew Cooper;
> > >> Zhang Yu; xen-devel@lists.xen.org; Stefano Stabellini; Lv, Zhiyuan; Ian
> > >> Jackson; Keir (Xen.org)
> > >> Subject: RE: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter
> > >> max_wp_ram_ranges.
> > >>
> > >> > From: Paul Durrant [mailto:Paul.Durrant@citrix.com]
> > >> > Sent: Friday, February 05, 2016 7:24 PM
> > >> >
> > >> > > -----Original Message-----
> > >> > > From: dunlapg@gmail.com [mailto:dunlapg@gmail.com] On Behalf Of
> > >> > > George Dunlap
> > >> > > Sent: 05 February 2016 11:14
> > >> > > To: Paul Durrant
> > >> > > Cc: Jan Beulich; George Dunlap; Kevin Tian; Wei Liu; Ian Campbell;
> > >> Andrew
> > >> > > Cooper; Zhang Yu; xen-devel@lists.xen.org; Stefano Stabellini;
> > >> > > zhiyuan.lv@intel.com; Ian Jackson; Keir (Xen.org)
> > >> > > Subject: Re: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter
> > >> > > max_wp_ram_ranges.
> > >> > >
> > >> > > On Fri, Feb 5, 2016 at 9:24 AM, Paul Durrant
> > <Paul.Durrant@citrix.com>
> > >> > > wrote:
> > >> > > > Utilizing the default server is a backwards step. GVT-g would have to
> > >> use
> > >> > > the old HVM_PARAM mechanism to cause it's emulator to become
> > >> default. I
> > >> > > think a more appropriate mechanism would be p2m_mmio_write_dm
> > to
> > >> > > become something like 'p2m_ioreq_server_write' and then have a
> > >> hypercall
> > >> > > to allow it to be mapped to a particular ioreq server.
> > >> > > > Obviously only one could claim it but, with a p2t, the bit could be re-
> > >> > > purposed to simply mean 'go look in the p2t' for more information and
> > >> then
> > >> > > the p2t could be structured to allow emulations to be steered to one
> > of
> > >> many
> > >> > > ioreq servers (for read and/or write emulation).
> > >> > >
> > >> > > Right; I had in mind that Xen would allow at any given time a max of N
> > >> > > ioreq servers to register for mmio_write_dm ranges, first-come
> > >> > > first-served; with 'N' being '1' to begin with.  If a second ioreq
> > >> > > server requested mmio_write_dm functionality, it would get -EBUSY.
> > >> > > This would allow their current setup (one qemu dm which doesn't do
> > >> > > mmio_write_dm, one xengt dm which does) to work without needing
> > to
> > >> > > worry any more about how many pages might need to be tracked
> > (either
> > >> > > for efficiency or correctness).
> > >> > >
> > >> > > We could then extend this to some larger number (4 seems pretty
> > >> > > reasonable to me) either by adding an extra 3 types, or by some other
> > >> > > method (such as the one Paul suggests).
> > >> >
> > >> > I think it would be best to do away with the 'write dm' name though. I
> > >> would like to see it
> > >> > be possible to steer reads+writes, as well as writes (and maybe just
> > > reads?)
> > >> to a particular
> > >> > ioreq server based on type information. So maybe we just call the
> > existing
> > >> type
> > >> > 'p2m_ioreq_server' and then, in the absence of a p2t, hardcode this to
> > go
> > >> to whichever
> > >> > emulator makes the new TBD hypercall.
> > >> > I think we need a proper design at this point. Given that it's Chinese
> > New
> > >> Year maybe I'll
> > >> > have a stab in Yu's absence.
> > >> >
> > >>
> > >> Hi, Paul, what about your progress on this?
> > >>
> > >> My feeling is that we do not need a new hypercall to explicitly claim
> > >> whether a ioreq server wants to handle write requests. It can be
> > >> implicitly marked upon whether a specific page is requested for
> > >> write-protection through a specific ioreq channel, and then that
> > >> ioreq server will claim the attribute automatically.
> > >
> > > Hi Kevin,
> > >
> > > Is there a hypercall to do that? Maybe I'm missing something but I was
> > under
> > > the impression that the only way to set write protection was via an
> > > HVMOP_set_mem_type and that does not carry an ioreq server id.

Thanks for the clarification. I had got a bit mixed up about the current state before. :-)

> > >
> > > I'm afraid I have made little progress due to the distractions of trying get
> > > some patches into Linux but my thoughts are around replacing the
> > > HVM_mmio_write_dm with something like HVM_emulate_0 (i.e. the zero-
> > th example
> > > of a type that requires emulation, to be followed by others in future) and
> > > then add a hypercall along the lines of
> > HVMOP_map_mem_type_to_ioreq_server
> > > which will take an ioerq server id, a type and flags saying whether it wishes
> > > to handle reads and/or writes to that type.
> > >
> > > Thoughts (anyone)?
> >
> > I think as a general idea also allowing reads to be intercepted is
> > nice, but would incur quite a few changes which we don't currently
> > have a user for. Hence I'd suggest making the public interface
> > ready for that without actually implementing hypervisor support.
> >
> 
> Well, we need some form a hypervisor support to replace what's already there.
> 
> I'd envisaged that setting HVM_emulate_0 type on a page would mean nothing until an

for "mean nothing" what is the default policy then if guest happens to access it
before any ioreq server claims it?

> ioreq server claims it (i.e. it stays as r/w RAM) but when the ioreq server makes the claim
> the EPT is changed according to whether reads and/or writes are wanted and then the fault
> handler steers transactions to the (single at the moment) ioreq server. I'll need to code up
> a PoC to make sure I'm not barking up the wrong tree though.
> 

Curious: is there any reason why we must have an HVM_emulate_0 placeholder
first, and why we can't allow an ioreq server to claim any existing type?
Thinking about the XenGT usage, I cannot envisage when a page would
be set to HVM_emulate_0 first. The write-protection operation is dynamic,
driven by guest page table operations, at which point we'd jump directly
to the claim phase...

btw, does this design consider the case where multiple ioreq servers
may claim the same page? For example, different usages may both
want to capture write requests on the same set of pages (say XenGT
selectively write-protects a subset of pages for its shadow GTT, while 
another agent wants to monitor all guest writes to any guest memory 
page).

Thanks
Kevin
Paul Durrant Feb. 17, 2016, 8:58 a.m. UTC | #91
> -----Original Message-----
[snip]
> > > >
> > > > I'm afraid I have made little progress due to the distractions of trying
> get
> > > > some patches into Linux but my thoughts are around replacing the
> > > > HVM_mmio_write_dm with something like HVM_emulate_0 (i.e. the
> zero-
> > > th example
> > > > of a type that requires emulation, to be followed by others in future)
> and
> > > > then add a hypercall along the lines of
> > > HVMOP_map_mem_type_to_ioreq_server
> > > > which will take an ioerq server id, a type and flags saying whether it
> wishes
> > > > to handle reads and/or writes to that type.
> > > >
> > > > Thoughts (anyone)?
> > >
> > > I think as a general idea also allowing reads to be intercepted is
> > > nice, but would incur quite a few changes which we don't currently
> > > have a user for. Hence I'd suggest making the public interface
> > > ready for that without actually implementing hypervisor support.
> > >
> >
> > Well, we need some form a hypervisor support to replace what's already
> there.
> >
> > I'd envisaged that setting HVM_emulate_0 type on a page would mean
> nothing until an
> 
> for "mean nothing" what is the default policy then if guest happens to access
> it
> before any ioreq server claims it?
> 

My thoughts were that, since no specific emulation has yet been requested (because no ioreq server has yet claimed it), the default policy is to treat it as r/w RAM, as I said below. This is because I think the only legal type transitions should be from HVMMEM_ram_rw to HVMMEM_emulate_0 and back again.
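
(A minimal sketch of what that restriction might look like in the hypervisor, assuming the proposed type materialises as a p2m type — called p2m_ioreq_server below purely as a placeholder:)

/* Sketch only: HVMMEM_emulate_0 is assumed to map onto a new p2m type,
 * named p2m_ioreq_server here as a placeholder. */
static bool_t emulate_type_change_allowed(p2m_type_t ot, p2m_type_t nt)
{
    /* Only ram_rw <-> emulate_0 transitions (and no-ops) are legal. */
    return (ot == nt) ||
           (ot == p2m_ram_rw && nt == p2m_ioreq_server) ||
           (ot == p2m_ioreq_server && nt == p2m_ram_rw);
}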

> > ioreq server claims it (i.e. it stays as r/w RAM) but when the ioreq server
> makes the claim
> > the EPT is changed according to whether reads and/or writes are wanted
> and then the fault
> > handler steers transactions to the (single at the moment) ioreq server. I'll
> need to code up
> > a PoC to make sure I'm not barking up the wrong tree though.
> >
> 
> Curious any reason why we must have a HVM_emulate_0 placeholder
> first and why we can't allow ioreq server to claim on any existing type?

Which type were you thinking of? Remember that the ioreq server would be claiming *all* pages of that type.

> Thinking about XenGT usage, I cannot envisage when a page should
> be set to HVM_emulate_0 first. The write-protection operation is dynamic
> according to guest page table operations, upon which we'll directly jump
> to claim phase...

I don't imagine that things would happen that way round in the common case. For XenGT I'd expect the ioreq server to immediately claim HVMMEM_emulate_0 and then set that type on any page that it wants to trap accesses on (which means that
I am assuming that the same emulation - i.e. write accesses only - is desired for all pages... but I think that is indeed the case).
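
(Roughly, from the ioreq server's side, the flow might look like the sketch below. xc_hvm_set_mem_type() exists today; the claim call and the HVMMEM_emulate_0 name are assumptions about the yet-to-be-written interface — with the current tree the type would still be HVMMEM_mmio_write_dm.)

#include <xenctrl.h>

/* Sketch: claim write emulation for the type once, at start of day... */
static int claim_write_emulation(xc_interface *xch, domid_t domid,
                                 ioservid_t id)
{
    /* Hypothetical call - does not exist yet. */
    return xc_hvm_map_mem_type_to_ioreq_server(xch, domid, id,
                                               HVMMEM_emulate_0,
                                               HVMOP_IOREQ_MEM_ACCESS_WRITE);
}

/* ...then flip each shadowed GTT page to the claimed type as the guest
 * builds its page tables, so writes fault and are steered to this server. */
static int track_ppgtt_page(xc_interface *xch, domid_t domid, uint64_t gfn)
{
    return xc_hvm_set_mem_type(xch, domid, HVMMEM_emulate_0, gfn, 1);
}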

> 
> btw does this design consider the case where multiple ioreq servers
> may claim on same page?

Yes it does and there are currently insufficient page types to allow any more than a single ioreq server to claim a type. My plan is that, in future, we can add a p2t mapping table to allow for more types and then introduce HVMMEM_ioreq_1, HVMMEM_ioreq_2, etc.
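
(For what it's worth, a very rough sketch of the sort of thing a p2t could be — every name below is invented, and this only illustrates the "one p2m type says go look in the p2t" idea rather than any actual plan:)

/* Invented names throughout: a per-domain byte-per-gfn table giving the
 * extended type (HVMMEM_ioreq_0, _1, _2, ...) when the p2m entry merely
 * says "consult the p2t". */
static unsigned int p2t_lookup(const struct domain *d, unsigned long gfn)
{
    const uint8_t *p2t = d->arch.hvm_domain.p2t;        /* invented field */

    if ( !p2t || gfn >= d->arch.hvm_domain.p2t_size )   /* invented field */
        return 0;                   /* default: HVMMEM_ioreq_0 */
    return p2t[gfn];                /* index of the claiming server's type */
}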

> For example, different usages may both
> want to capture write requests on the same set of pages (say XenGT
> selectively write-protects a subset of pages due to shadow GTT, while
> another agent wants to monitor all guest writes to any guest memory
> page).

Monitoring is a different thing altogether. Emulation is costly and not something you'd want to use for that purpose. If you want to monitor writes then log-dirty already exists for that purpose.

> 
> Thanks
> Kevin

I hope my explanation helped. I think things will be clearer once I've had a chance to actually put together a design doc and hack up a PoC (probably only for EPT at first).

  Paul
Jan Beulich Feb. 17, 2016, 9:32 a.m. UTC | #92
>>> On 17.02.16 at 09:58, <Paul.Durrant@citrix.com> wrote:
>> > I'd envisaged that setting HVM_emulate_0 type on a page would mean
>> nothing until an
>> 
>> for "mean nothing" what is the default policy then if guest happens to access
>> it
>> before any ioreq server claims it?
>> 
> 
> My thoughts were that, since no specific emulation has yet been requested 
> (because no ioreq server has yet claimed it) that the default policy is to 
> treat it as r/w RAM as I said below. This is because I think the only legal 
> type transitions should be from HVMMEM_ram_rw to HVMMEM_emulate_0 and back 
> again.

+1

Jan
Tian, Kevin Feb. 17, 2016, 9:58 a.m. UTC | #93
> From: Paul Durrant [mailto:Paul.Durrant@citrix.com]
> Sent: Wednesday, February 17, 2016 4:58 PM
> 
> >
> > btw does this design consider the case where multiple ioreq servers
> > may claim on same page?
> 
> Yes it does and there are currently insufficient page types to allow any more than a single
> ioreq server to claim a type. My plan is that, in future, we can add a p2t mapping table to
> allow for more types and then introduce HVMMEM_ioreq_1, HVMMEM_ioreq_2, etc.

so these new types actually represent an ioreq server ID, right? If so, I can then
understand your earlier explanations.

> 
> > For example, different usages may both
> > want to capture write requests on the same set of pages (say XenGT
> > selectively write-protects a subset of pages due to shadow GTT, while
> > another agent wants to monitor all guest writes to any guest memory
> > page).
> 
> Monitoring is a different thing altogether. Emulation is costly and not something you'd want
> to use for that purpose. If you want to monitor writes then log-dirty already exists for that
> purpose.

Agree.

> 
> >
> > Thanks
> > Kevin
> 
> I hope my explanation helped. I think things will be clearer once I've had chance to actually
> put together a design doc. and hack up a PoC (probably only for EPT at first).
> 

Thanks for the help. Let's see whether we can have some solution ready for 4.7. :-)

Thanks
Kevin
Paul Durrant Feb. 17, 2016, 10:03 a.m. UTC | #94
> -----Original Message-----
> From: Tian, Kevin [mailto:kevin.tian@intel.com]
> Sent: 17 February 2016 09:58
> To: Paul Durrant; Jan Beulich
> Cc: Andrew Cooper; George Dunlap; Ian Campbell; Ian Jackson; Stefano
> Stabellini; Wei Liu; Lv, Zhiyuan; Zhang Yu; xen-devel@lists.xen.org; George
> Dunlap; Keir (Xen.org)
> Subject: RE: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter
> max_wp_ram_ranges.
> 
> > From: Paul Durrant [mailto:Paul.Durrant@citrix.com]
> > Sent: Wednesday, February 17, 2016 4:58 PM
> >
> > >
> > > btw does this design consider the case where multiple ioreq servers
> > > may claim on same page?
> >
> > Yes it does and there are currently insufficient page types to allow any
> more than a single
> > ioreq server to claim a type. My plan is that, in future, we can add a p2t
> mapping table to
> > allow for more types and then introduce HVMMEM_ioreq_1,
> HVMMEM_ioreq_2, etc.
> 
> so these new types actually represent ioreq server ID, right? If yes I can then
> understand your earlier explanations.

Well, not quite. Each of these types can be steered to exactly one ioreq server but there will be a new hypercall to control the mapping. The design doc will make it clear.

  Paul

> 
> >
> > > For example, different usages may both
> > > want to capture write requests on the same set of pages (say XenGT
> > > selectively write-protects a subset of pages due to shadow GTT, while
> > > another agent wants to monitor all guest writes to any guest memory
> > > page).
> >
> > Monitoring is a different thing altogether. Emulation is costly and not
> something you'd want
> > to use for that purpose. If you want to monitor writes then log-dirty
> already exists for that
> > purpose.
> 
> Agree.
> 
> >
> > >
> > > Thanks
> > > Kevin
> >
> > I hope my explanation helped. I think things will be clearer once I've had
> chance to actually
> > put together a design doc. and hack up a PoC (probably only for EPT at
> first).
> >
> 
> Thanks for the help. Let's see whether we can have some solution ready for
> 4.7. :-)
> 
> Thanks
> Kevin
Jan Beulich Feb. 17, 2016, 10:22 a.m. UTC | #95
>>> On 17.02.16 at 10:58, <kevin.tian@intel.com> wrote:
> Thanks for the help. Let's see whether we can have some solution ready for 
> 4.7. :-)

Since we now seem to all agree that a different approach is going to
be taken, I think we indeed should revert f5a32c5b8e ("x86/HVM:
differentiate IO/mem resources tracked by ioreq server"). Please
voice objections to the plan pretty soon.

Jan
Paul Durrant Feb. 17, 2016, 10:24 a.m. UTC | #96
> -----Original Message-----
> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: 17 February 2016 10:22
> To: Paul Durrant; Kevin Tian; Zhang Yu
> Cc: Andrew Cooper; George Dunlap; Ian Campbell; Ian Jackson; Stefano
> Stabellini; Wei Liu; Zhiyuan Lv; xen-devel@lists.xen.org; George Dunlap; Keir
> (Xen.org)
> Subject: RE: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter
> max_wp_ram_ranges.
> 
> >>> On 17.02.16 at 10:58, <kevin.tian@intel.com> wrote:
> > Thanks for the help. Let's see whether we can have some solution ready
> for
> > 4.7. :-)
> 
> Since we now seem to all agree that a different approach is going to
> be taken, I think we indeed should revert f5a32c5b8e ("x86/HVM:
> differentiate IO/mem resources tracked by ioreq server"). Please
> voice objections to the plan pretty soon.
> 

I'm happy with reversion at this point.

  Paul
Tian, Kevin Feb. 17, 2016, 10:25 a.m. UTC | #97
> From: Paul Durrant [mailto:Paul.Durrant@citrix.com]
> Sent: Wednesday, February 17, 2016 6:24 PM
> 
> > -----Original Message-----
> > From: Jan Beulich [mailto:JBeulich@suse.com]
> > Sent: 17 February 2016 10:22
> > To: Paul Durrant; Kevin Tian; Zhang Yu
> > Cc: Andrew Cooper; George Dunlap; Ian Campbell; Ian Jackson; Stefano
> > Stabellini; Wei Liu; Zhiyuan Lv; xen-devel@lists.xen.org; George Dunlap; Keir
> > (Xen.org)
> > Subject: RE: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter
> > max_wp_ram_ranges.
> >
> > >>> On 17.02.16 at 10:58, <kevin.tian@intel.com> wrote:
> > > Thanks for the help. Let's see whether we can have some solution ready
> > for
> > > 4.7. :-)
> >
> > Since we now seem to all agree that a different approach is going to
> > be taken, I think we indeed should revert f5a32c5b8e ("x86/HVM:
> > differentiate IO/mem resources tracked by ioreq server"). Please
> > voice objections to the plan pretty soon.
> >
> 
> I'm happy with reversion at this point.
> 

Agree.

Thanks
Kevin
George Dunlap Feb. 17, 2016, 11:01 a.m. UTC | #98
On 17/02/16 10:22, Jan Beulich wrote:
>>>> On 17.02.16 at 10:58, <kevin.tian@intel.com> wrote:
>> Thanks for the help. Let's see whether we can have some solution ready for 
>> 4.7. :-)
> 
> Since we now seem to all agree that a different approach is going to
> be taken, I think we indeed should revert f5a32c5b8e ("x86/HVM:
> differentiate IO/mem resources tracked by ioreq server"). Please
> voice objections to the plan pretty soon.

FWIW, after this discussion, I don't have an objection to the basic
interface in this series as it is, since it addresses my request that it
be memory-based, and it could be switched to using p2m types
behind-the-scenes -- with the exception of the knob to control how many
ranges are allowed (since it exposes the internal implementation).

I wouldn't object to the current implementation going in as it was in v9
(apparently), and then changing the type stuff around behind the scenes
later as an optimization.  I also don't think it would be terribly
difficult to change the implementation as it is to just use write_dm for
a single IOREQ server.  We can rename it ioreq_server and expand it
later.  Sorry if this wasn't clear from my comments before.

A new interface that's been carefully thought through would of course be
nice too.

 -George
Paul Durrant Feb. 17, 2016, 11:12 a.m. UTC | #99
> -----Original Message-----
> From: George Dunlap [mailto:george.dunlap@citrix.com]
> Sent: 17 February 2016 11:02
> To: Jan Beulich; Paul Durrant; Kevin Tian; Zhang Yu
> Cc: Andrew Cooper; Ian Campbell; Ian Jackson; Stefano Stabellini; Wei Liu;
> Zhiyuan Lv; xen-devel@lists.xen.org; George Dunlap; Keir (Xen.org)
> Subject: Re: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter
> max_wp_ram_ranges.
> 
> On 17/02/16 10:22, Jan Beulich wrote:
> >>>> On 17.02.16 at 10:58, <kevin.tian@intel.com> wrote:
> >> Thanks for the help. Let's see whether we can have some solution ready
> for
> >> 4.7. :-)
> >
> > Since we now seem to all agree that a different approach is going to
> > be taken, I think we indeed should revert f5a32c5b8e ("x86/HVM:
> > differentiate IO/mem resources tracked by ioreq server"). Please
> > voice objections to the plan pretty soon.
> 
> FWIW, after this discussion, I don't have an objection to the basic
> interface in this series as it is, since it addresses my request that it
> be memory-based, and it could be switched to using p2m types
> behind-the-scenes -- with the exception of the knob to control how many
> ranges are allowed (since it exposes the internal implementation).
> 
> I wouldn't object to the current implementation going in as it was in v9
> (apparently), and then changing the type stuff around behind the scenes
> later as an optimization.  I also don't think it would be terribly
> difficult to change the implementation as it is to just use write_dm for
> a single IOREQ server.  We can rename it ioreq_server and expand it
> later.  Sorry if this wasn't clear from my comments before.
>

The problem is that, since you objected to wp_mem being treated the same as mmio, we had to introduce a new range type to the map/unmap hypercalls and that will become redundant if we steer by type. If we could have just increased the size of the rangeset for mmio and used that *for now* then weeks of work could have been saved going down this dead end.

  Paul
 
> A new interface that's been carefully thought through would of course be
> nice too.
> 
>  -George
George Dunlap Feb. 22, 2016, 3:56 p.m. UTC | #100
On Wed, Feb 17, 2016 at 11:12 AM, Paul Durrant <Paul.Durrant@citrix.com> wrote:
>> -----Original Message-----
>> From: George Dunlap [mailto:george.dunlap@citrix.com]
>> Sent: 17 February 2016 11:02
>> To: Jan Beulich; Paul Durrant; Kevin Tian; Zhang Yu
>> Cc: Andrew Cooper; Ian Campbell; Ian Jackson; Stefano Stabellini; Wei Liu;
>> Zhiyuan Lv; xen-devel@lists.xen.org; George Dunlap; Keir (Xen.org)
>> Subject: Re: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter
>> max_wp_ram_ranges.
>>
>> On 17/02/16 10:22, Jan Beulich wrote:
>> >>>> On 17.02.16 at 10:58, <kevin.tian@intel.com> wrote:
>> >> Thanks for the help. Let's see whether we can have some solution ready
>> for
>> >> 4.7. :-)
>> >
>> > Since we now seem to all agree that a different approach is going to
>> > be taken, I think we indeed should revert f5a32c5b8e ("x86/HVM:
>> > differentiate IO/mem resources tracked by ioreq server"). Please
>> > voice objections to the plan pretty soon.
>>
>> FWIW, after this discussion, I don't have an objection to the basic
>> interface in this series as it is, since it addresses my request that it
>> be memory-based, and it could be switched to using p2m types
>> behind-the-scenes -- with the exception of the knob to control how many
>> ranges are allowed (since it exposes the internal implementation).
>>
>> I wouldn't object to the current implementation going in as it was in v9
>> (apparently), and then changing the type stuff around behind the scenes
>> later as an optimization.  I also don't think it would be terribly
>> difficult to change the implementation as it is to just use write_dm for
>> a single IOREQ server.  We can rename it ioreq_server and expand it
>> later.  Sorry if this wasn't clear from my comments before.
>>
>
> The problem is that, since you objected to wp_mem being treated the same as mmio, we had to introduce a new range type to the map/unmap hypercalls and that will become redundant if we steer by type. If we could have just increased the size of the rangeset for mmio and used that *for now* then weeks of work could have been saved going down this dead end.

So first of all, let me say that I handled things with respect to this
thread rather poorly in a lot of ways, and you're right to be
frustrated.  All I can say is, I'm sorry and I'll try to do better.

I'm not sure I understand what you're saying with this line: "We had
to introduce a new range type to the map/unmap hypercalls that will
become redundant if we steer by type".

The interface you have here simply says, "Please allow the guest to
read from these gpfns, but give me (an ioreq server) a notification if
the guest attempts to write to them."  One way to implement this is
the way you have done -- by taking the existing (basically unused)
write_dm p2m type, and layering the existing MMIO rangesets on top of
it.  Another way you could have (and AFAICS still can) implemented
this interface is by allowing ioreq servers asking for write
intercepts on gpfns to claim individual p2m types set aside for that
purpose -- starting with the existing write_dm, which would only allow
one such server to register, and later perhaps extending to allow
more.

In what way would the second implementation make the new type for the
map/unmap hypercalls redundant?

 -George
Paul Durrant Feb. 22, 2016, 4:02 p.m. UTC | #101
> -----Original Message-----
> From: dunlapg@gmail.com [mailto:dunlapg@gmail.com] On Behalf Of
> George Dunlap
> Sent: 22 February 2016 15:56
> To: Paul Durrant
> Cc: George Dunlap; Jan Beulich; Kevin Tian; Zhang Yu; Wei Liu; Ian Campbell;
> Andrew Cooper; xen-devel@lists.xen.org; Stefano Stabellini; Zhiyuan Lv; Ian
> Jackson; Keir (Xen.org)
> Subject: Re: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter
> max_wp_ram_ranges.
> 
> On Wed, Feb 17, 2016 at 11:12 AM, Paul Durrant <Paul.Durrant@citrix.com>
> wrote:
> >> -----Original Message-----
> >> From: George Dunlap [mailto:george.dunlap@citrix.com]
> >> Sent: 17 February 2016 11:02
> >> To: Jan Beulich; Paul Durrant; Kevin Tian; Zhang Yu
> >> Cc: Andrew Cooper; Ian Campbell; Ian Jackson; Stefano Stabellini; Wei Liu;
> >> Zhiyuan Lv; xen-devel@lists.xen.org; George Dunlap; Keir (Xen.org)
> >> Subject: Re: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter
> >> max_wp_ram_ranges.
> >>
> >> On 17/02/16 10:22, Jan Beulich wrote:
> >> >>>> On 17.02.16 at 10:58, <kevin.tian@intel.com> wrote:
> >> >> Thanks for the help. Let's see whether we can have some solution
> ready
> >> for
> >> >> 4.7. :-)
> >> >
> >> > Since we now seem to all agree that a different approach is going to
> >> > be taken, I think we indeed should revert f5a32c5b8e ("x86/HVM:
> >> > differentiate IO/mem resources tracked by ioreq server"). Please
> >> > voice objections to the plan pretty soon.
> >>
> >> FWIW, after this discussion, I don't have an objection to the basic
> >> interface in this series as it is, since it addresses my request that it
> >> be memory-based, and it could be switched to using p2m types
> >> behind-the-scenes -- with the exception of the knob to control how many
> >> ranges are allowed (since it exposes the internal implementation).
> >>
> >> I wouldn't object to the current implementation going in as it was in v9
> >> (apparently), and then changing the type stuff around behind the scenes
> >> later as an optimization.  I also don't think it would be terribly
> >> difficult to change the implementation as it is to just use write_dm for
> >> a single IOREQ server.  We can rename it ioreq_server and expand it
> >> later.  Sorry if this wasn't clear from my comments before.
> >>
> >
> > The problem is that, since you objected to wp_mem being treated the
> same as mmio, we had to introduce a new range type to the map/unmap
> hypercalls and that will become redundant if we steer by type. If we could
> have just increased the size of the rangeset for mmio and used that *for
> now* then weeks of work could have been saved going down this dead end.
> 
> So first of all, let me say that I handled things respect to this
> thread rather poorly in a lot of ways, and you're right to be
> frustrated.  All I can say is, I'm sorry and I'll try to do better.
> 
> I'm not sure I understand what you're saying with this line: "We had
> to introduce a new range type to the map/unmap hypercalls that will
> become redundant if we steer by type".
>

What I mean is that if we are going to do away with the concept of 'wp mem' and replace it with a 'special type for steering to an ioreq server', then we only need range types of pci, portio and mmio, as we had before.
 
> The interface you have here simply says, "Please allow the guest to
> read from these gpfns, but give me (an ioreq server) a notification if
> the guest attempts to write to them."  One way to implement this is
> the way you have done -- by taking the existing (basically unused)
> write_dm p2m type, and layering the existing MMIO rangesets on top of
> it.

That type was introduced for this purpose; it was not there before.

>  Another way you could have (and AFAICS still can) implemented
> this interface is by allowing ioreq servers asking for write
> intercepts on gpfns to claim individual p2m types set aside for that
> purpose -- starting with the existing write_dm, which would only allow
> one such server to register, and later perhaps extending to allow
> more.
> 
> In what way would the second implementation make the new type for the
> map/unmap hypercalls redundant?
> 

Because that type is going to go away, now that it has no use.

  Paul

>  -George
George Dunlap Feb. 22, 2016, 4:45 p.m. UTC | #102
On 22/02/16 16:02, Paul Durrant wrote:
>> -----Original Message-----
>> From: dunlapg@gmail.com [mailto:dunlapg@gmail.com] On Behalf Of
>> George Dunlap
>> Sent: 22 February 2016 15:56
>> To: Paul Durrant
>> Cc: George Dunlap; Jan Beulich; Kevin Tian; Zhang Yu; Wei Liu; Ian Campbell;
>> Andrew Cooper; xen-devel@lists.xen.org; Stefano Stabellini; Zhiyuan Lv; Ian
>> Jackson; Keir (Xen.org)
>> Subject: Re: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter
>> max_wp_ram_ranges.
>>
>> On Wed, Feb 17, 2016 at 11:12 AM, Paul Durrant <Paul.Durrant@citrix.com>
>> wrote:
>>>> -----Original Message-----
>>>> From: George Dunlap [mailto:george.dunlap@citrix.com]
>>>> Sent: 17 February 2016 11:02
>>>> To: Jan Beulich; Paul Durrant; Kevin Tian; Zhang Yu
>>>> Cc: Andrew Cooper; Ian Campbell; Ian Jackson; Stefano Stabellini; Wei Liu;
>>>> Zhiyuan Lv; xen-devel@lists.xen.org; George Dunlap; Keir (Xen.org)
>>>> Subject: Re: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter
>>>> max_wp_ram_ranges.
>>>>
>>>> On 17/02/16 10:22, Jan Beulich wrote:
>>>>>>>> On 17.02.16 at 10:58, <kevin.tian@intel.com> wrote:
>>>>>> Thanks for the help. Let's see whether we can have some solution
>> ready
>>>> for
>>>>>> 4.7. :-)
>>>>>
>>>>> Since we now seem to all agree that a different approach is going to
>>>>> be taken, I think we indeed should revert f5a32c5b8e ("x86/HVM:
>>>>> differentiate IO/mem resources tracked by ioreq server"). Please
>>>>> voice objections to the plan pretty soon.
>>>>
>>>> FWIW, after this discussion, I don't have an objection to the basic
>>>> interface in this series as it is, since it addresses my request that it
>>>> be memory-based, and it could be switched to using p2m types
>>>> behind-the-scenes -- with the exception of the knob to control how many
>>>> ranges are allowed (since it exposes the internal implementation).
>>>>
>>>> I wouldn't object to the current implementation going in as it was in v9
>>>> (apparently), and then changing the type stuff around behind the scenes
>>>> later as an optimization.  I also don't think it would be terribly
>>>> difficult to change the implementation as it is to just use write_dm for
>>>> a single IOREQ server.  We can rename it ioreq_server and expand it
>>>> later.  Sorry if this wasn't clear from my comments before.
>>>>
>>>
>>> The problem is that, since you objected to wp_mem being treated the
>> same as mmio, we had to introduce a new range type to the map/unmap
>> hypercalls and that will become redundant if we steer by type. If we could
>> have just increased the size of the rangeset for mmio and used that *for
>> now* then weeks of work could have been saved going down this dead end.
>>
>> So first of all, let me say that I handled things respect to this
>> thread rather poorly in a lot of ways, and you're right to be
>> frustrated.  All I can say is, I'm sorry and I'll try to do better.
>>
>> I'm not sure I understand what you're saying with this line: "We had
>> to introduce a new range type to the map/unmap hypercalls that will
>> become redundant if we steer by type".
>>
> 
> What I means is that if we are going to do away with the concept of 'wp mem' and replace it with 'special type for steering to ioreq server' then we only need range types of pci, portio and mmio as we had before.

I'm still confused.

To me, there are two distinct things here: the interface and the
implementation.

From an interface perspective, ioreq servers need to be able to ask Xen
to intercept writes to specific memory regions.  I insisted that this
interface be specifically designed for memory, not calling them MMIO
regions (since it's actually memory and not MMIO regions); and I said
that the explosion in number of required rangesets demonstrated that
rangesets were a bad fit to implement this interface.

What you did in an earlier version of this series (correct me if I'm
wrong) is to make a separate hypercall for memory, but still keep using
the same internal implementation (i.e., still having a write_dm p2m type
and using rangesets to determine which ioreq server to give the
notification to).  But since the interface for memory looks exactly the
same as the interface for MMIO, at some point this morphed back to "use
xen_hvm_io_range but with a different range type (i.e., WP_MEM)".

From an *interface* perspective that makes sense, because in both cases
you want to be able to specify a domain, an ioreq server, and a range of
physical addresses.  I don't have any objection to the change you made
to hvm_op.h in this version of the series.

But from an *implementation* perspective, you're back to the very thing
I objected to -- having a large and potentially unbounded number of
rangesets, and baking that implementation into the interface by
requiring the administrator to specify the maxiumum number of rangesets
with a HVM_PARAM.

My suggestion was that you retain the same interface that you exposed in
hvm_op.h (adding WP_MEM), but instead of using rangesets to determine
which ioreq server to send the notification to, use p2m types instead.
The interface to qemu can be exactly the same as it is now, but you
won't need to increase the number of rangesets.  In fact, if you only
allow one ioreq server to register for WP_MEM at a time, then you don't
even  need to change the p2m code.
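
(To illustrate, under the assumption of a single registered server: the write-fault path could pick the target by p2m type alone, falling back to the existing range lookups for everything else. Names are approximate — wp_mem_server in particular is an invented field:)

/* Sketch: steer a write fault by p2m type rather than by rangeset. */
static struct hvm_ioreq_server *wp_mem_select_server(struct domain *d,
                                                     p2m_type_t t)
{
    if ( t == p2m_mmio_write_dm )                  /* the existing type */
        return d->arch.hvm_domain.wp_mem_server;   /* invented field */

    return NULL; /* not claimed: use the normal MMIO/portio/PCI range lookup */
}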

One way or another, you still want the ioreq server to be able to
intercept writes to guest memory, so I don't understand in what way "we
are going to do away with the concept of 'wp mem'".

The only question is whether it's going to go through xen_hvm_io_range
or some other hypercall -- and I don't particularly care which one it
ends up being, as long as it doesn't require setting this magic
potentially unbounded parameter.

 -George
Paul Durrant Feb. 22, 2016, 5:01 p.m. UTC | #103
> -----Original Message-----
> From: George Dunlap [mailto:george.dunlap@citrix.com]
> Sent: 22 February 2016 16:46
> To: Paul Durrant
> Cc: Jan Beulich; Kevin Tian; Zhang Yu; Wei Liu; Ian Campbell; Andrew Cooper;
> xen-devel@lists.xen.org; Stefano Stabellini; Zhiyuan Lv; Ian Jackson; Keir
> (Xen.org)
> Subject: Re: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter
> max_wp_ram_ranges.
> 
> On 22/02/16 16:02, Paul Durrant wrote:
> >> -----Original Message-----
> >> From: dunlapg@gmail.com [mailto:dunlapg@gmail.com] On Behalf Of
> >> George Dunlap
> >> Sent: 22 February 2016 15:56
> >> To: Paul Durrant
> >> Cc: George Dunlap; Jan Beulich; Kevin Tian; Zhang Yu; Wei Liu; Ian
> Campbell;
> >> Andrew Cooper; xen-devel@lists.xen.org; Stefano Stabellini; Zhiyuan Lv;
> Ian
> >> Jackson; Keir (Xen.org)
> >> Subject: Re: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter
> >> max_wp_ram_ranges.
> >>
> >> On Wed, Feb 17, 2016 at 11:12 AM, Paul Durrant
> <Paul.Durrant@citrix.com>
> >> wrote:
> >>>> -----Original Message-----
> >>>> From: George Dunlap [mailto:george.dunlap@citrix.com]
> >>>> Sent: 17 February 2016 11:02
> >>>> To: Jan Beulich; Paul Durrant; Kevin Tian; Zhang Yu
> >>>> Cc: Andrew Cooper; Ian Campbell; Ian Jackson; Stefano Stabellini; Wei
> Liu;
> >>>> Zhiyuan Lv; xen-devel@lists.xen.org; George Dunlap; Keir (Xen.org)
> >>>> Subject: Re: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter
> >>>> max_wp_ram_ranges.
> >>>>
> >>>> On 17/02/16 10:22, Jan Beulich wrote:
> >>>>>>>> On 17.02.16 at 10:58, <kevin.tian@intel.com> wrote:
> >>>>>> Thanks for the help. Let's see whether we can have some solution
> >> ready
> >>>> for
> >>>>>> 4.7. :-)
> >>>>>
> >>>>> Since we now seem to all agree that a different approach is going to
> >>>>> be taken, I think we indeed should revert f5a32c5b8e ("x86/HVM:
> >>>>> differentiate IO/mem resources tracked by ioreq server"). Please
> >>>>> voice objections to the plan pretty soon.
> >>>>
> >>>> FWIW, after this discussion, I don't have an objection to the basic
> >>>> interface in this series as it is, since it addresses my request that it
> >>>> be memory-based, and it could be switched to using p2m types
> >>>> behind-the-scenes -- with the exception of the knob to control how
> many
> >>>> ranges are allowed (since it exposes the internal implementation).
> >>>>
> >>>> I wouldn't object to the current implementation going in as it was in v9
> >>>> (apparently), and then changing the type stuff around behind the
> scenes
> >>>> later as an optimization.  I also don't think it would be terribly
> >>>> difficult to change the implementation as it is to just use write_dm for
> >>>> a single IOREQ server.  We can rename it ioreq_server and expand it
> >>>> later.  Sorry if this wasn't clear from my comments before.
> >>>>
> >>>
> >>> The problem is that, since you objected to wp_mem being treated the
> >> same as mmio, we had to introduce a new range type to the map/unmap
> >> hypercalls and that will become redundant if we steer by type. If we could
> >> have just increased the size of the rangeset for mmio and used that *for
> >> now* then weeks of work could have been saved going down this dead
> end.
> >>
> >> So first of all, let me say that I handled things respect to this
> >> thread rather poorly in a lot of ways, and you're right to be
> >> frustrated.  All I can say is, I'm sorry and I'll try to do better.
> >>
> >> I'm not sure I understand what you're saying with this line: "We had
> >> to introduce a new range type to the map/unmap hypercalls that will
> >> become redundant if we steer by type".
> >>
> >
> > What I means is that if we are going to do away with the concept of 'wp
> mem' and replace it with 'special type for steering to ioreq server' then we
> only need range types of pci, portio and mmio as we had before.
> 
> I'm still confused.
> 

So am I. I thought I was being quite clear.

> To me, there are two distinct things here: the interface and the
> implementation.
> 

Indeed.

> From an interface perspective, ioreq servers need to be able to ask Xen
> to intercept writes to specific memory regions.

No. You insisted on making MMIO distinct from 'memory'. We need to be able to intercept MMIO, port IO and PCI access to specific regions.

>  I insisted that this
> interface be specifically designed for memory, not calling them MMIO
> regions (since it's actually memory and not MMIO regions); and I said
> that the explosion in number of required rangesets demonstrated that
> rangesets were a bad fit to implement this interface.
> 

The intended overloading of MMIO for this purpose was a pragmatic choice at the time; however, it appears pragmatism has no place here.

> What you did in an earlier version of this series (correct me if I'm
> wrong) is to make a separate hypercall for memory, but still keep using
> the same internal implementation (i.e., still having a write_dm p2m type
> and using rangesets to determine which ioreq server to give the
> notification to).  But since the interface for memory looks exactly the
> same as the interface for MMIO, at some point this morphed back to "use
> xen_hvm_io_range but with a different range type (i.e., WP_MEM)".
> 

Yes, and that's now pointless since we're going to use purely p2m types for sending memory accesses to ioreq servers.

> From an *interface* perspective that makes sense, because in both cases
> you want to be able to specify a domain, an ioreq server, and a range of
> physical addresses.  I don't have any objection to the change you made
> to hvm_op.h in this version of the series.
> 

No, if we are intercepting 'memory' purely by p2m type then there is no need for the additional range type.

> But from an *implementation* perspective, you're back to the very thing
> I objected to -- having a large and potentially unbounded number of
> rangesets, and baking that implementation into the interface by
> requiring the administrator to specify the maxiumum number of rangesets
> with a HVM_PARAM.
> 
> My suggestion was that you retain the same interface that you exposed in
> hvm_op.h (adding WP_MEM), but instead of using rangesets to determine
> which ioreq server to send the notification to, use p2m types instead.

Ok. But ultimately we want multiple p2m types for this purpose and the ability to send accesses to a specific type to a specific ioreq server.

> The interface to qemu can be exactly the same as it is now, but you
> won't need to increase the number of rangesets.  In fact, if you only
> allow one ioreq server to register for WP_MEM at a time, then you don't
> even  need to change the p2m code.
> 

Who said anything about QEMU? It's not relevant here.

Functionally maybe not at this time *but* that limit of a single ioreq server should not be baked into the interface forever more.

> One way or another, you still want the ioreq server to be able to
> intercept writes to guest memory, so I don't understand in what way "we
> are going to do away with the concept of 'wp mem'".
> 

As I have said previously on this thread, 'wp mem' should be replaced by a 'steer to ioreq server' type. That fulfils the purpose and makes it clear what the type is for. Then additional 'steer to a different ioreq server' types can be added later when we have sufficient type-space.

  Paul

> The only question is whether it's going to go through xen_hvm_io_range
> or some other hypecall -- and I don't particularly care which one it
> ends up being, as long as it doesn't require setting this magic
> potentially unbounded parameter.
> 
>  -George
George Dunlap Feb. 22, 2016, 5:23 p.m. UTC | #104
On 22/02/16 17:01, Paul Durrant wrote:
>> What you did in an earlier version of this series (correct me if I'm
>> wrong) is to make a separate hypercall for memory, but still keep using
>> the same internal implementation (i.e., still having a write_dm p2m type
>> and using rangesets to determine which ioreq server to give the
>> notification to).  But since the interface for memory looks exactly the
>> same as the interface for MMIO, at some point this morphed back to "use
>> xen_hvm_io_range but with a different range type (i.e., WP_MEM)".
>>
> 
> Yes, and that's now pointless since we're going to use purely p2m types for sending memory accesses to ioreq servers.
> 
>> From an *interface* perspective that makes sense, because in both cases
>> you want to be able to specify a domain, an ioreq server, and a range of
>> physical addresses.  I don't have any objection to the change you made
>> to hvm_op.h in this version of the series.
>>
> 
> No, if we are intercepting 'memory' purely by p2m type then there is no need for the additional range type.

So here seems to be the crux of our disagreement.

I don't understand why you think that the WP_MEM interface described in
patch 2 of this series can't be implemented using p2m types rather than
rangesets.

 -George
Paul Durrant Feb. 22, 2016, 5:34 p.m. UTC | #105
> -----Original Message-----
> From: George Dunlap [mailto:george.dunlap@citrix.com]
> Sent: 22 February 2016 17:23
> To: Paul Durrant
> Cc: Jan Beulich; Kevin Tian; Zhang Yu; Wei Liu; Ian Campbell; Andrew Cooper;
> xen-devel@lists.xen.org; Stefano Stabellini; Zhiyuan Lv; Ian Jackson; Keir
> (Xen.org)
> Subject: Re: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter
> max_wp_ram_ranges.
> 
> On 22/02/16 17:01, Paul Durrant wrote:
> >> What you did in an earlier version of this series (correct me if I'm
> >> wrong) is to make a separate hypercall for memory, but still keep using
> >> the same internal implementation (i.e., still having a write_dm p2m type
> >> and using rangesets to determine which ioreq server to give the
> >> notification to).  But since the interface for memory looks exactly the
> >> same as the interface for MMIO, at some point this morphed back to "use
> >> xen_hvm_io_range but with a different range type (i.e., WP_MEM)".
> >>
> >
> > Yes, and that's now pointless since we're going to use purely p2m types for
> sending memory accesses to ioreq servers.
> >
> >> From an *interface* perspective that makes sense, because in both cases
> >> you want to be able to specify a domain, an ioreq server, and a range of
> >> physical addresses.  I don't have any objection to the change you made
> >> to hvm_op.h in this version of the series.
> >>
> >
> > No, if we are intercepting 'memory' purely by p2m type then there is no
> need for the additional range type.
> 
> So here seems to be the crux of our disagreement.
> 
> I don't understand why you think that the WP_MEM interface described in
> patch 2 of this series can't be implemented using p2m types rather than
> rangesets.
> 

Huh? I just said that's exactly how I intend to implement it:

"if we are intercepting 'memory' purely by p2m type then there is no need for the additional range type"

I.e. we will not intercept memory accesses by range, we will only intercept them by type.

  Paul

>  -George
diff mbox

Patch

diff --git a/docs/man/xl.cfg.pod.5 b/docs/man/xl.cfg.pod.5
index 8899f75..c294fd3 100644
--- a/docs/man/xl.cfg.pod.5
+++ b/docs/man/xl.cfg.pod.5
@@ -962,6 +962,32 @@  FIFO-based event channel ABI support up to 131,071 event channels.
 Other guests are limited to 4095 (64-bit x86 and ARM) or 1023 (32-bit
 x86).
 
+=item B<max_wp_ram_ranges=N>
+
+Limit the maximum write-protected ram ranges that can be tracked
+inside one ioreq server rangeset.
+
+Ioreq server uses a group of rangesets to track the I/O or memory
+resources to be emulated. Default limit of ranges that one rangeset
+can allocate is set to a small value, due to the fact that these
+ranges are allocated in xen heap. Yet for the write-protected ram
+ranges, there are circumstances under which the upper limit inside
+one rangeset should exceed the default one. E.g. in Intel GVT-g,
+when tracking the PPGTT (per-process graphic translation tables) on
+Intel broadwell platforms, the number of page tables concerned will
+be of several thousand.
+
+For Intel GVT-g broadwell platform, 8192 is a suggested value for
+this parameter in most cases. But users who set this item explicitly
+are also supposed to know the specific scenarios that necessitate
+this configuration. Especially when this parameter is used:
+1> for virtual devices other than vGPU in GVT-g;
+2> for GVT-g, there also might be some extreme cases, e.g. too many
+graphic related applications in one VM, which create a great deal of
+per-process graphic translation tables;
+3> for GVT-g, future cpu platforms which provide even more per-process
+graphic translation tables.
+
 =back
 
 =head2 Paravirtualised (PV) Guest Specific Options
diff --git a/tools/libxl/libxl.h b/tools/libxl/libxl.h
index fa87f53..18828c5 100644
--- a/tools/libxl/libxl.h
+++ b/tools/libxl/libxl.h
@@ -136,6 +136,11 @@ 
 #define LIBXL_HAVE_BUILDINFO_EVENT_CHANNELS 1
 
 /*
+ * libxl_domain_build_info has the u.hvm.max_wp_ram_ranges field.
+ */
+#define LIBXL_HAVE_BUILDINFO_HVM_MAX_WP_RAM_RANGES 1
+
+/*
  * libxl_domain_build_info has the u.hvm.ms_vm_genid field.
  */
 #define LIBXL_HAVE_BUILDINFO_HVM_MS_VM_GENID 1
diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c
index 2269998..54173cb 100644
--- a/tools/libxl/libxl_dom.c
+++ b/tools/libxl/libxl_dom.c
@@ -288,6 +288,9 @@  static void hvm_set_conf_params(xc_interface *handle, uint32_t domid,
                     libxl_defbool_val(info->u.hvm.nested_hvm));
     xc_hvm_param_set(handle, domid, HVM_PARAM_ALTP2M,
                     libxl_defbool_val(info->u.hvm.altp2m));
+    if (info->u.hvm.max_wp_ram_ranges > 0)
+        xc_hvm_param_set(handle, domid, HVM_PARAM_MAX_WP_RAM_RANGES,
+                        info->u.hvm.max_wp_ram_ranges);
 }
 
 int libxl__build_pre(libxl__gc *gc, uint32_t domid,
diff --git a/tools/libxl/libxl_types.idl b/tools/libxl/libxl_types.idl
index 9ad7eba..9185014 100644
--- a/tools/libxl/libxl_types.idl
+++ b/tools/libxl/libxl_types.idl
@@ -518,6 +518,7 @@  libxl_domain_build_info = Struct("domain_build_info",[
                                        ("serial_list",      libxl_string_list),
                                        ("rdm", libxl_rdm_reserve),
                                        ("rdm_mem_boundary_memkb", MemKB),
+                                       ("max_wp_ram_ranges", uint64),
                                        ])),
                  ("pv", Struct(None, [("kernel", string),
                                       ("slack_memkb", MemKB),
diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c
index 25507c7..0c19dee 100644
--- a/tools/libxl/xl_cmdimpl.c
+++ b/tools/libxl/xl_cmdimpl.c
@@ -35,6 +35,7 @@ 
 #include <inttypes.h>
 #include <limits.h>
 #include <xen/hvm/e820.h>
+#include <xenctrl.h>
 
 #include "libxl.h"
 #include "libxl_utils.h"
@@ -1626,6 +1627,22 @@  static void parse_config_data(const char *config_source,
 
         if (!xlu_cfg_get_long (config, "rdm_mem_boundary", &l, 0))
             b_info->u.hvm.rdm_mem_boundary_memkb = l * 1024;
+
+        if (!xlu_cfg_get_long (config, "max_wp_ram_ranges", &l, 0)) {
+            uint64_t nr_pages = (b_info->max_memkb << 10) >> XC_PAGE_SHIFT;
+
+            /* Due to rangeset's ability to combine continuous ranges, this
+             * parameter shall not be configured with values greater than half
+             * of the number of VM's page frames. It also shall not exceed 4G,
+             * because of the limitation from the rangeset side. */
+            if (l > (nr_pages / 2) || l > UINT32_MAX) {
+                fprintf(stderr, "ERROR: Invalid value for \"max_wp_ram_ranges\". "
+                        "Shall not exceed %ld or 4G.\n", nr_pages / 2);
+                exit(1);
+            }
+            b_info->u.hvm.max_wp_ram_ranges = l;
+        }
+
         break;
     case LIBXL_DOMAIN_TYPE_PV:
     {
diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index e0d998f..7339833 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -940,6 +940,8 @@  static int hvm_ioreq_server_alloc_rangesets(struct hvm_ioreq_server *s,
 {
     unsigned int i;
     int rc;
+    unsigned int max_wp_ram_ranges =
+        s->domain->arch.hvm_domain.params[HVM_PARAM_MAX_WP_RAM_RANGES];
 
     if ( is_default )
         goto done;
@@ -962,7 +964,10 @@  static int hvm_ioreq_server_alloc_rangesets(struct hvm_ioreq_server *s,
         if ( !s->range[i] )
             goto fail;
 
-        rangeset_limit(s->range[i], MAX_NR_IO_RANGES);
+        rangeset_limit(s->range[i],
+                       ((i == HVMOP_IO_RANGE_WP_MEM) ?
+                        max_wp_ram_ranges :
+                        MAX_NR_IO_RANGES));
     }
 
  done:
diff --git a/xen/include/public/hvm/params.h b/xen/include/public/hvm/params.h
index 81f9451..ab3b11d 100644
--- a/xen/include/public/hvm/params.h
+++ b/xen/include/public/hvm/params.h
@@ -210,6 +210,9 @@ 
 /* Boolean: Enable altp2m */
 #define HVM_PARAM_ALTP2M       35
 
-#define HVM_NR_PARAMS          36
+/* Max write-protected ram ranges to be tracked in one ioreq server rangeset */
+#define HVM_PARAM_MAX_WP_RAM_RANGES  36
+
+#define HVM_NR_PARAMS          37
 
 #endif /* __XEN_PUBLIC_HVM_PARAMS_H__ */