| Message ID | 1454064314-7799-4-git-send-email-yu.c.zhang@linux.intel.com (mailing list archive) |
|---|---|
| State | New, archived |
>>> On 29.01.16 at 11:45, <yu.c.zhang@linux.intel.com> wrote:
> --- a/xen/arch/x86/hvm/hvm.c
> +++ b/xen/arch/x86/hvm/hvm.c
> @@ -940,6 +940,8 @@ static int hvm_ioreq_server_alloc_rangesets(struct hvm_ioreq_server *s,
> {
>     unsigned int i;
>     int rc;
> +    unsigned int max_wp_ram_ranges =
> +        s->domain->arch.hvm_domain.params[HVM_PARAM_MAX_WP_RAM_RANGES];

You're still losing the upper 32 bits here. Iirc you agreed to range
check the value before storing into params[]...

Jan
On 1/30/2016 12:33 AM, Jan Beulich wrote:
>>>> On 29.01.16 at 11:45, <yu.c.zhang@linux.intel.com> wrote:
>> --- a/xen/arch/x86/hvm/hvm.c
>> +++ b/xen/arch/x86/hvm/hvm.c
>> @@ -940,6 +940,8 @@ static int hvm_ioreq_server_alloc_rangesets(struct hvm_ioreq_server *s,
>> {
>>     unsigned int i;
>>     int rc;
>> +    unsigned int max_wp_ram_ranges =
>> +        s->domain->arch.hvm_domain.params[HVM_PARAM_MAX_WP_RAM_RANGES];
>
> You're still losing the upper 32 bits here. Iirc you agreed to range
> check the value before storing into params[]...
>
> Jan

Thanks, Jan. :)
In this version, the check is added in routine parse_config_data().
If option 'max_wp_ram_ranges' is configured with an unreasonable value,
the xl will terminate, before calling xc_hvm_param_set(). Does this
change meet your requirement? Or maybe did I have some misunderstanding
on this issue?

B.R.
Yu
>>> On 30.01.16 at 15:38, <yu.c.zhang@linux.intel.com> wrote:
> On 1/30/2016 12:33 AM, Jan Beulich wrote:
>>>>> On 29.01.16 at 11:45, <yu.c.zhang@linux.intel.com> wrote:
>>> --- a/xen/arch/x86/hvm/hvm.c
>>> +++ b/xen/arch/x86/hvm/hvm.c
>>> @@ -940,6 +940,8 @@ static int hvm_ioreq_server_alloc_rangesets(struct hvm_ioreq_server *s,
>>> {
>>>     unsigned int i;
>>>     int rc;
>>> +    unsigned int max_wp_ram_ranges =
>>> +        s->domain->arch.hvm_domain.params[HVM_PARAM_MAX_WP_RAM_RANGES];
>>
>> You're still losing the upper 32 bits here. Iirc you agreed to range
>> check the value before storing into params[]...
>
> Thanks, Jan. :)
> In this version, the check is added in routine parse_config_data().
> If option 'max_wp_ram_ranges' is configured with an unreasonable value,
> the xl will terminate, before calling xc_hvm_param_set(). Does this
> change meet your requirement? Or maybe did I have some misunderstanding
> on this issue?

Checking in the tools is desirable, but the hypervisor shouldn't rely
on any tool side checking.

Jan
On Fri, Jan 29, 2016 at 06:45:14PM +0800, Yu Zhang wrote:
> diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c
> index 25507c7..0c19dee 100644
> --- a/tools/libxl/xl_cmdimpl.c
> +++ b/tools/libxl/xl_cmdimpl.c
> @@ -35,6 +35,7 @@
>  #include <inttypes.h>
>  #include <limits.h>
>  #include <xen/hvm/e820.h>
> +#include <xenctrl.h>
>
>  #include "libxl.h"
>  #include "libxl_utils.h"
> @@ -1626,6 +1627,22 @@ static void parse_config_data(const char *config_source,
>
>      if (!xlu_cfg_get_long (config, "rdm_mem_boundary", &l, 0))
>          b_info->u.hvm.rdm_mem_boundary_memkb = l * 1024;
> +
> +    if (!xlu_cfg_get_long (config, "max_wp_ram_ranges", &l, 0)) {
> +        uint64_t nr_pages = (b_info->max_memkb << 10) >> XC_PAGE_SHIFT;
> +
> +        /* Due to rangeset's ability to combine continuous ranges, this
> +         * parameter shall not be configured with values greater than half
> +         * of the number of VM's page frames. It also shall not exceed 4G,
> +         * because of the limitation from the rangeset side. */
> +        if (l > (nr_pages / 2) || l > UINT32_MAX) {
> +            fprintf(stderr, "ERROR: Invalid value for \"max_wp_ram_ranges\". "
> +                    "Shall not exceed %ld or 4G.\n", nr_pages / 2);
> +            exit(1);
> +        }
> +        b_info->u.hvm.max_wp_ram_ranges = l;
> +    }
> +

Xl is only one of the applications that use libxl (the library). This
check should be inside libxl so that all applications (xl, libvirt and
others) have the same behaviour.

Take a look at initiate_domain_create where numerous validations are
done.

Wei.
On Mon, Feb 01, 2016 at 12:52:51AM -0700, Jan Beulich wrote:
> >>> On 30.01.16 at 15:38, <yu.c.zhang@linux.intel.com> wrote:
> > On 1/30/2016 12:33 AM, Jan Beulich wrote:
> >>>>> On 29.01.16 at 11:45, <yu.c.zhang@linux.intel.com> wrote:
> >>> --- a/xen/arch/x86/hvm/hvm.c
> >>> +++ b/xen/arch/x86/hvm/hvm.c
> >>> @@ -940,6 +940,8 @@ static int hvm_ioreq_server_alloc_rangesets(struct hvm_ioreq_server *s,
> >>> {
> >>>     unsigned int i;
> >>>     int rc;
> >>> +    unsigned int max_wp_ram_ranges =
> >>> +        s->domain->arch.hvm_domain.params[HVM_PARAM_MAX_WP_RAM_RANGES];
> >>
> >> You're still losing the upper 32 bits here. Iirc you agreed to range
> >> check the value before storing into params[]...
> >
> > Thanks, Jan. :)
> > In this version, the check is added in routine parse_config_data().
> > If option 'max_wp_ram_ranges' is configured with an unreasonable value,
> > the xl will terminate, before calling xc_hvm_param_set(). Does this
> > change meet your requirement? Or maybe did I have some misunderstanding
> > on this issue?
>
> Checking in the tools is desirable, but the hypervisor shouldn't rely
> on any tool side checking.

As in hypervisor needs to sanitise all input from toolstack? I don't
think Xen does that today.

What is the difference between this particular configuration option and
all other options in the same hvm_set_conf_params function?

Wei.
>>> On 01.02.16 at 13:02, <wei.liu2@citrix.com> wrote:
> On Mon, Feb 01, 2016 at 12:52:51AM -0700, Jan Beulich wrote:
>> >>> On 30.01.16 at 15:38, <yu.c.zhang@linux.intel.com> wrote:
>> > On 1/30/2016 12:33 AM, Jan Beulich wrote:
>> >>>>> On 29.01.16 at 11:45, <yu.c.zhang@linux.intel.com> wrote:
>> >>> --- a/xen/arch/x86/hvm/hvm.c
>> >>> +++ b/xen/arch/x86/hvm/hvm.c
>> >>> @@ -940,6 +940,8 @@ static int hvm_ioreq_server_alloc_rangesets(struct hvm_ioreq_server *s,
>> >>> {
>> >>>     unsigned int i;
>> >>>     int rc;
>> >>> +    unsigned int max_wp_ram_ranges =
>> >>> +        s->domain->arch.hvm_domain.params[HVM_PARAM_MAX_WP_RAM_RANGES];
>> >>
>> >> You're still losing the upper 32 bits here. Iirc you agreed to range
>> >> check the value before storing into params[]...
>> >
>> > Thanks, Jan. :)
>> > In this version, the check is added in routine parse_config_data().
>> > If option 'max_wp_ram_ranges' is configured with an unreasonable value,
>> > the xl will terminate, before calling xc_hvm_param_set(). Does this
>> > change meet your requirement? Or maybe did I have some misunderstanding
>> > on this issue?
>>
>> Checking in the tools is desirable, but the hypervisor shouldn't rely
>> on any tool side checking.
>
> As in hypervisor needs to sanitise all input from toolstack? I don't
> think Xen does that today.

If it doesn't, then that's a bug. Note that in many cases (domctl-s
and alike) such bogus trusting in the tool stack behaving correctly
is only not a security issue due to XSA-77. Yet with XSA-77 we
made quite clear that we shouldn't knowingly allow in further such
issues (it'll be hard enough to find and address all existing ones).

Jan
On Mon, Feb 01, 2016 at 05:15:16AM -0700, Jan Beulich wrote:
> >>> On 01.02.16 at 13:02, <wei.liu2@citrix.com> wrote:
> > On Mon, Feb 01, 2016 at 12:52:51AM -0700, Jan Beulich wrote:
> >> >>> On 30.01.16 at 15:38, <yu.c.zhang@linux.intel.com> wrote:
> >> > On 1/30/2016 12:33 AM, Jan Beulich wrote:
> >> >>>>> On 29.01.16 at 11:45, <yu.c.zhang@linux.intel.com> wrote:
> >> >>> --- a/xen/arch/x86/hvm/hvm.c
> >> >>> +++ b/xen/arch/x86/hvm/hvm.c
> >> >>> @@ -940,6 +940,8 @@ static int hvm_ioreq_server_alloc_rangesets(struct hvm_ioreq_server *s,
> >> >>> {
> >> >>>     unsigned int i;
> >> >>>     int rc;
> >> >>> +    unsigned int max_wp_ram_ranges =
> >> >>> +        s->domain->arch.hvm_domain.params[HVM_PARAM_MAX_WP_RAM_RANGES];
> >> >>
> >> >> You're still losing the upper 32 bits here. Iirc you agreed to range
> >> >> check the value before storing into params[]...
> >> >
> >> > Thanks, Jan. :)
> >> > In this version, the check is added in routine parse_config_data().
> >> > If option 'max_wp_ram_ranges' is configured with an unreasonable value,
> >> > the xl will terminate, before calling xc_hvm_param_set(). Does this
> >> > change meet your requirement? Or maybe did I have some misunderstanding
> >> > on this issue?
> >>
> >> Checking in the tools is desirable, but the hypervisor shouldn't rely
> >> on any tool side checking.
> >
> > As in hypervisor needs to sanitise all input from toolstack? I don't
> > think Xen does that today.
>
> If it doesn't, then that's a bug. Note that in many cases (domctl-s
> and alike) such bogus trusting in the tool stack behaving correctly
> is only not a security issue due to XSA-77. Yet with XSA-77 we
> made quite clear that we shouldn't knowingly allow in further such
> issues (it'll be hard enough to find and address all existing ones).

So are you suggesting pulling the check done in toolstack into
hypervisor?

Wei.
>>> On 01.02.16 at 13:49, <wei.liu2@citrix.com> wrote:
> On Mon, Feb 01, 2016 at 05:15:16AM -0700, Jan Beulich wrote:
>> >>> On 01.02.16 at 13:02, <wei.liu2@citrix.com> wrote:
>> > On Mon, Feb 01, 2016 at 12:52:51AM -0700, Jan Beulich wrote:
>> >> >>> On 30.01.16 at 15:38, <yu.c.zhang@linux.intel.com> wrote:
>> >> > On 1/30/2016 12:33 AM, Jan Beulich wrote:
>> >> >>>>> On 29.01.16 at 11:45, <yu.c.zhang@linux.intel.com> wrote:
>> >> >>> --- a/xen/arch/x86/hvm/hvm.c
>> >> >>> +++ b/xen/arch/x86/hvm/hvm.c
>> >> >>> @@ -940,6 +940,8 @@ static int hvm_ioreq_server_alloc_rangesets(struct hvm_ioreq_server *s,
>> >> >>> {
>> >> >>>     unsigned int i;
>> >> >>>     int rc;
>> >> >>> +    unsigned int max_wp_ram_ranges =
>> >> >>> +        s->domain->arch.hvm_domain.params[HVM_PARAM_MAX_WP_RAM_RANGES];
>> >> >>
>> >> >> You're still losing the upper 32 bits here. Iirc you agreed to range
>> >> >> check the value before storing into params[]...
>> >> >
>> >> > Thanks, Jan. :)
>> >> > In this version, the check is added in routine parse_config_data().
>> >> > If option 'max_wp_ram_ranges' is configured with an unreasonable value,
>> >> > the xl will terminate, before calling xc_hvm_param_set(). Does this
>> >> > change meet your requirement? Or maybe did I have some misunderstanding
>> >> > on this issue?
>> >>
>> >> Checking in the tools is desirable, but the hypervisor shouldn't rely
>> >> on any tool side checking.
>> >
>> > As in hypervisor needs to sanitise all input from toolstack? I don't
>> > think Xen does that today.
>>
>> If it doesn't, then that's a bug. Note that in many cases (domctl-s
>> and alike) such bogus trusting in the tool stack behaving correctly
>> is only not a security issue due to XSA-77. Yet with XSA-77 we
>> made quite clear that we shouldn't knowingly allow in further such
>> issues (it'll be hard enough to find and address all existing ones).
>
> So are you suggesting pulling the check done in toolstack into
> hypervisor?

I think the check in the tools should stay (allowing for a
distinguishable error message to be issued); all I'm saying is
that doing the check in the tools is not enough.

Jan
On 2/1/2016 9:07 PM, Jan Beulich wrote:
>>>> On 01.02.16 at 13:49, <wei.liu2@citrix.com> wrote:
>> On Mon, Feb 01, 2016 at 05:15:16AM -0700, Jan Beulich wrote:
>>>>>> On 01.02.16 at 13:02, <wei.liu2@citrix.com> wrote:
>>>> On Mon, Feb 01, 2016 at 12:52:51AM -0700, Jan Beulich wrote:
>>>>>>>> On 30.01.16 at 15:38, <yu.c.zhang@linux.intel.com> wrote:
>>>>>> On 1/30/2016 12:33 AM, Jan Beulich wrote:
>>>>>>>>>> On 29.01.16 at 11:45, <yu.c.zhang@linux.intel.com> wrote:
>>>>>>>> --- a/xen/arch/x86/hvm/hvm.c
>>>>>>>> +++ b/xen/arch/x86/hvm/hvm.c
>>>>>>>> @@ -940,6 +940,8 @@ static int hvm_ioreq_server_alloc_rangesets(struct hvm_ioreq_server *s,
>>>>>>>> {
>>>>>>>>     unsigned int i;
>>>>>>>>     int rc;
>>>>>>>> +    unsigned int max_wp_ram_ranges =
>>>>>>>> +        s->domain->arch.hvm_domain.params[HVM_PARAM_MAX_WP_RAM_RANGES];
>>>>>>>
>>>>>>> You're still losing the upper 32 bits here. Iirc you agreed to range
>>>>>>> check the value before storing into params[]...
>>>>>>
>>>>>> Thanks, Jan. :)
>>>>>> In this version, the check is added in routine parse_config_data().
>>>>>> If option 'max_wp_ram_ranges' is configured with an unreasonable value,
>>>>>> the xl will terminate, before calling xc_hvm_param_set(). Does this
>>>>>> change meet your requirement? Or maybe did I have some misunderstanding
>>>>>> on this issue?
>>>>>
>>>>> Checking in the tools is desirable, but the hypervisor shouldn't rely
>>>>> on any tool side checking.
>>>>
>>>> As in hypervisor needs to sanitise all input from toolstack? I don't
>>>> think Xen does that today.
>>>
>>> If it doesn't, then that's a bug. Note that in many cases (domctl-s
>>> and alike) such bogus trusting in the tool stack behaving correctly
>>> is only not a security issue due to XSA-77. Yet with XSA-77 we
>>> made quite clear that we shouldn't knowingly allow in further such
>>> issues (it'll be hard enough to find and address all existing ones).
>>
>> So are you suggesting pulling the check done in toolstack into
>> hypervisor?
>
> I think the check in the tools should stay (allowing for a
> distinguishable error message to be issued); all I'm saying is
> that doing the check in the tools is not enough.
>
> Jan

Thank you Jan and Wei. And sorry for the late response.
But I still do not quite understand. :)
If tool stack can guarantee the validity of a parameter,
under which circumstances will hypervisor be threatened?
I'm not familiar with XSA-77, and I'll read it ASAP.

B.R.
Yu

> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel
On 2/1/2016 7:57 PM, Wei Liu wrote:
> On Fri, Jan 29, 2016 at 06:45:14PM +0800, Yu Zhang wrote:
>> diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c
>> index 25507c7..0c19dee 100644
>> --- a/tools/libxl/xl_cmdimpl.c
>> +++ b/tools/libxl/xl_cmdimpl.c
>> @@ -35,6 +35,7 @@
>>  #include <inttypes.h>
>>  #include <limits.h>
>>  #include <xen/hvm/e820.h>
>> +#include <xenctrl.h>
>>
>>  #include "libxl.h"
>>  #include "libxl_utils.h"
>> @@ -1626,6 +1627,22 @@ static void parse_config_data(const char *config_source,
>>
>>      if (!xlu_cfg_get_long (config, "rdm_mem_boundary", &l, 0))
>>          b_info->u.hvm.rdm_mem_boundary_memkb = l * 1024;
>> +
>> +    if (!xlu_cfg_get_long (config, "max_wp_ram_ranges", &l, 0)) {
>> +        uint64_t nr_pages = (b_info->max_memkb << 10) >> XC_PAGE_SHIFT;
>> +
>> +        /* Due to rangeset's ability to combine continuous ranges, this
>> +         * parameter shall not be configured with values greater than half
>> +         * of the number of VM's page frames. It also shall not exceed 4G,
>> +         * because of the limitation from the rangeset side. */
>> +        if (l > (nr_pages / 2) || l > UINT32_MAX) {
>> +            fprintf(stderr, "ERROR: Invalid value for \"max_wp_ram_ranges\". "
>> +                    "Shall not exceed %ld or 4G.\n", nr_pages / 2);
>> +            exit(1);
>> +        }
>> +        b_info->u.hvm.max_wp_ram_ranges = l;
>> +    }
>> +
>
> Xl is only one of the applications that use libxl (the library). This
> check should be inside libxl so that all applications (xl, libvirt and
> others) have the same behaviour.
>
> Take a look at initiate_domain_create where numerous validations are
> done.
>
> Wei.

Thank you, Wei. I'll try to move this part into initiate_domain_create().

B.R.
Yu
>>> On 01.02.16 at 16:14, <yu.c.zhang@linux.intel.com> wrote:
> But I still do not quite understand. :)
> If tool stack can guarantee the validity of a parameter,
> under which circumstances will hypervisor be threatened?

At least in disaggregated environments the hypervisor cannot
trust the (parts of the) tool stack(s) living outside of Dom0. But
even without disaggregation in mind it is bad practice to have
the hypervisor assume the tool stack will only pass sane values.
Just at the example of the param you're introducing: you don't
even do the validation in libxc, so any (theoretical) tool stack
not based on xl/libxl would not be guaranteed to pass a sane
value. And even if you moved it into libxc, one could still argue
that there could be an even more theoretical tool stack not even
building on top of libxc.

Jan
On 2/1/2016 11:14 PM, Yu, Zhang wrote:
> On 2/1/2016 9:07 PM, Jan Beulich wrote:
>>>>> On 01.02.16 at 13:49, <wei.liu2@citrix.com> wrote:
>>> On Mon, Feb 01, 2016 at 05:15:16AM -0700, Jan Beulich wrote:
>>>>>>> On 01.02.16 at 13:02, <wei.liu2@citrix.com> wrote:
>>>>> On Mon, Feb 01, 2016 at 12:52:51AM -0700, Jan Beulich wrote:
>>>>>>>>> On 30.01.16 at 15:38, <yu.c.zhang@linux.intel.com> wrote:
>>>>>>> On 1/30/2016 12:33 AM, Jan Beulich wrote:
>>>>>>>>>>> On 29.01.16 at 11:45, <yu.c.zhang@linux.intel.com> wrote:
>>>>>>>>> --- a/xen/arch/x86/hvm/hvm.c
>>>>>>>>> +++ b/xen/arch/x86/hvm/hvm.c
>>>>>>>>> @@ -940,6 +940,8 @@ static int hvm_ioreq_server_alloc_rangesets(struct hvm_ioreq_server *s,
>>>>>>>>> {
>>>>>>>>>     unsigned int i;
>>>>>>>>>     int rc;
>>>>>>>>> +    unsigned int max_wp_ram_ranges =
>>>>>>>>> +        s->domain->arch.hvm_domain.params[HVM_PARAM_MAX_WP_RAM_RANGES];
>>>>>>>>
>>>>>>>> You're still losing the upper 32 bits here. Iirc you agreed to range
>>>>>>>> check the value before storing into params[]...
>>>>>>>
>>>>>>> Thanks, Jan. :)
>>>>>>> In this version, the check is added in routine parse_config_data().
>>>>>>> If option 'max_wp_ram_ranges' is configured with an unreasonable value,
>>>>>>> the xl will terminate, before calling xc_hvm_param_set(). Does this
>>>>>>> change meet your requirement? Or maybe did I have some misunderstanding
>>>>>>> on this issue?
>>>>>>
>>>>>> Checking in the tools is desirable, but the hypervisor shouldn't rely
>>>>>> on any tool side checking.
>>>>>
>>>>> As in hypervisor needs to sanitise all input from toolstack? I don't
>>>>> think Xen does that today.
>>>>
>>>> If it doesn't, then that's a bug. Note that in many cases (domctl-s
>>>> and alike) such bogus trusting in the tool stack behaving correctly
>>>> is only not a security issue due to XSA-77. Yet with XSA-77 we
>>>> made quite clear that we shouldn't knowingly allow in further such
>>>> issues (it'll be hard enough to find and address all existing ones).
>>>
>>> So are you suggesting pulling the check done in toolstack into
>>> hypervisor?
>>
>> I think the check in the tools should stay (allowing for a
>> distinguishable error message to be issued); all I'm saying is
>> that doing the check in the tools is not enough.
>>
>> Jan
>
> Thank you Jan and Wei. And sorry for the late response.
> But I still do not quite understand. :)
> If tool stack can guarantee the validity of a parameter,
> under which circumstances will hypervisor be threatened?
> I'm not familiar with XSA-77, and I'll read it ASAP.
>
> B.R.
> Yu

Sorry to bother you, Jan. After a second thought, I guess one of the
security concerns is when some app tries to trigger HVMOP_set_param
directly with some illegal values. So we also need to validate this
param in hvm_allow_set_param; although hvm_allow_set_param currently
performs no validation for the other parameters, we need to do this
for the new ones. Is this understanding correct?

Another question is: as to the tool stack side, do you think an error
message would suffice? Shouldn't xl be terminated?

Thanks
Yu
On 2/2/2016 12:16 AM, Jan Beulich wrote:
>>>> On 01.02.16 at 16:14, <yu.c.zhang@linux.intel.com> wrote:
>> But I still do not quite understand. :)
>> If tool stack can guarantee the validity of a parameter,
>> under which circumstances will hypervisor be threatened?
>
> At least in disaggregated environments the hypervisor cannot
> trust the (parts of the) tool stack(s) living outside of Dom0. But
> even without disaggregation in mind it is bad practice to have
> the hypervisor assume the tool stack will only pass sane values.
> Just at the example of the param you're introducing: you don't
> even do the validation in libxc, so any (theoretical) tool stack
> not based on xl/libxl would not be guaranteed to pass a sane
> value. And even if you moved it into libxc, one could still argue
> that there could be an even more theoretical tool stack not even
> building on top of libxc.
>
> Jan

Great. Thank you very much for your patience to explain.
Just sent out another mail about my understanding a moment ago; seems
I partially get it. :)
My vnc connection is too slow, will change the code tomorrow.

Yu
>>> On 01.02.16 at 17:19, <yu.c.zhang@linux.intel.com> wrote:
> After a second thought, I guess one of the security concern
> is when some APP is trying to trigger the HVMOP_set_param
> directly with some illegal values.

Not sure what "directly" is supposed to mean here.

> So, we need also validate this param in hvm_allow_set_param,
> current although hvm_allow_set_param has not performed any
> validation other parameters. We need to do this for the new
> ones. Is this understanding correct?

Yes.

> Another question is: as to the tool stack side, do you think
> an error message would suffice? Shouldn't xl be terminated?

I have no idea what consistent behavior in such a case would
be - I'll defer input on this to the tool stack maintainers.

Jan
On 2/2/2016 12:35 AM, Jan Beulich wrote:
>>>> On 01.02.16 at 17:19, <yu.c.zhang@linux.intel.com> wrote:
>> After a second thought, I guess one of the security concern
>> is when some APP is trying to trigger the HVMOP_set_param
>> directly with some illegal values.
>
> Not sure what "directly" is supposed to mean here.

I mean with no validation by itself, like libxc...

>> So, we need also validate this param in hvm_allow_set_param,
>> current although hvm_allow_set_param has not performed any
>> validation other parameters. We need to do this for the new
>> ones. Is this understanding correct?
>
> Yes.
>
>> Another question is: as to the tool stack side, do you think
>> an error message would suffice? Shouldn't xl be terminated?
>
> I have no idea what consistent behavior in such a case would
> be - I'll defer input on this to the tool stack maintainers.

Thank you.
Wei, which one do you prefer?

Yu
Yu, Zhang writes ("Re: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter max_wp_ram_ranges."):
> On 2/2/2016 12:35 AM, Jan Beulich wrote:
>> On 01.02.16 at 17:19, <yu.c.zhang@linux.intel.com> wrote:
>>> So, we need also validate this param in hvm_allow_set_param,
>>> current although hvm_allow_set_param has not performed any
>>> validation other parameters. We need to do this for the new
>>> ones. Is this understanding correct?
>>
>> Yes.
>>
>>> Another question is: as to the tool stack side, do you think
>>> an error message would suffice? Shouldn't xl be terminated?
>>
>> I have no idea what consistent behavior in such a case would
>> be - I'll defer input on this to the tool stack maintainers.
>
> Thank you.
> Wei, which one do you prefer?

I think that arrangements should be made for the hypercall failure to
be properly reported to the caller, and properly logged.

I don't think it is desirable to duplicate the sanity check in
xl/libxl/libxc. That would simply result in there being two limits to
update.

I have to say, though, that the situation with this parameter seems
quite unsatisfactory. It seems to be a kind of bodge.

The changeable limit is there to prevent excessive resource usage by a
guest. But the docs suggest that the excessive usage might be normal.
That sounds like a suboptimal design to me.

For reference, here is the documentation proposed in this patch:

=item B<max_wp_ram_ranges=N>

Limit the maximum write-protected ram ranges that can be tracked
inside one ioreq server rangeset.

Ioreq server uses a group of rangesets to track the I/O or memory
resources to be emulated. The default limit of ranges that one rangeset
can allocate is set to a small value, due to the fact that these
ranges are allocated in the xen heap. Yet for the write-protected ram
ranges, there are circumstances under which the upper limit inside
one rangeset should exceed the default one. E.g. in Intel GVT-g,
when tracking the PPGTT (per-process graphic translation tables) on
Intel Broadwell platforms, the number of page tables concerned will
be of several thousand.

For the Intel GVT-g Broadwell platform, 8192 is a suggested value for
this parameter in most cases. But users who set this item explicitly
are also supposed to know the specific scenarios that necessitate
this configuration. Especially when this parameter is used:

1> for virtual devices other than vGPU in GVT-g;
2> for GVT-g, there also might be some extreme cases, e.g. too many
graphic related applications in one VM, which create a great deal of
per-process graphic translation tables;
3> for GVT-g, future cpu platforms which provide even more per-process
graphic translation tables.

Having said that, if the hypervisor maintainers are happy with a
situation where this value is configured explicitly, and the
configurations where a non-default value is required are expected to be
rare, then I guess we can live with it.

Ian.
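[Editorial note: for concreteness, the documented parameter would appear in an xl domain config roughly as below. This is a sketch assuming the patch is applied; the 8192 figure is the GVT-g Broadwell suggestion from the docs above, and the surrounding options are illustrative.]

```
# xl guest config fragment (illustrative)
builder = "hvm"
memory  = 4096

# Raise the per-rangeset limit on tracked write-protected ram ranges,
# e.g. for GVT-g PPGTT tracking on Broadwell:
max_wp_ram_ranges = 8192
```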
Thanks for your reply, Ian.

On 2/2/2016 1:05 AM, Ian Jackson wrote:
> Yu, Zhang writes ("Re: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter max_wp_ram_ranges."):
>> On 2/2/2016 12:35 AM, Jan Beulich wrote:
>>> On 01.02.16 at 17:19, <yu.c.zhang@linux.intel.com> wrote:
>>>> So, we need also validate this param in hvm_allow_set_param,
>>>> current although hvm_allow_set_param has not performed any
>>>> validation other parameters. We need to do this for the new
>>>> ones. Is this understanding correct?
>>>
>>> Yes.
>>>
>>>> Another question is: as to the tool stack side, do you think
>>>> an error message would suffice? Shouldn't xl be terminated?
>>>
>>> I have no idea what consistent behavior in such a case would
>>> be - I'll defer input on this to the tool stack maintainers.
>>
>> Thank you.
>> Wei, which one do you prefer?
>
> I think that arrangements should be made for the hypercall failure to
> be properly reported to the caller, and properly logged.
>
> I don't think it is desirable to duplicate the sanity check in
> xl/libxl/libxc. That would simply result in there being two limits to
> update.

Sorry, I do not follow. What does "being two limits to update" mean?

> I have to say, though, that the situation with this parameter seems
> quite unsatisfactory. It seems to be a kind of bodge.

By "situation with this parameter", do you mean:
a> the introduction of this parameter in the tool stack, or
b> the sanitizing for this parameter? (In fact I'd prefer not to treat
the check of this parameter as sanitizing, because it only checks the
input against 4G to avoid data loss in the uint64 to uint32 assignment
in hvm_ioreq_server_alloc_rangesets.)

> The changeable limit is there to prevent excessive resource usage by a
> guest. But the docs suggest that the excessive usage might be
> normal. That sounds like a suboptimal design to me.

Yes, there might be situations where this limit is set to some large
value. But I think that situation would be very rare. Like the docs
suggested, for XenGT, 8K is a big enough one for most cases.

> For reference, here is the docs proposed in this patch:
>
> =item B<max_wp_ram_ranges=N>
>
> Limit the maximum write-protected ram ranges that can be tracked
> inside one ioreq server rangeset.
>
> Ioreq server uses a group of rangesets to track the I/O or memory
> resources to be emulated. Default limit of ranges that one rangeset
> can allocate is set to a small value, due to the fact that these
> ranges are allocated in xen heap. Yet for the write-protected ram
> ranges, there are circumstances under which the upper limit inside
> one rangeset should exceed the default one. E.g. in Intel GVT-g,
> when tracking the PPGTT (per-process graphic translation tables) on
> Intel broadwell platforms, the number of page tables concerned will
> be of several thousand.
>
> For Intel GVT-g broadwell platform, 8192 is a suggested value for
> this parameter in most cases. But users who set this item explicitly
> are also supposed to know the specific scenarios that necessitate
> this configuration. Especially when this parameter is used:
> 1> for virtual devices other than vGPU in GVT-g;
> 2> for GVT-g, there also might be some extreme cases, e.g. too many
> graphic related applications in one VM, which create a great deal of
> per-process graphic translation tables;
> 3> for GVT-g, future cpu platforms which provide even more per-process
> graphic translation tables.
>
> Having said that, if the hypervisor maintainers are happy with a
> situation where this value is configured explicitly, and the
> configurations where a non-default value is required is expected to be
> rare, then I guess we can live with it.
>
> Ian.

Thanks
Yu
>>> On 01.02.16 at 18:05, <Ian.Jackson@eu.citrix.com> wrote:
> Having said that, if the hypervisor maintainers are happy with a
> situation where this value is configured explicitly, and the
> configurations where a non-default value is required is expected to be
> rare, then I guess we can live with it.

Well, from the very beginning I have been not very happy with
the introduction of this, and I still consider it half way acceptable
only because of not seeing any good alternative. If we look at
it strictly, it's in violation of the rule we set forth after XSA-77:
no introduction of new code making the system susceptible to
bad (malicious) tool stack behavior, and hence we should reject
it. Yet that would leave XenGT in a state where it would have no
perspective of ever getting merged, which doesn't seem very
desirable either.

Jan
On 2/2/2016 6:32 PM, Jan Beulich wrote:
>>>> On 01.02.16 at 18:05, <Ian.Jackson@eu.citrix.com> wrote:
>> Having said that, if the hypervisor maintainers are happy with a
>> situation where this value is configured explicitly, and the
>> configurations where a non-default value is required is expected to be
>> rare, then I guess we can live with it.
>
> Well, from the very beginning I have been not very happy with
> the introduction of this, and I still consider it half way acceptable
> only because of not seeing any good alternative. If we look at
> it strictly, it's in violation of the rule we set forth after XSA-77:
> No introduction of new code making the system susceptible to
> bad (malicious) tool stack behavior, and hence we should reject
> it. Yet that would leave XenGT in a state where it would have no
> perspective of ever getting merged, which doesn't seem very
> desirable either.
>
> Jan

Thanks, Jan.
I understand your concern, and to be honest, I do not think this is an
optimal solution. But I also have no better idea in mind. :(
Another option may be: instead of opening this parameter to the tool
stack, we use a XenGT flag, which set the rangeset limit to a default
value. But like I said, this default value may not always work on
future XenGT platforms.

B.R.
Yu
>>> On 02.02.16 at 11:56, <yu.c.zhang@linux.intel.com> wrote:
> I understand your concern, and to be honest, I do not think
> this is an optimal solution. But I also have no better idea
> in mind. :(
> Another option may be: instead of opening this parameter to
> the tool stack, we use a XenGT flag, which set the rangeset
> limit to a default value. But like I said, this default value
> may not always work on future XenGT platforms.

Assuming that you think of something set e.g. by hypervisor command
line option: How would that work? I.e. how would that limit the
resource use for all VMs not using XenGT? Or if you mean a flag
settable in the domain config - how would you avoid a malicious admin
setting this flag for all the VMs created in the controlled partition
of the system?

Jan
On 02/02/16 10:32, Jan Beulich wrote: >>>> On 01.02.16 at 18:05, <Ian.Jackson@eu.citrix.com> wrote: >> Having said that, if the hypervisor maintainers are happy with a >> situation where this value is configured explicitly, and the >> configurations where a non-default value is required is expected to be >> rare, then I guess we can live with it. > Well, from the very beginning I have been not very happy with > the introduction of this, and I still consider it half way acceptable > only because of not seeing any good alternative. If we look at > it strictly, it's in violation of the rule we set forth after XSA-77: > No introduction of new code making the system susceptible to > bad (malicious) tool stack behavior Let's take a step back here. If your toolstack is malicious, you have already lost. Coding Xen around this is a waste of time. The XSM case is for splitting out some of the privileged domain's responsibilities to less privileged domains. In these cases, we do indeed want to ensure that the somewhat-privileged entity cannot abuse anything outside its area of privilege. This specific issue concerns resource allocation during domain building and is an area which can never ever be given to a less privileged entity. ~Andrew
>>> On 02.02.16 at 12:31, <andrew.cooper3@citrix.com> wrote: > This specific issue concerns resource allocation during domain building > and is an area which can never ever be given to a less privileged entity. Which is because of ...? (And if so, why would we have put XEN_DOMCTL_createdomain on the XSA-77 waiver list?) Jan
On Tue, Feb 02, 2016 at 04:04:14PM +0800, Yu, Zhang wrote: > Thanks for your reply, Ian. > > On 2/2/2016 1:05 AM, Ian Jackson wrote: > >Yu, Zhang writes ("Re: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter max_wp_ram_ranges."): > >>On 2/2/2016 12:35 AM, Jan Beulich wrote: > >>>On 01.02.16 at 17:19, <yu.c.zhang@linux.intel.com> wrote: > >>>>So, we need also validate this param in hvm_allow_set_param, > >>>>current although hvm_allow_set_param has not performed any > >>>>validation other parameters. We need to do this for the new > >>>>ones. Is this understanding correct? > >>> > >>>Yes. > >>> > >>>>Another question is: as to the tool stack side, do you think > >>>>an error message would suffice? Shouldn't xl be terminated? > >>> > >>>I have no idea what consistent behavior in such a case would > >>>be - I'll defer input on this to the tool stack maintainers. > >> > >>Thank you. > >>Wei, which one do you prefer? > > > >I think that arrangements should be made for the hypercall failure to > >be properly reported to the caller, and properly logged. > > > >I don't think it is desirable to duplicate the sanity check in > >xl/libxl/libxc. That would simply result in there being two limits to > >update. > > > > Sorry, I do not follow. What does "being two limits to update" mean? > I can't speak for Ian, but my understanding is that if the code logic is duplicated in several places, you need to update all of them whenever you change the logic. But he has expressed a view on whether this is a blocker for this series, so I will let him clarify. Wei.
On 2/2/2016 7:51 PM, Wei Liu wrote: > On Tue, Feb 02, 2016 at 04:04:14PM +0800, Yu, Zhang wrote: >> Thanks for your reply, Ian. >> >> On 2/2/2016 1:05 AM, Ian Jackson wrote: >>> Yu, Zhang writes ("Re: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter max_wp_ram_ranges."): >>>> On 2/2/2016 12:35 AM, Jan Beulich wrote: >>>>> On 01.02.16 at 17:19, <yu.c.zhang@linux.intel.com> wrote: >>>>>> So, we need also validate this param in hvm_allow_set_param, >>>>>> current although hvm_allow_set_param has not performed any >>>>>> validation other parameters. We need to do this for the new >>>>>> ones. Is this understanding correct? >>>>> >>>>> Yes. >>>>> >>>>>> Another question is: as to the tool stack side, do you think >>>>>> an error message would suffice? Shouldn't xl be terminated? >>>>> >>>>> I have no idea what consistent behavior in such a case would >>>>> be - I'll defer input on this to the tool stack maintainers. >>>> >>>> Thank you. >>>> Wei, which one do you prefer? >>> >>> I think that arrangements should be made for the hypercall failure to >>> be properly reported to the caller, and properly logged. >>> >>> I don't think it is desirable to duplicate the sanity check in >>> xl/libxl/libxc. That would simply result in there being two limits to >>> update. >>> >> >> Sorry, I do not follow. What does "being two limits to update" mean? >> > > I can't speak for Ian, but my understanding is that if the code logic is > duplicated in several places, you need to update all of them whenever > you change the logic. But, he has expressed if this is blocker for this > series, so I will let him clarify. Thank you, Wei. This explanation helps. :) B.R. Yu
On 2/2/2016 7:12 PM, Jan Beulich wrote: >>>> On 02.02.16 at 11:56, <yu.c.zhang@linux.intel.com> wrote: >> I understand your concern, and to be honest, I do not think >> this is an optimal solution. But I also have no better idea >> in mind. :( >> Another option may be: instead of opening this parameter to >> the tool stack, we use a XenGT flag, which set the rangeset >> limit to a default value. But like I said, this default value >> may not always work on future XenGT platforms. > > Assuming that you think of something set e.g. by hypervisor > command line option: How would that work? I.e. how would > that limit the resource use for all VMs not using XenGT? Or if > you mean a flag settable in the domain config - how would you > avoid a malicious admin to set this flag for all the VMs created > in the controlled partition of the system? > Well, I am not satisfied with this new parameter, because: 1> exposing an option like max_wp_ram_ranges to the user seems too detailed; 2> but if not, using a XenGT flag means it would be hard for the hypervisor to find a default value which can work in all situations theoretically, although in practice, 8K is already a big enough one. However, as to the security concern you raised, I cannot fully understand. :) E.g. I believe a malicious admin can also breach the system even without this patch. This argument may not be convincing to you, but as to this specific case, even if an admin set the XenGT flag for all VMs, what harm will this action do? It only means the ioreq server can allocate at most 8K ranges; will that consume all of the Xen heap, especially on 64-bit Xen? Anyway, despite different opinions, I still need to say thank you for your explanation. Upstreaming XenGT features is my task; it is painfully rewarding to receive suggestions from community maintainers, which help a newbie like me better understand virtualization technology. :) Thanks Yu
On 02/02/16 11:43, Jan Beulich wrote: >>>> On 02.02.16 at 12:31, <andrew.cooper3@citrix.com> wrote: >> This specific issue concerns resource allocation during domain building >> and is an area which can never ever be given to a less privileged entity. > Which is because of ...? (And if so, why would we have put > XEN_DOMCTL_createdomain on the XSA-77 waiver list?) That list came out of the blue as far as the Xen community went. The purpose of XEN_DOMCTL_createdomain is to mutate the set of valid identifiers in Xen on which XSM permissions are based, and any entity capable of making the hypercall can at the very least cause reuse of an existing identifier. For a different example, take XEN_DOMCTL_gdbsx_guestmemio. This hypercall specifically permits the caller to change arbitrary memory, including that of Xen itself. Neither of these two operations will ever be safe in the hands of anything but a fully privileged entity. Pretending otherwise isn't going to change this fact. ~Andrew
>>> On 02.02.16 at 15:01, <yu.c.zhang@linux.intel.com> wrote: > On 2/2/2016 7:12 PM, Jan Beulich wrote: >>>>> On 02.02.16 at 11:56, <yu.c.zhang@linux.intel.com> wrote: >>> I understand your concern, and to be honest, I do not think >>> this is an optimal solution. But I also have no better idea >>> in mind. :( >>> Another option may be: instead of opening this parameter to >>> the tool stack, we use a XenGT flag, which set the rangeset >>> limit to a default value. But like I said, this default value >>> may not always work on future XenGT platforms. >> >> Assuming that you think of something set e.g. by hypervisor >> command line option: How would that work? I.e. how would >> that limit the resource use for all VMs not using XenGT? Or if >> you mean a flag settable in the domain config - how would you >> avoid a malicious admin to set this flag for all the VMs created >> in the controlled partition of the system? > > Well, I am not satisfied with this new parameter, because: > 1> exposing an option like max_wp_ram_ranges to the user seems too > detailed; > 2> but if not, using a XenGT flag means it would be hard for hypervisor > to find a default value which can work in all situations theoretically, > although in practice, 8K is already a big enough one. > > However, as to the security concern you raised, I can not fully > understand. :) E.g. I believe a malicious admin can also breach the > system even without this patch. This argument may not be convincing to > you, but as to this specific case, even if an admin set XenGT flag to > all VMs, what harm will this action do? It only means the ioreq server > can at most allocate 8K ranges, will that consume all the Xen heaps, > especially for 64 bit Xen? First of all so far you meant to set a limit of 4G, which - taking a handful of domains - if fully used would take even a mid-size host out of memory. And then you have to consider bad effects resulting from Xen itself not normally having a lot of memory left (especially when "dom0_mem=" is not forcing most of the memory to be in Xen's hands), which may mean that one domain exhausting Xen's memory can affect another domain if Xen can't allocate memory it needs to support that other domain, in the worst case leading to a domain crash. And this all is still leaving aside Xen's own health... Jan
On 2/2/2016 10:42 PM, Jan Beulich wrote: >>>> On 02.02.16 at 15:01, <yu.c.zhang@linux.intel.com> wrote: >> On 2/2/2016 7:12 PM, Jan Beulich wrote: >>>>>> On 02.02.16 at 11:56, <yu.c.zhang@linux.intel.com> wrote: >>>> I understand your concern, and to be honest, I do not think >>>> this is an optimal solution. But I also have no better idea >>>> in mind. :( >>>> Another option may be: instead of opening this parameter to >>>> the tool stack, we use a XenGT flag, which set the rangeset >>>> limit to a default value. But like I said, this default value >>>> may not always work on future XenGT platforms. >>> >>> Assuming that you think of something set e.g. by hypervisor >>> command line option: How would that work? I.e. how would >>> that limit the resource use for all VMs not using XenGT? Or if >>> you mean a flag settable in the domain config - how would you >>> avoid a malicious admin to set this flag for all the VMs created >>> in the controlled partition of the system? >> >> Well, I am not satisfied with this new parameter, because: >> 1> exposing an option like max_wp_ram_ranges to the user seems too >> detailed; >> 2> but if not, using a XenGT flag means it would be hard for hypervisor >> to find a default value which can work in all situations theoretically, >> although in practice, 8K is already a big enough one. >> >> However, as to the security concern you raised, I can not fully >> understand. :) E.g. I believe a malicious admin can also breach the >> system even without this patch. This argument may not be convincing to >> you, but as to this specific case, even if an admin set XenGT flag to >> all VMs, what harm will this action do? It only means the ioreq server >> can at most allocate 8K ranges, will that consume all the Xen heaps, >> especially for 64 bit Xen? > > First of all so far you meant to set a limit of 4G, which - taking a > handful of domains - if fully used would take even a mid-size > host out of memory. And then you have to consider bad effects > resulting from Xen itself not normally having a lot of memory left > (especially when "dom0_mem=" is not forcing most of the memory > to be in Xen's hands), which may mean that one domain > exhausting Xen's memory can affect another domain if Xen can't > allocate memory it needs to support that other domain, in the > worst case leading to a domain crash. And this all is still leaving > aside Xen's own health... > Thanks, Jan. The limit of 4G is to avoid the data missing from uint64 to uint32 assignment. And I can accept the 8K limit for XenGT in practice. After all, it is vGPU page tables we are trying to trap and emulate, not normal page frames. And I guess the reason that one domain exhausting Xen's memory can affect another domain is because rangeset uses Xen heap, instead of the per-domain memory. So what about we use a 8K limit by now for XenGT, and in the future, if a per-domain memory allocation solution for rangeset is ready, we do need to limit the rangeset size. Does this sound more acceptable? B.R. Yu
On 2/2/2016 11:21 PM, Jan Beulich wrote: >>>> On 02.02.16 at 16:00, <yu.c.zhang@linux.intel.com> wrote: >> The limit of 4G is to avoid the data missing from uint64 to uint32 >> assignment. And I can accept the 8K limit for XenGT in practice. >> After all, it is vGPU page tables we are trying to trap and emulate, >> not normal page frames. >> >> And I guess the reason that one domain exhausting Xen's memory can >> affect another domain is because rangeset uses Xen heap, instead of the >> per-domain memory. So what about we use a 8K limit by now for XenGT, >> and in the future, if a per-domain memory allocation solution for >> rangeset is ready, we do need to limit the rangeset size. Does this >> sounds more acceptable? > > The lower the limit the better (but no matter how low the limit > it won't make this a pretty thing). Anyway I'd still like to wait > for what Ian may further say on this. > OK then. :) Ian, do you have any suggestions? Thanks Yu
>>> On 02.02.16 at 16:00, <yu.c.zhang@linux.intel.com> wrote: > The limit of 4G is to avoid the data missing from uint64 to uint32 > assignment. And I can accept the 8K limit for XenGT in practice. > After all, it is vGPU page tables we are trying to trap and emulate, > not normal page frames. > > And I guess the reason that one domain exhausting Xen's memory can > affect another domain is because rangeset uses Xen heap, instead of the > per-domain memory. So what about we use a 8K limit by now for XenGT, > and in the future, if a per-domain memory allocation solution for > rangeset is ready, we do need to limit the rangeset size. Does this > sounds more acceptable? The lower the limit the better (but no matter how low the limit it won't make this a pretty thing). Anyway I'd still like to wait for what Ian may further say on this. Jan
On 2/2/2016 11:21 PM, Jan Beulich wrote: >>>> On 02.02.16 at 16:00, <yu.c.zhang@linux.intel.com> wrote: >> The limit of 4G is to avoid the data missing from uint64 to uint32 >> assignment. And I can accept the 8K limit for XenGT in practice. >> After all, it is vGPU page tables we are trying to trap and emulate, >> not normal page frames. >> >> And I guess the reason that one domain exhausting Xen's memory can >> affect another domain is because rangeset uses Xen heap, instead of the >> per-domain memory. So what about we use a 8K limit by now for XenGT, >> and in the future, if a per-domain memory allocation solution for >> rangeset is ready, we do need to limit the rangeset size. Does this >> sounds more acceptable? > > The lower the limit the better (but no matter how low the limit > it won't make this a pretty thing). Anyway I'd still like to wait > for what Ian may further say on this. > Hi Jan, I just had a discussion with my colleague. We believe 8K can serve as the upper limit for the write-protected ram ranges. If in the future, the number of vGPU page tables exceeds this limit, we will modify our back-end device model to find a trade-off method, instead of extending this limit. If you can accept this value as the upper bound of the rangeset, maybe we do not need to add any tool stack parameters, but define a MAX_NR_WR_RAM_RANGES for the write-protected ram rangeset. As to other rangesets, we keep their limit as 256. Does this sound OK? :) B.R. Yu
>>> On 03.02.16 at 08:10, <yu.c.zhang@linux.intel.com> wrote: > On 2/2/2016 11:21 PM, Jan Beulich wrote: >>>>> On 02.02.16 at 16:00, <yu.c.zhang@linux.intel.com> wrote: >>> The limit of 4G is to avoid the data missing from uint64 to uint32 >>> assignment. And I can accept the 8K limit for XenGT in practice. >>> After all, it is vGPU page tables we are trying to trap and emulate, >>> not normal page frames. >>> >>> And I guess the reason that one domain exhausting Xen's memory can >>> affect another domain is because rangeset uses Xen heap, instead of the >>> per-domain memory. So what about we use a 8K limit by now for XenGT, >>> and in the future, if a per-domain memory allocation solution for >>> rangeset is ready, we do need to limit the rangeset size. Does this >>> sounds more acceptable? >> >> The lower the limit the better (but no matter how low the limit >> it won't make this a pretty thing). Anyway I'd still like to wait >> for what Ian may further say on this. >> > Hi Jan, I just had a discussion with my colleague. We believe 8K could > be the biggest limit for the write-protected ram ranges. If in the > future, number of vGPU page tables exceeds this limit, we will modify > our back-end device model to find a trade-off method, instead of > extending this limit. If you can accept this value as the upper bound > of rangeset, maybe we do not need to add any tool stack parameters, but > define a MAX_NR_WR_RAM_RANGES for the write-protected ram > rangesset. As > to other rangesets, we keep their limit as 256. Does this sounds OK? :) I'm getting the impression that we're moving in circles. A blanket limit above the 256 one for all domains is _not_ going to be acceptable; going to 8k will still need host admin consent. With your rangeset performance improvement patch, each range is going to be tracked by a 40 byte structure (up from 32), which already means an overhead increase for all the other ranges. 8k of wp ranges implies an overhead beyond 448k (including the xmalloc() overhead), which is not _that_ much, but also not negligible. Jan
> -----Original Message----- > From: Jan Beulich [mailto:JBeulich@suse.com] > Sent: 03 February 2016 08:33 > To: Zhang Yu > Cc: Andrew Cooper; Ian Campbell; Paul Durrant; Wei Liu; Ian Jackson; Stefano > Stabellini; Kevin Tian; zhiyuan.lv@intel.com; xen-devel@lists.xen.org; Keir > (Xen.org) > Subject: Re: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter > max_wp_ram_ranges. > > >>> On 03.02.16 at 08:10, <yu.c.zhang@linux.intel.com> wrote: > > On 2/2/2016 11:21 PM, Jan Beulich wrote: > >>>>> On 02.02.16 at 16:00, <yu.c.zhang@linux.intel.com> wrote: > >>> The limit of 4G is to avoid the data missing from uint64 to uint32 > >>> assignment. And I can accept the 8K limit for XenGT in practice. > >>> After all, it is vGPU page tables we are trying to trap and emulate, > >>> not normal page frames. > >>> > >>> And I guess the reason that one domain exhausting Xen's memory can > >>> affect another domain is because rangeset uses Xen heap, instead of > the > >>> per-domain memory. So what about we use a 8K limit by now for XenGT, > >>> and in the future, if a per-domain memory allocation solution for > >>> rangeset is ready, we do need to limit the rangeset size. Does this > >>> sounds more acceptable? > >> > >> The lower the limit the better (but no matter how low the limit > >> it won't make this a pretty thing). Anyway I'd still like to wait > >> for what Ian may further say on this. > >> > > Hi Jan, I just had a discussion with my colleague. We believe 8K could > > be the biggest limit for the write-protected ram ranges. If in the > > future, number of vGPU page tables exceeds this limit, we will modify > > our back-end device model to find a trade-off method, instead of > > extending this limit. If you can accept this value as the upper bound > > of rangeset, maybe we do not need to add any tool stack parameters, but > > define a MAX_NR_WR_RAM_RANGES for the write-protected ram > rangesset. As > > to other rangesets, we keep their limit as 256. Does this sounds OK? :) > > I'm getting the impression that we're moving in circles. A blanket > limit above the 256 one for all domains is _not_ going to be > acceptable; going to 8k will still need host admin consent. With > your rangeset performance improvement patch, each range is > going to be tracked by a 40 byte structure (up from 32), which > already means an overhead increase for all the other ranges. 8k > of wp ranges implies an overhead beyond 448k (including the > xmalloc() overhead), which is not _that_ much, but also not > negligible. > > ... which means we are still going to need a toolstack parameter to set the limit. We already have a parameter for VRAM size so is having a parameter for max. GTT shadow ranges such a bad thing? Is the fact that the memory comes from xenheap rather than domheap the real problem? Paul > > Jan
>>> On 03.02.16 at 13:20, <Paul.Durrant@citrix.com> wrote: >> -----Original Message----- >> From: Jan Beulich [mailto:JBeulich@suse.com] >> Sent: 03 February 2016 08:33 >> To: Zhang Yu >> Cc: Andrew Cooper; Ian Campbell; Paul Durrant; Wei Liu; Ian Jackson; Stefano >> Stabellini; Kevin Tian; zhiyuan.lv@intel.com; xen-devel@lists.xen.org; Keir >> (Xen.org) >> Subject: Re: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter >> max_wp_ram_ranges. >> >> >>> On 03.02.16 at 08:10, <yu.c.zhang@linux.intel.com> wrote: >> > On 2/2/2016 11:21 PM, Jan Beulich wrote: >> >>>>> On 02.02.16 at 16:00, <yu.c.zhang@linux.intel.com> wrote: >> >>> The limit of 4G is to avoid the data missing from uint64 to uint32 >> >>> assignment. And I can accept the 8K limit for XenGT in practice. >> >>> After all, it is vGPU page tables we are trying to trap and emulate, >> >>> not normal page frames. >> >>> >> >>> And I guess the reason that one domain exhausting Xen's memory can >> >>> affect another domain is because rangeset uses Xen heap, instead of >> the >> >>> per-domain memory. So what about we use a 8K limit by now for XenGT, >> >>> and in the future, if a per-domain memory allocation solution for >> >>> rangeset is ready, we do need to limit the rangeset size. Does this >> >>> sounds more acceptable? >> >> >> >> The lower the limit the better (but no matter how low the limit >> >> it won't make this a pretty thing). Anyway I'd still like to wait >> >> for what Ian may further say on this. >> >> >> > Hi Jan, I just had a discussion with my colleague. We believe 8K could >> > be the biggest limit for the write-protected ram ranges. If in the >> > future, number of vGPU page tables exceeds this limit, we will modify >> > our back-end device model to find a trade-off method, instead of >> > extending this limit. If you can accept this value as the upper bound >> > of rangeset, maybe we do not need to add any tool stack parameters, but >> > define a MAX_NR_WR_RAM_RANGES for the write-protected ram >> rangesset. As >> > to other rangesets, we keep their limit as 256. Does this sounds OK? :) >> >> I'm getting the impression that we're moving in circles. A blanket >> limit above the 256 one for all domains is _not_ going to be >> acceptable; going to 8k will still need host admin consent. With >> your rangeset performance improvement patch, each range is >> going to be tracked by a 40 byte structure (up from 32), which >> already means an overhead increase for all the other ranges. 8k >> of wp ranges implies an overhead beyond 448k (including the >> xmalloc() overhead), which is not _that_ much, but also not >> negligible. >> > > ... which means we are still going to need a toolstack parameter to set the > limit. We already have a parameter for VRAM size so is having a parameter for > max. GTT shadow ranges such a bad thing? It's workable, but not nice (see also Ian's earlier response). > Is the fact that the memory comes > from xenheap rather than domheap the real problem? Not the primary one, since except on huge memory machines both heaps are identical. To me the primary one is the quite more significant resource consumption in the first place (I'm not going to repeat what I've written in already way too many replies before). Jan
> -----Original Message----- > From: Jan Beulich [mailto:JBeulich@suse.com] > Sent: 03 February 2016 12:36 > To: Paul Durrant > Cc: Andrew Cooper; Ian Campbell; Ian Jackson; Stefano Stabellini; Wei Liu; > Kevin Tian; zhiyuan.lv@intel.com; Zhang Yu; xen-devel@lists.xen.org; Keir > (Xen.org) > Subject: RE: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter > max_wp_ram_ranges. > > >>> On 03.02.16 at 13:20, <Paul.Durrant@citrix.com> wrote: > >> -----Original Message----- > >> From: Jan Beulich [mailto:JBeulich@suse.com] > >> Sent: 03 February 2016 08:33 > >> To: Zhang Yu > >> Cc: Andrew Cooper; Ian Campbell; Paul Durrant; Wei Liu; Ian Jackson; > Stefano > >> Stabellini; Kevin Tian; zhiyuan.lv@intel.com; xen-devel@lists.xen.org; Keir > >> (Xen.org) > >> Subject: Re: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter > >> max_wp_ram_ranges. > >> > >> >>> On 03.02.16 at 08:10, <yu.c.zhang@linux.intel.com> wrote: > >> > On 2/2/2016 11:21 PM, Jan Beulich wrote: > >> >>>>> On 02.02.16 at 16:00, <yu.c.zhang@linux.intel.com> wrote: > >> >>> The limit of 4G is to avoid the data missing from uint64 to uint32 > >> >>> assignment. And I can accept the 8K limit for XenGT in practice. > >> >>> After all, it is vGPU page tables we are trying to trap and emulate, > >> >>> not normal page frames. > >> >>> > >> >>> And I guess the reason that one domain exhausting Xen's memory > can > >> >>> affect another domain is because rangeset uses Xen heap, instead of > >> the > >> >>> per-domain memory. So what about we use a 8K limit by now for > XenGT, > >> >>> and in the future, if a per-domain memory allocation solution for > >> >>> rangeset is ready, we do need to limit the rangeset size. Does this > >> >>> sounds more acceptable? > >> >> > >> >> The lower the limit the better (but no matter how low the limit > >> >> it won't make this a pretty thing). Anyway I'd still like to wait > >> >> for what Ian may further say on this. > >> >> > >> > Hi Jan, I just had a discussion with my colleague. We believe 8K could > >> > be the biggest limit for the write-protected ram ranges. If in the > >> > future, number of vGPU page tables exceeds this limit, we will modify > >> > our back-end device model to find a trade-off method, instead of > >> > extending this limit. If you can accept this value as the upper bound > >> > of rangeset, maybe we do not need to add any tool stack parameters, > but > >> > define a MAX_NR_WR_RAM_RANGES for the write-protected ram > >> rangesset. As > >> > to other rangesets, we keep their limit as 256. Does this sounds OK? :) > >> > >> I'm getting the impression that we're moving in circles. A blanket > >> limit above the 256 one for all domains is _not_ going to be > >> acceptable; going to 8k will still need host admin consent. With > >> your rangeset performance improvement patch, each range is > >> going to be tracked by a 40 byte structure (up from 32), which > >> already means an overhead increase for all the other ranges. 8k > >> of wp ranges implies an overhead beyond 448k (including the > >> xmalloc() overhead), which is not _that_ much, but also not > >> negligible. > >> > > > > ... which means we are still going to need a toolstack parameter to set the > > limit. We already have a parameter for VRAM size so is having a parameter > for > > max. GTT shadow ranges such a bad thing? > > It's workable, but not nice (see also Ian's earlier response). > > > Is the fact that the memory comes > > from xenheap rather than domheap the real problem? > > Not the primary one, since except on huge memory machines > both heaps are identical. To me the primary one is the quite > more significant resource consumption in the first place (I'm not > going to repeat what I've written in already way too many > replies before). Ok. Well the only way round tracking specific ranges for emulation (and consequently suffering the overhead) is tracking by type. For XenGT I guess it would be possible to live with a situation where a single ioreq server can register all wp mem emulations for a given VM. I can't say I particularly like that way of doing things but if it's the only way forward then I guess we may have to live with it. Paul > > Jan
>>> On 03.02.16 at 13:50, <Paul.Durrant@citrix.com> wrote: >> -----Original Message----- >> From: Jan Beulich [mailto:JBeulich@suse.com] >> Sent: 03 February 2016 12:36 >> To: Paul Durrant >> Cc: Andrew Cooper; Ian Campbell; Ian Jackson; Stefano Stabellini; Wei Liu; >> Kevin Tian; zhiyuan.lv@intel.com; Zhang Yu; xen-devel@lists.xen.org; Keir >> (Xen.org) >> Subject: RE: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter >> max_wp_ram_ranges. >> >> >>> On 03.02.16 at 13:20, <Paul.Durrant@citrix.com> wrote: >> >> -----Original Message----- >> >> From: Jan Beulich [mailto:JBeulich@suse.com] >> >> Sent: 03 February 2016 08:33 >> >> To: Zhang Yu >> >> Cc: Andrew Cooper; Ian Campbell; Paul Durrant; Wei Liu; Ian Jackson; >> Stefano >> >> Stabellini; Kevin Tian; zhiyuan.lv@intel.com; xen-devel@lists.xen.org; Keir >> >> (Xen.org) >> >> Subject: Re: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter >> >> max_wp_ram_ranges. >> >> >> >> >>> On 03.02.16 at 08:10, <yu.c.zhang@linux.intel.com> wrote: >> >> > On 2/2/2016 11:21 PM, Jan Beulich wrote: >> >> >>>>> On 02.02.16 at 16:00, <yu.c.zhang@linux.intel.com> wrote: >> >> >>> The limit of 4G is to avoid the data missing from uint64 to uint32 >> >> >>> assignment. And I can accept the 8K limit for XenGT in practice. >> >> >>> After all, it is vGPU page tables we are trying to trap and emulate, >> >> >>> not normal page frames. >> >> >>> >> >> >>> And I guess the reason that one domain exhausting Xen's memory >> can >> >> >>> affect another domain is because rangeset uses Xen heap, instead of >> >> the >> >> >>> per-domain memory. So what about we use a 8K limit by now for >> XenGT, >> >> >>> and in the future, if a per-domain memory allocation solution for >> >> >>> rangeset is ready, we do need to limit the rangeset size. Does this >> >> >>> sounds more acceptable? >> >> >> >> >> >> The lower the limit the better (but no matter how low the limit >> >> >> it won't make this a pretty thing). 
Anyway I'd still like to wait >> >> >> for what Ian may further say on this. >> >> >> >> >> > Hi Jan, I just had a discussion with my colleague. We believe 8K could >> >> > be the biggest limit for the write-protected ram ranges. If in the >> >> > future, number of vGPU page tables exceeds this limit, we will modify >> >> > our back-end device model to find a trade-off method, instead of >> >> > extending this limit. If you can accept this value as the upper bound >> >> > of rangeset, maybe we do not need to add any tool stack parameters, >> but >> >> > define a MAX_NR_WR_RAM_RANGES for the write-protected ram >> >> rangesset. As >> >> > to other rangesets, we keep their limit as 256. Does this sounds OK? :) >> >> >> >> I'm getting the impression that we're moving in circles. A blanket >> >> limit above the 256 one for all domains is _not_ going to be >> >> acceptable; going to 8k will still need host admin consent. With >> >> your rangeset performance improvement patch, each range is >> >> going to be tracked by a 40 byte structure (up from 32), which >> >> already means an overhead increase for all the other ranges. 8k >> >> of wp ranges implies an overhead beyond 448k (including the >> >> xmalloc() overhead), which is not _that_ much, but also not >> >> negligible. >> >> >> > >> > ... which means we are still going to need a toolstack parameter to set the >> > limit. We already have a parameter for VRAM size so is having a parameter >> for >> > max. GTT shadow ranges such a bad thing? >> >> It's workable, but not nice (see also Ian's earlier response). >> >> > Is the fact that the memory comes >> > from xenheap rather than domheap the real problem? >> >> Not the primary one, since except on huge memory machines >> both heaps are identical. To me the primary one is the quite >> more significant resource consumption in the first place (I'm not >> going to repeat what I've written in already way too many >> replies before). > > Ok. 
Well the only way round tracking specific ranges for emulation (and > consequently suffering the overhead) is tracking by type. For XenGT I guess > it would be possible to live with a situation where a single ioreq server can > register all wp mem emulations for a given VM. I can't say I particularly > like that way of doing things but if it's the only way forward then I guess > we may have to live with it. Well, subject to Ian not objecting (still awaiting some follow-up by him), I didn't mean to say doing it the proposed way is a no-go. All that I really insist on is that this larger resource consumption won't go without some form of host admin consent. Jan
> -----Original Message----- [snip] > >> >> I'm getting the impression that we're moving in circles. A blanket > >> >> limit above the 256 one for all domains is _not_ going to be > >> >> acceptable; going to 8k will still need host admin consent. With > >> >> your rangeset performance improvement patch, each range is > >> >> going to be tracked by a 40 byte structure (up from 32), which > >> >> already means an overhead increase for all the other ranges. 8k > >> >> of wp ranges implies an overhead beyond 448k (including the > >> >> xmalloc() overhead), which is not _that_ much, but also not > >> >> negligible. > >> >> > >> > > >> > ... which means we are still going to need a toolstack parameter to set > the > >> > limit. We already have a parameter for VRAM size so is having a > parameter > >> for > >> > max. GTT shadow ranges such a bad thing? > >> > >> It's workable, but not nice (see also Ian's earlier response). > >> > >> > Is the fact that the memory comes > >> > from xenheap rather than domheap the real problem? > >> > >> Not the primary one, since except on huge memory machines > >> both heaps are identical. To me the primary one is the quite > >> more significant resource consumption in the first place (I'm not > >> going to repeat what I've written in already way too many > >> replies before). > > > > Ok. Well the only way round tracking specific ranges for emulation (and > > consequently suffering the overhead) is tracking by type. For XenGT I > guess > > it would be possible to live with a situation where a single ioreq server can > > register all wp mem emulations for a given VM. I can't say I particularly > > like that way of doing things but if it's the only way forward then I guess > > we may have to live with it. > > Well, subject to Ian not objecting (still awaiting some follow-up by > him), I didn't mean to say doing it the proposed way is a no-go. 
> All that I really insist on is that this larger resource consumption > won't go without some form of host admin consent. > Would you be ok with purely host admin consent e.g. just setting the limit via boot command line? Paul > Jan
>>> On 03.02.16 at 14:07, <Paul.Durrant@citrix.com> wrote: >> -----Original Message----- > [snip] >> >> >> I'm getting the impression that we're moving in circles. A blanket >> >> >> limit above the 256 one for all domains is _not_ going to be >> >> >> acceptable; going to 8k will still need host admin consent. With >> >> >> your rangeset performance improvement patch, each range is >> >> >> going to be tracked by a 40 byte structure (up from 32), which >> >> >> already means an overhead increase for all the other ranges. 8k >> >> >> of wp ranges implies an overhead beyond 448k (including the >> >> >> xmalloc() overhead), which is not _that_ much, but also not >> >> >> negligible. >> >> >> >> >> > >> >> > ... which means we are still going to need a toolstack parameter to set >> the >> >> > limit. We already have a parameter for VRAM size so is having a >> parameter >> >> for >> >> > max. GTT shadow ranges such a bad thing? >> >> >> >> It's workable, but not nice (see also Ian's earlier response). >> >> >> >> > Is the fact that the memory comes >> >> > from xenheap rather than domheap the real problem? >> >> >> >> Not the primary one, since except on huge memory machines >> >> both heaps are identical. To me the primary one is the quite >> >> more significant resource consumption in the first place (I'm not >> >> going to repeat what I've written in already way too many >> >> replies before). >> > >> > Ok. Well the only way round tracking specific ranges for emulation (and >> > consequently suffering the overhead) is tracking by type. For XenGT I >> guess >> > it would be possible to live with a situation where a single ioreq server can >> > register all wp mem emulations for a given VM. I can't say I particularly >> > like that way of doing things but if it's the only way forward then I guess >> > we may have to live with it. 
>> >> Well, subject to Ian not objecting (still awaiting some follow-up by >> him), I didn't mean to say doing it the proposed way is a no-go. >> All that I really insist on is that this larger resource consumption >> won't go without some form of host admin consent. > > Would you be ok with purely host admin consent e.g. just setting the limit > via boot command line? I wouldn't be happy with that (and I've said so before), since it would allow all VMs this extra resource consumption. Jan
> -----Original Message----- > From: Jan Beulich [mailto:JBeulich@suse.com] > Sent: 03 February 2016 13:18 > To: Paul Durrant > Cc: Andrew Cooper; Ian Campbell; Ian Jackson; Stefano Stabellini; Wei Liu; > Kevin Tian; zhiyuan.lv@intel.com; Zhang Yu; xen-devel@lists.xen.org; Keir > (Xen.org) > Subject: RE: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter > max_wp_ram_ranges. > > >>> On 03.02.16 at 14:07, <Paul.Durrant@citrix.com> wrote: > >> -----Original Message----- > > [snip] > >> >> >> I'm getting the impression that we're moving in circles. A blanket > >> >> >> limit above the 256 one for all domains is _not_ going to be > >> >> >> acceptable; going to 8k will still need host admin consent. With > >> >> >> your rangeset performance improvement patch, each range is > >> >> >> going to be tracked by a 40 byte structure (up from 32), which > >> >> >> already means an overhead increase for all the other ranges. 8k > >> >> >> of wp ranges implies an overhead beyond 448k (including the > >> >> >> xmalloc() overhead), which is not _that_ much, but also not > >> >> >> negligible. > >> >> >> > >> >> > > >> >> > ... which means we are still going to need a toolstack parameter to > set > >> the > >> >> > limit. We already have a parameter for VRAM size so is having a > >> parameter > >> >> for > >> >> > max. GTT shadow ranges such a bad thing? > >> >> > >> >> It's workable, but not nice (see also Ian's earlier response). > >> >> > >> >> > Is the fact that the memory comes > >> >> > from xenheap rather than domheap the real problem? > >> >> > >> >> Not the primary one, since except on huge memory machines > >> >> both heaps are identical. To me the primary one is the quite > >> >> more significant resource consumption in the first place (I'm not > >> >> going to repeat what I've written in already way too many > >> >> replies before). > >> > > >> > Ok. 
Well the only way round tracking specific ranges for emulation (and > >> > consequently suffering the overhead) is tracking by type. For XenGT I > >> guess > >> > it would be possible to live with a situation where a single ioreq server > can > >> > register all wp mem emulations for a given VM. I can't say I particularly > >> > like that way of doing things but if it's the only way forward then I guess > >> > we may have to live with it. > >> > >> Well, subject to Ian not objecting (still awaiting some follow-up by > >> him), I didn't mean to say doing it the proposed way is a no-go. > >> All that I really insist on is that this larger resource consumption > >> won't go without some form of host admin consent. > > > > Would you be ok with purely host admin consent e.g. just setting the limit > > via boot command line? > > I wouldn't be happy with that (and I've said so before), since it > would allow all VM this extra resource consumption. > The ball is back in Ian's court then. Paul > Jan
Paul Durrant writes ("RE: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter max_wp_ram_ranges."): > > From: Jan Beulich [mailto:JBeulich@suse.com] ... > > I wouldn't be happy with that (and I've said so before), since it > > would allow all VM this extra resource consumption. > > The ball is back in Ian's court then. Sorry to be vague, but: I'm not definitely objecting to some toolstack parameter. I'm trying to figure out whether this parameter, in this form, with this documentation, makes some kind of sense. In the most recent proposed patch the docs basically say (to most users) "there is this parameter, but it is very complicated, so do not set it." We already have a lot of parameters of this kind. As a general rule they are OK if it is really the case that the parameter should be ignored. I am happy to have a whole lot of strange parameters that the user can ignore. But as far as I can tell from this conversation, users are going to need to set this parameter in normal operation in some configurations. I would ideally like to avoid a situation where (i) the Xen docs say "do not set this parameter because it is confusing" but (ii) other less authoritative sources (wiki pages, or mailing list threads, etc.) say "oh yes just set this weird parameter to 8192 for no readily comprehensible reason". I say `some configurations' because, I'm afraid, most of the conversation about hypervisor internals has gone over my head. Let me try to summarise (correct me if I am wrong): * There are some hypervisor tracking resources associated with each emulated MMIO range. (Do we mean the memory ranges that are configured in the hypervisor to be sent to an ioemu via the ioreq protocol - ie, the system which is normally used in HVM domains to interface to the device model?) (Are these ranges denominated in guest-physical space?) * For almost all domains the set of such MMIO ranges is small or very small. * Such ranges are sometimes created by, or specified by, the guest.
(I don't understand why this should be the case but perhaps this is an inherent aspect of the design of this new feature.) * If too many such ranges were created by the guest, the guest could consume excessive hypervisor memory. * Therefore normally the number of such ranges per guest is (or should be) limited to a small number. * With `Intel GVT-g broadwell platform' and `vGPU in GVT-g' or `GVT-d' it may be necessary for functionality to allow a larger number of such ranges. But to be able to know what the right interface is for the system administrator (and what to write in the docs), I need to know: * Is it possible for libxl to somehow tell from the rest of the configuration that this larger limit should be applied ? * In the configurations where a larger number is needed, what larger limit is appropriate ? How should it be calculated ? * How do we know that this does not itself give an opportunity for hypervisor resource exhaustion attacks by guests ? (Note: if it _does_ give such an opportunity, this should be mentioned more clearly in the documentation.) * If we are talking about mmio ranges for ioreq servers, why do guests which do not use this feature have the ability to create them at all ? (A background problem I have is that this thread is full of product name jargon and assumes a lot of background knowledge of the implementation of these features - background knowledge which I lack and which isn't in these patches. If someone could point me at a quick summary of what `GVT-g' and `GVT-d' are that might help.) Ian.
> -----Original Message----- > From: Ian Jackson [mailto:Ian.Jackson@eu.citrix.com] > Sent: 03 February 2016 14:44 > To: Paul Durrant > Cc: Jan Beulich; Andrew Cooper; Ian Campbell; Stefano Stabellini; Wei Liu; > Kevin Tian; zhiyuan.lv@intel.com; Zhang Yu; xen-devel@lists.xen.org; Keir > (Xen.org) > Subject: RE: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter > max_wp_ram_ranges. > > Paul Durrant writes ("RE: [Xen-devel] [PATCH v3 3/3] tools: introduce > parameter max_wp_ram_ranges."): > > > From: Jan Beulich [mailto:JBeulich@suse.com] > ... > > > I wouldn't be happy with that (and I've said so before), since it > > > would allow all VM this extra resource consumption. > > > > The ball is back in Ian's court then. > > Sorry to be vague, but: I'm not definitely objecting to some toolstack > parameter. I'm trying to figure out whether this parameter, in this > form, with this documentation, makes some kind of sense. > > In the most recent proposed patch the docs basically say (to most > users) "there is this parameter, but it is very complicated, so do not > set it. We already have a lot of these kind of parameters. As a > general rule they are OK if it is really the case that the parameter > should be ignored. I am happy to have a whole lot of strange > parameters that the user can ignore. > > But as far as I can tell from this conversation, users are going to > need to set this parameter in normal operation in some > configurations. > > I would ideally like to avoid a situation where (i) the Xen docs say > "do not set this parameter because it is confusing" but (ii) other > less authoritative sources (wiki pages, or mailing list threads, etc.) > say "oh yes just set this weird parameter to 8192 for no readily > comprehensible reason". > > I say `some configurations' because, I'm afraid, most of the > conversation about hypervisor internals has gone over my head. 
Let me > try to summarise (correct me if I am wrong): > > * There are some hypervisor tracking resources associated with each > emulated MMIO range. > > (Do we mean the memory ranges that are configured in the hypervisor > to be sent to an ioemu via the ioreq protocol - ie, the system > which is normally used in HVM domains to interface to the device > model?) Yes, the ioreq server infrastructure needs rangesets to steer particular emulation requests to the correct emulator. > > (Are these ranges denominated in guest-physical space?) > Yes. > * For almost all domains the set of such MMIO ranges is small or very > small. > > * Such ranges are sometimes created by, or specified by, the guest. > > (I don't understand why this should be the case but perhaps this is > an inherent aspect of the design of this new feature.) > This is the 'new' thing here. Because XenGT needs to shadow GTTs (i.e. GPU page tables) by steering writes to the GVT-g ioreq server, and the number of GTTs is broadly determined by the number of applications using the GPU inside the VM, we have a situation where the guest is determining Xen's use of memory. > * If too many such ranges were created by the guest the guest could > consume excessive hypervisor memory. > > * Therefore normally the number of such ranges per guest is (or > should be) limited to a small number. > > * With `Intel GVT-g broadwell platform' and `vGPU in GVT-g' or > `GVT-d' it may be necessary for functionality to allow a larger > number of such ranges. > > But to be able to know what the right interface is for the system > administrator (and what to write in the docs), I need to know: > > * Is it possible for libxl to somehow tell from the rest of the > configuration that this larger limit should be applied ? > If a XenGT-enabled VM is provisioned through libxl then some larger limit is likely to be required.
One of the issues is that it is impossible (or at least difficult) to know how many GTTs are going to need to be shadowed. > * In the configurations where a larger number is needed, what larger > limit is appropriate ? How should it be calculated ? > Intel are suggesting a static 8k value should be ok. > * How do we know that this does not itself give an opportunity for > hypervisor resource exhaustion attacks by guests ? (Note: if it > _does_ give such an opportunity, this should be mentioned more > clearly in the documentation.) > In practice XenGT only supports 4 VMs (including dom0) on Haswell and 7 on Broadwell so even with the larger limit we're not talking about a huge amount of resource. > * If we are talking about mmio ranges for ioreq servers, why do > guests which do not use this feature have the ability to create > them at all ? It's not the guest that directly creates the ranges, it's the emulator. Normally device emulation would require a relatively small number of MMIO ranges and a total number that cannot be influenced by the guest itself. In this case though, as I said above, the number *can* be influenced by the guest (although it is still the emulator which actually causes the ranges to be created). > > (A background problem I have is that this thread is full of product > name jargon and assumes a lot of background knowledge of the > implementation of these features - background knowledge which I lack > and which isn't in these patches. If someone could point me at a > quick summary of what `GVT-g' and `GVT-d' are that might help.) > GVT-d is a name applied to PCI passthrough of an Intel GPU. GVT-g is a name applied to Intel GPU virtualization, which makes use of an emulator to mediate guest access to the real GPU so that it is possible to share the resources securely. Paul > Ian.
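The steering Paul describes can be sketched in miniature. This is an illustrative stand-in only: in Xen the real logic lives in hvm_select_ioreq_server() and the rangeset library, and all the structures and names below are invented for the example:

```c
#include <assert.h>
#include <stdint.h>
#include <stdbool.h>

/* Greatly simplified sketch of how per-server range tracking steers an
 * emulation request: each server owns a set of guest-physical ranges,
 * and an access is forwarded to the first server whose set contains it. */

struct range { uint64_t start, end; };          /* inclusive bounds */

struct ioreq_server {
    int id;
    const struct range *ranges;                 /* sorted, non-overlapping */
    unsigned int nr_ranges;
};

static bool ranges_contain(const struct ioreq_server *s, uint64_t addr)
{
    /* Binary search over the sorted ranges. */
    unsigned int lo = 0, hi = s->nr_ranges;

    while (lo < hi) {
        unsigned int mid = lo + (hi - lo) / 2;

        if (addr < s->ranges[mid].start)
            hi = mid;
        else if (addr > s->ranges[mid].end)
            lo = mid + 1;
        else
            return true;
    }
    return false;
}

/* Return the id of the server claiming this address, or -1 for none
 * (in Xen the fallback would be the default ioreq server). */
static int select_ioreq_server(const struct ioreq_server *servers,
                               unsigned int nr_servers, uint64_t addr)
{
    for (unsigned int i = 0; i < nr_servers; i++)
        if (ranges_contain(&servers[i], addr))
            return servers[i].id;
    return -1;
}
```

The point of contention in the thread is the size of the per-server range set: for ordinary device emulation it stays tiny, whereas shadowing GTTs can push it into the thousands.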
On Wed, Feb 3, 2016 at 2:43 PM, Ian Jackson <Ian.Jackson@eu.citrix.com> wrote: > Paul Durrant writes ("RE: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter max_wp_ram_ranges."): >> > From: Jan Beulich [mailto:JBeulich@suse.com] > ... >> > I wouldn't be happy with that (and I've said so before), since it >> > would allow all VM this extra resource consumption. >> >> The ball is back in Ian's court then. > > Sorry to be vague, but: I'm not definitely objecting to some toolstack > parameter. I'm trying to figure out whether this parameter, in this > form, with this documentation, makes some kind of sense. > > In the most recent proposed patch the docs basically say (to most > users) "there is this parameter, but it is very complicated, so do not > set it. We already have a lot of these kind of parameters. As a > general rule they are OK if it is really the case that the parameter > should be ignored. I am happy to have a whole lot of strange > parameters that the user can ignore. > > But as far as I can tell from this conversation, users are going to > need to set this parameter in normal operation in some > configurations. > > I would ideally like to avoid a situation where (i) the Xen docs say > "do not set this parameter because it is confusing" but (ii) other > less authoritative sources (wiki pages, or mailing list threads, etc.) > say "oh yes just set this weird parameter to 8192 for no readily > comprehensible reason". > > I say `some configurations' because, I'm afraid, most of the > conversation about hypervisor internals has gone over my head. Let me > try to summarise (correct me if I am wrong): > > * There are some hypervisor tracking resources associated with each > emulated MMIO range. > > (Do we mean the memory ranges that are configured in the hypervisor > to be sent to an ioemu via the ioreq protocol - ie, the system > which is normally used in HVM domains to interface to the device > model?) 
> > (Are these ranges denominated in guest-physical space?) > > * For almost all domains the set of such MMIO ranges is small or very > small. > > * Such ranges are sometimes created by, or specified by, the guest. > > (I don't understand why this should be the case but perhaps this is > an inherent aspect of the design of this new feature.) So the real issue here, as I've said elsewhere, is this: These are not MMIO regions. They are not IO because they do not talk to devices, and they are not regions; they are individual gpfns. What's happening here is that qemu wants to be able to do the equivalent of shadow pagetable emulation for the GPU's equivalent of the pagetables (there's a special name, I forget what they're called). These gpfns are just normal guest memory, selected by the operating system and/or the graphics driver to use in the equivalent of a pagetable for the GPU. And instead of making a new interface designed to keep track of gpfns, they are (ab)using the existing "MMIO range" interface. But of course, since they aren't actually ranges but just gpfns, they're scattered randomly throughout the guest physical address space. That's why XenGT suddenly wants orders of magnitude more "MMIO regions" than any other device has ever needed before -- because they're using a hammer when what they want is a rake. They claim that 8k "should be enough for anybody", but as far as I can tell, the theoretical limit of the number of pages being used in the GPU pagetables could be unbounded. I think at some point I suggested an alternate design based on marking such gpfns with a special p2m type; I can't remember if that suggestion was actually addressed or not. -George
On Wed, Feb 3, 2016 at 3:10 PM, Paul Durrant <Paul.Durrant@citrix.com> wrote: >> * Is it possible for libxl to somehow tell from the rest of the >> configuration that this larger limit should be applied ? >> > > If a XenGT-enabled VM is provisioned through libxl then some larger limit is likely to be required. One of the issues is that it is impossible (or at least difficult) to know how many GTTs are going to need to be shadowed. By GTT, you mean the GPU pagetables I assume? So you're talking about how large this value should be made, not whether the heuristically-chosen larger value should be used. libxl should be able to tell that XenGT is enabled, I assume, so it should be able to automatically bump this to 8k if necessary, right? But I think you'd still need a parameter that you could tweak if it turned out that 8k wasn't enough for a particular workload, right? >> * If we are talking about mmio ranges for ioreq servers, why do >> guests which do not use this feature have the ability to create >> them at all ? > > It's not the guest that directly creates the ranges, it's the emulator. Normally device emulation would require a relatively small number of MMIO ranges and a total number that cannot be influenced by the guest itself. In this case though, as I said above, the number *can* be influenced by the guest (although it is still the emulator which actually causes the ranges to be created). Just to make this clear: The guest chooses how many gpfns are used in the GPU pagetables; for each additional gpfn in the guest pagetable, qemu / xen have the option of either marking it to be emulated (at the moment, by marking it as a one-page "MMIO region") or crashing the guest. >> (A background problem I have is that this thread is full of product >> name jargon and assumes a lot of background knowledge of the >> implementation of these features - background knowledge which I lack >> and which isn't in these patches. 
If someone could point me at a >> quick summary of what `GVT-g' and `GVT-d' are that might help.) >> > > GVT-d is a name applied to PCI passthrough of an Intel GPU. GVT-g is a name applied to Intel GPU virtualization, which makes use of an emulator to mediate guest access to the real GPU so that it is possible to share the resources securely. And GTT are the GPU equivalent of page tables? -George
On Wed, Feb 3, 2016 at 5:41 PM, George Dunlap <George.Dunlap@eu.citrix.com> wrote: > I think at some point I suggested an alternate design based on marking > such gpfns with a special p2m type; I can't remember if that > suggestion was actually addressed or not. FWIW, the thread where I suggested using p2m types was in response to <1436163912-1506-2-git-send-email-yu.c.zhang@linux.intel.com> Looking through it again, the main objection Paul gave[1] was: "And it's the assertion that use of write_dm will only be relevant to gfns, and that all such notifications only need go to a single ioreq server, that I have a problem with. Whilst the use of io ranges to track gfn updates is, I agree, not ideal I think the overloading of write_dm is not a step in the right direction." Two issues raised here, about using only p2m types to implement write_dm: 1. More than one ioreq server may want to use the write_dm functionality 2. ioreq servers may want to use write_dm for things other than individual gpfns My answer to #1 was: 1. At the moment, we only need to support a single ioreq server using write_dm 2. It's not technically difficult to extend the number of servers supported to something sensible, like 4 (using 4 different write_dm p2m types) 3. The interface can be designed such that we can extend support to multiple servers when we need to. My answer to #2 was that there's no reason why write_dm couldn't be used for both individual gpfns and ranges; there's no reason the interface can't take a "start" and "count" argument, even if for the time being "count" is almost always going to be 1. Compare this to the downsides of the approach you're proposing: 1. Using 40 bytes of hypervisor space per guest GPU pagetable page (as opposed to using a bit in the existing p2m table) 2. Walking down an RB tree with 8000 individual nodes to find out which server to send the message to (rather than just reading the value from the p2m table). 3.
Needing to determine on a guest-by-guest basis whether to change the limit 4. Needing to have an interface to make the limit even bigger, just in case we find workloads that have even more GTTs. I really don't understand where you're coming from on this. The approach you've chosen looks to me to be slower, more difficult to implement, and more complicated; and it's caused a lot more resistance trying to get this series accepted. -George [1] <9AAE0902D5BC7E449B7C8E4E778ABCD02598B222@AMSPEX01CL02.citrite.net>
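The alternative George describes turns the steering decision into a table lookup. The toy model below (a flat array standing in for the p2m, with invented type names) shows the O(1) check he contrasts with walking an RB tree of thousands of single-page ranges:

```c
#include <assert.h>
#include <stdint.h>
#include <stdbool.h>

/* Toy model: instead of recording each write-protected gpfn as a
 * one-page "MMIO range" in a rangeset, tag it with a p2m type and
 * decide by a direct table lookup.  The flat array stands in for the
 * real p2m; the type names are illustrative, not Xen's. */

enum toy_p2m_type {
    toy_ram_rw = 0,
    toy_mmio_write_dm,        /* write accesses go to the device model */
};

#define TOY_NR_GFNS 65536u

static uint8_t toy_p2m[TOY_NR_GFNS];     /* one type per gfn */

static void toy_set_type(uint64_t gfn, enum toy_p2m_type t)
{
    if (gfn < TOY_NR_GFNS)
        toy_p2m[gfn] = (uint8_t)t;
}

/* O(1) check, independent of how many gpfns are tracked --
 * contrast with an RB-tree walk over thousands of single-page ranges. */
static bool toy_write_goes_to_dm(uint64_t gfn)
{
    return gfn < TOY_NR_GFNS && toy_p2m[gfn] == toy_mmio_write_dm;
}
```

In the real p2m the type would live in spare PTE bits rather than a side array, which is exactly the scarcity Andrew raises in the next message.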
On Wed, Feb 3, 2016 at 6:21 PM, George Dunlap <George.Dunlap@eu.citrix.com> wrote: > I really don't understand where you're coming from on this. The > approach you've chosen looks to me to be slower, more difficult to > implement, and more complicated; and it's caused a lot more resistance > trying to get this series accepted. Maybe it would be helpful if I did a mock-up implementation of what I had in mind, and you can have something concrete to criticize. I'll see if I can do that sometime this week. -George
On 03/02/16 18:21, George Dunlap wrote: > 2. It's not technically difficult to extend the number of servers > supported to something sensible, like 4 (using 4 different write_dm > p2m types) While technically true, spare bits in the pagetable entries are at a premium, and steadily decreasing as Intel are introducing new features. We have 16 current p2m types, and a finite upper bound of 6 bits of p2m type space, already with a push to reduce this number. While introducing 1 new p2m type for this purpose might be an acceptable tradeoff, using a p2m type per ioreq server is not, IMO. As an orthogonal exercise, several of the existing p2m types are redundant and can be (or already are) expressed in other ways. There is definitely room to reduce the current 16 types down a bit without too much effort. ~Andrew
On 03/02/16 18:39, Andrew Cooper wrote: > On 03/02/16 18:21, George Dunlap wrote: >> 2. It's not technically difficult to extend the number of servers >> supported to something sensible, like 4 (using 4 different write_dm >> p2m types) > > While technically true, spare bits in the pagetable entries are at a > premium, and steadily decreasing as Intel are introducing new features. > > We have 16 current p2m types, and a finite upper bound of 6 bits of p2m > type space, already with a push to reduce this number. > > While introducing 1 new p2m type for this purpose might be an acceptable > tradeoff, using a p2m type per ioreq server is not, IMO. It is true that we don't have a ton of elbow room to grow at the moment. But we actually already have a single p2m type -- mmio_write_dm -- that as far as I know is only being used by XenGT. We don't actually need to add any new p2m types for my initial proposal. Going forward, we probably will, at some point, need to implement a parallel "p2t" structure to keep track of types -- and probably will whether we end up implementing 4 separate write_dm types or not (for the reasons you describe). -George
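One conceivable shape for the "p2t" side structure George mentions, sketched purely as an illustration (the packing width and helper names are invented here; a real design could differ entirely): a packed side table holding a small type code per gfn, so that scarce PTE software bits are not consumed:

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Illustrative "p2t": 4 bits of type code per gfn, two entries per
 * byte.  Purely a sketch of the idea of moving type storage out of
 * the pagetable entries; not a proposed or existing Xen structure. */

#define P2T_BITS     4u
#define P2T_MASK     0x0fu
#define P2T_PER_BYTE 2u

static void p2t_set(uint8_t *p2t, uint64_t gfn, uint8_t type)
{
    size_t byte = gfn / P2T_PER_BYTE;
    unsigned int shift = (gfn % P2T_PER_BYTE) * P2T_BITS;

    p2t[byte] = (uint8_t)((p2t[byte] & ~(P2T_MASK << shift)) |
                          ((type & P2T_MASK) << shift));
}

static uint8_t p2t_get(const uint8_t *p2t, uint64_t gfn)
{
    size_t byte = gfn / P2T_PER_BYTE;
    unsigned int shift = (gfn % P2T_PER_BYTE) * P2T_BITS;

    return (p2t[byte] >> shift) & P2T_MASK;
}
```

At 4 bits per gfn such a table costs half a byte per guest page, traded against freeing the PTE software bits Andrew describes as scarce.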
On 2/4/2016 1:50 AM, George Dunlap wrote: > On Wed, Feb 3, 2016 at 3:10 PM, Paul Durrant <Paul.Durrant@citrix.com> wrote: >>> * Is it possible for libxl to somehow tell from the rest of the >>> configuration that this larger limit should be applied ? >>> >> >> If a XenGT-enabled VM is provisioned through libxl then some larger limit is likely to be required. One of the issues is that it is impossible (or at least difficult) to know how many GTTs are going to need to be shadowed. > > By GTT, you mean the GPU pagetables I assume? So you're talking about Yes, GTT is "Graphics Translation Table" for short. > how large this value should be made, not whether the > heuristically-chosen larger value should be used. libxl should be > able to tell that XenGT is enabled, I assume, so it should be able to > automatically bump this to 8k if necessary, right? > Yes. > But I think you'd still need a parameter that you could tweak if it > turned out that 8k wasn't enough for a particular workload, right? > Well, not exactly. For XenGT, the latest suggestion is that even when 8K is not enough, we will not extend this limit anymore. But when introducing this parameter, I had thought it might also be helpful for other device virtualization cases which would like to use the mediated passthrough idea. >>> * If we are talking about mmio ranges for ioreq servers, why do >>> guests which do not use this feature have the ability to create >>> them at all ? >> >> It's not the guest that directly creates the ranges, it's the emulator. Normally device emulation would require a relatively small number of MMIO ranges and a total number that cannot be influenced by the guest itself. In this case though, as I said above, the number *can* be influenced by the guest (although it is still the emulator which actually causes the ranges to be created). 
> > Just to make this clear: The guest chooses how many gpfns are used in > the GPU pagetables; for each additional gpfn in the guest pagetable, > qemu / xen have the option of either marking it to be emulated (at the > moment, by marking it as a one-page "MMIO region") or crashing the > guest. > Well, kind of. The backend device model in dom0 (not qemu) makes the decision whether or not this page is to be emulated. >>> (A background problem I have is that this thread is full of product >>> name jargon and assumes a lot of background knowledge of the >>> implementation of these features - background knowledge which I lack >>> and which isn't in these patches. If someone could point me at a >>> quick summary of what `GVT-g' and `GVT-d' are that might help.) >>> >> >> GVT-d is a name applied to PCI passthrough of an Intel GPU. GVT-g is a name applied to Intel GPU virtualization, which makes use of an emulator to mediate guest access to the real GPU so that it is possible to share the resources securely. > > And GTT are the GPU equivalent of page tables? > Yes. Here let me give a brief introduction to the jargon: * Intel GVT-d: an Intel graphics virtualization solution, which dedicates one physical GPU to a guest exclusively. * Intel GVT-g: an Intel graphics virtualization solution, with mediated pass-through support. One physical GPU can be shared by multiple guests. GPU performance-critical resources are partitioned among and passed through to different vGPUs. Other GPU resources are trapped and emulated by the device model. * XenGT: the code name for Intel GVT-g on Xen. This patch series contains features required by XenGT. * vGPU: virtual GPU presented to guests. * GTT: abbreviation for graphics translation table, a page table structure which translates graphics memory addresses to physical ones. For a vGPU, the PTEs in its GTT are GPFNs, which is why the device model needs to construct a set of shadow GPU page tables. Thanks Yu
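The last glossary entry - a GTT whose PTEs hold GPFNs and therefore must be shadowed - can be illustrated with a toy model. Nothing below is XenGT code; the flat p2m array and the entry layout are invented for the example:

```c
#include <assert.h>
#include <stdint.h>
#include <stdbool.h>

/* Toy model of shadow GTT maintenance: the guest writes gpfns into its
 * GTT, but the hardware walks a shadow GTT that must hold machine frame
 * numbers, so every guest GTT write has to be trapped and translated. */

#define TOY_P2M_SIZE 1024u

/* toy p2m: gpfn -> mfn, 0 meaning "not mapped" */
static uint64_t toy_p2m_mfn[TOY_P2M_SIZE];

/* Called when a write to a write-protected guest GTT page is trapped:
 * translate the gpfn the guest wrote and mirror it into the shadow. */
static bool shadow_gtt_write(uint64_t *shadow_gtt, unsigned int index,
                             uint64_t guest_pte_gpfn)
{
    uint64_t mfn;

    if (guest_pte_gpfn >= TOY_P2M_SIZE)
        return false;
    mfn = toy_p2m_mfn[guest_pte_gpfn];
    if (mfn == 0)
        return false;        /* unmapped: real code would raise an error */

    shadow_gtt[index] = mfn; /* the GPU then sees machine addresses */
    return true;
}
```

This is why the guest GTT pages themselves must be write-protected: miss a write, and the shadow diverges from what the guest believes its page table says.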
On 2/4/2016 2:21 AM, George Dunlap wrote: > On Wed, Feb 3, 2016 at 5:41 PM, George Dunlap > <George.Dunlap@eu.citrix.com> wrote: >> I think at some point I suggested an alternate design based on marking >> such gpfns with a special p2m type; I can't remember if that >> suggestion was actually addressed or not. > > FWIW, the thread where I suggested using p2m types was in response to > > <1436163912-1506-2-git-send-email-yu.c.zhang@linux.intel.com> > > Looking through it again, the main objection Paul gave[1] was: > > "And it's the assertion that use of write_dm will only be relevant to > gfns, and that all such notifications only need go to a single ioreq > server, that I have a problem with. Whilst the use of io ranges to > track gfn updates is, I agree, not ideal I think the overloading of > write_dm is not a step in the right direction." > > Two issues raised here, about using only p2m types to implement write_dm: > 1. More than one ioreq server may want to use the write_dm functionality > 2. ioreq servers may want to use write_dm for things other than individual gpfns > > My answer to #1 was: > 1. At the moment, we only need to support a single ioreq server using write_dm > 2. It's not technically difficult to extend the number of servers > supported to something sensible, like 4 (using 4 different write_dm > p2m types) > 3. The interface can be designed such that we can extend support to > multiple servers when we need to. > > My answer to #2 was that there's no reason why using write_dm could be > used for both individual gpfns and ranges; there's no reason the > interface can't take a "start" and "count" argument, even if for the > time being "count" is almost always going to be 1. > Well, regarding "the 'count' always going to be 1": I doubt that. :) Statistics in XenGT show that GPU page tables are very likely to be allocated in contiguous gpfns. > Compare this to the downsides of the approach you're proposing: > 1.
Using 40 bytes of hypervisor space per guest GPU pagetable page (as > opposed to using a bit in the existing p2m table) > 2. Walking down an RB tree with 8000 individual nodes to find out > which server to send the message to (rather than just reading the > value from the p2m table). 8K is an upper limit for the rangeset; in many cases the RB tree will not contain that many nodes. > 3. Needing to determine on a guest-by-guest basis whether to change the limit > 4. Needing to have an interface to make the limit even bigger, just in > case we find workloads that have even more GTTs. > Well, as I suggested in yesterday's reply, XenGT can choose not to change this limit even when workloads are getting heavy - with tradeoffs on the device model side. > I really don't understand where you're coming from on this. The > approach you've chosen looks to me to be slower, more difficult to > implement, and more complicated; and it's caused a lot more resistance > trying to get this series accepted. > I agree utilizing the p2m types to do so is more efficient and quite intuitive. But I hesitate to occupy the software-available bits in EPT PTEs (as in Andrew's reply). Although we have introduced one, we believe it can also be used for other situations in the future, not just XenGT. Thanks Yu
On 2/4/2016 3:12 AM, George Dunlap wrote: > On 03/02/16 18:39, Andrew Cooper wrote: >> On 03/02/16 18:21, George Dunlap wrote: >>> 2. It's not technically difficult to extend the number of servers >>> supported to something sensible, like 4 (using 4 different write_dm >>> p2m types) >> >> While technically true, spare bits in the pagetable entries are at a >> premium, and steadily decreasing as Intel are introducing new features. >> >> We have 16 current p2m types, and a finite upper bound of 6 bits of p2m >> type space, already with a push to reduce this number. >> >> While introducing 1 new p2m type for this purpose might be an acceptable >> tradeoff, a using a p2m type per ioreq server is not IMO. > > It is true that we don't have a ton of elbow room to grow at the moment. > > But we actually already have a single p2m type -- mmio_write_dm -- that > as far as I know is only being used by XenGT. We don't actually need to > add any new p2m types for my initial proposal. > > Going forward, we probably will, at some point, need to implement a > parallel "p2t" structure to keep track of types -- and probably will > whether end up implementing 4 separate write_dm types or not (for the > reasons you describe). > Thank you, George. Could you please elaborate more about the idea of "p2t"? Yu
> -----Original Message----- > From: Yu, Zhang [mailto:yu.c.zhang@linux.intel.com] > Sent: 04 February 2016 08:51 > To: George Dunlap; Ian Jackson > Cc: Paul Durrant; Kevin Tian; Wei Liu; Ian Campbell; Andrew Cooper; xen- > devel@lists.xen.org; Stefano Stabellini; zhiyuan.lv@intel.com; Jan Beulich; > Keir (Xen.org) > Subject: Re: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter > max_wp_ram_ranges. > > > > On 2/4/2016 2:21 AM, George Dunlap wrote: > > On Wed, Feb 3, 2016 at 5:41 PM, George Dunlap > > <George.Dunlap@eu.citrix.com> wrote: > >> I think at some point I suggested an alternate design based on marking > >> such gpfns with a special p2m type; I can't remember if that > >> suggestion was actually addressed or not. > > > > FWIW, the thread where I suggested using p2m types was in response to > > > > <1436163912-1506-2-git-send-email-yu.c.zhang@linux.intel.com> > > > > Looking through it again, the main objection Paul gave[1] was: > > > > "And it's the assertion that use of write_dm will only be relevant to > > gfns, and that all such notifications only need go to a single ioreq > > server, that I have a problem with. Whilst the use of io ranges to > > track gfn updates is, I agree, not ideal I think the overloading of > > write_dm is not a step in the right direction." > > > > Two issues raised here, about using only p2m types to implement > write_dm: > > 1. More than one ioreq server may want to use the write_dm functionality > > 2. ioreq servers may want to use write_dm for things other than individual > gpfns > > > > My answer to #1 was: > > 1. At the moment, we only need to support a single ioreq server using > write_dm > > 2. It's not technically difficult to extend the number of servers > > supported to something sensible, like 4 (using 4 different write_dm > > p2m types) > > 3. The interface can be designed such that we can extend support to > > multiple servers when we need to. 
> > > > My answer to #2 was that there's no reason why using write_dm could be > > used for both individual gpfns and ranges; there's no reason the > > interface can't take a "start" and "count" argument, even if for the > > time being "count" is almost always going to be 1. > > > > Well, talking about "the 'count' always going to be 1". I doubt that. :) > Statistics in XenGT shows that, GPU page tables are very likely to > be allocated in contiguous gpfns. > > > Compare this to the downsides of the approach you're proposing: > > 1. Using 40 bytes of hypervisor space per guest GPU pagetable page (as > > opposed to using a bit in the existing p2m table) > > 2. Walking down an RB tree with 8000 individual nodes to find out > > which server to send the message to (rather than just reading the > > value from the p2m table). > > 8K is an upper limit for the rangeset, in many cases the RB tree will > not contain that many nodes. > > > 3. Needing to determine on a guest-by-guest basis whether to change the > limit > > 4. Needing to have an interface to make the limit even bigger, just in > > case we find workloads that have even more GTTs. > > > > Well, I have suggested in yesterday's reply. XenGT can choose not to > change this limit even when workloads are getting heavy - with > tradeoffs in the device model side. I assume this means that the emulator can 'unshadow' GTTs (I guess on an LRU basis) so that it can shadow new ones when the limit has been exhausted? If so, how bad is performance likely to be if we live with a lower limit and take the hit of unshadowing if the guest GTTs become heavily fragmented? Paul > > > I really don't understand where you're coming from on this. The > > approach you've chosen looks to me to be slower, more difficult to > > implement, and more complicated; and it's caused a lot more resistance > > trying to get this series accepted. > > > > I agree utilizing the p2m types to do so is more efficient and quite > intuitive. 
But I hesitate to occupy the software available bits in EPT > PTEs(like Andrew's reply). Although we have introduced one, we believe > it can also be used for other situations in the future, not just XenGT. > > Thanks > Yu
On 2/4/2016 5:28 PM, Paul Durrant wrote: >> -----Original Message----- >> From: Yu, Zhang [mailto:yu.c.zhang@linux.intel.com] >> Sent: 04 February 2016 08:51 >> To: George Dunlap; Ian Jackson >> Cc: Paul Durrant; Kevin Tian; Wei Liu; Ian Campbell; Andrew Cooper; xen- >> devel@lists.xen.org; Stefano Stabellini; zhiyuan.lv@intel.com; Jan Beulich; >> Keir (Xen.org) >> Subject: Re: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter >> max_wp_ram_ranges. >> >> >> >> On 2/4/2016 2:21 AM, George Dunlap wrote: >>> On Wed, Feb 3, 2016 at 5:41 PM, George Dunlap >>> <George.Dunlap@eu.citrix.com> wrote: >>>> I think at some point I suggested an alternate design based on marking >>>> such gpfns with a special p2m type; I can't remember if that >>>> suggestion was actually addressed or not. >>> >>> FWIW, the thread where I suggested using p2m types was in response to >>> >>> <1436163912-1506-2-git-send-email-yu.c.zhang@linux.intel.com> >>> >>> Looking through it again, the main objection Paul gave[1] was: >>> >>> "And it's the assertion that use of write_dm will only be relevant to >>> gfns, and that all such notifications only need go to a single ioreq >>> server, that I have a problem with. Whilst the use of io ranges to >>> track gfn updates is, I agree, not ideal I think the overloading of >>> write_dm is not a step in the right direction." >>> >>> Two issues raised here, about using only p2m types to implement >> write_dm: >>> 1. More than one ioreq server may want to use the write_dm functionality >>> 2. ioreq servers may want to use write_dm for things other than individual >> gpfns >>> >>> My answer to #1 was: >>> 1. At the moment, we only need to support a single ioreq server using >> write_dm >>> 2. It's not technically difficult to extend the number of servers >>> supported to something sensible, like 4 (using 4 different write_dm >>> p2m types) >>> 3. The interface can be designed such that we can extend support to >>> multiple servers when we need to. 
>>> >>> My answer to #2 was that there's no reason why using write_dm could be >>> used for both individual gpfns and ranges; there's no reason the >>> interface can't take a "start" and "count" argument, even if for the >>> time being "count" is almost always going to be 1. >>> >> >> Well, talking about "the 'count' always going to be 1". I doubt that. :) >> Statistics in XenGT shows that, GPU page tables are very likely to >> be allocated in contiguous gpfns. >> >>> Compare this to the downsides of the approach you're proposing: >>> 1. Using 40 bytes of hypervisor space per guest GPU pagetable page (as >>> opposed to using a bit in the existing p2m table) >>> 2. Walking down an RB tree with 8000 individual nodes to find out >>> which server to send the message to (rather than just reading the >>> value from the p2m table). >> >> 8K is an upper limit for the rangeset, in many cases the RB tree will >> not contain that many nodes. >> >>> 3. Needing to determine on a guest-by-guest basis whether to change the >> limit >>> 4. Needing to have an interface to make the limit even bigger, just in >>> case we find workloads that have even more GTTs. >>> >> >> Well, I have suggested in yesterday's reply. XenGT can choose not to >> change this limit even when workloads are getting heavy - with >> tradeoffs in the device model side. > > I assume this means that the emulator can 'unshadow' GTTs (I guess on an LRU basis) so that it can shadow new ones when the limit has been exhausted? > If so, how bad is performance likely to be if we live with a lower limit and take the hit of unshadowing if the guest GTTs become heavily fragmented? > Thank you, Paul. Well, I was told the emulator have approaches to delay the shadowing of the GTT till future GPU commands are submitted. By now, I'm not sure about the performance penalties if the limit is set too low. Although we are confident 8K is a secure limit, it seems still too high to be accepted. 
We will perform more experiments with this new approach to find a balance between the lowest limit and the XenGT performance. So another question is, if the value of this limit really matters, will a lower one be more acceptable (the current 256 being not enough)? Thanks Yu
> -----Original Message----- [snip] > >>> Compare this to the downsides of the approach you're proposing: > >>> 1. Using 40 bytes of hypervisor space per guest GPU pagetable page (as > >>> opposed to using a bit in the existing p2m table) > >>> 2. Walking down an RB tree with 8000 individual nodes to find out > >>> which server to send the message to (rather than just reading the > >>> value from the p2m table). > >> > >> 8K is an upper limit for the rangeset, in many cases the RB tree will > >> not contain that many nodes. > >> > >>> 3. Needing to determine on a guest-by-guest basis whether to change > the > >> limit > >>> 4. Needing to have an interface to make the limit even bigger, just in > >>> case we find workloads that have even more GTTs. > >>> > >> > >> Well, I have suggested in yesterday's reply. XenGT can choose not to > >> change this limit even when workloads are getting heavy - with > >> tradeoffs in the device model side. > > > > I assume this means that the emulator can 'unshadow' GTTs (I guess on an > LRU basis) so that it can shadow new ones when the limit has been > exhausted? > > If so, how bad is performance likely to be if we live with a lower limit and > take the hit of unshadowing if the guest GTTs become heavily fragmented? > > > Thank you, Paul. > > Well, I was told the emulator have approaches to delay the shadowing of > the GTT till future GPU commands are submitted. By now, I'm not sure > about the performance penalties if the limit is set too low. Although > we are confident 8K is a secure limit, it seems still too high to be > accepted. We will perform more experiments with this new approach to > find a balance between the lowest limit and the XenGT performance. > > So another question is, if value of this limit really matters, will a > lower one be more acceptable(the current 256 being not enough)? Well, do you know conclusively that 256 is absolutely not enough, or is it just too low for acceptable performance? 
Paul > > Thanks > Yu
On Wed, 2016-02-03 at 17:41 +0000, George Dunlap wrote: > But of course, since they aren't actually ranges but just gpfns, > they're scattered randomly throughout the guest physical address > space. (Possibly) stupid question: Since, AIUI, the in-guest GPU driver is XenGT-aware, could it not allocate a contiguous range of pages at start of day to use as GPU PTs? Or even just N contiguous regions; e.g. I think the "8K" refers to pages, which is 16 2M allocations, a far more manageable number of ranges to track than 8192 individual pages. Ian.
>>> On 04.02.16 at 10:38, <yu.c.zhang@linux.intel.com> wrote: > So another question is, if value of this limit really matters, will a > lower one be more acceptable(the current 256 being not enough)? If you've carefully read George's replies, a primary aspect is whether we wouldn't better revert commit f5a32c5b8e ("x86/HVM: differentiate IO/mem resources tracked by ioreq server"), as with the alternative approach we wouldn't even need HVMOP_IO_RANGE_WP_MEM afaict. And then the question you raise would become irrelevant. The part of the public interface being tools only allows some freedom in when to do this, but I think it would be a bad idea to ship 4.7 with this still in if you're not going to pursue this route. Jan
On Thu, Feb 4, 2016 at 8:51 AM, Yu, Zhang <yu.c.zhang@linux.intel.com> wrote: >> Going forward, we probably will, at some point, need to implement a >> parallel "p2t" structure to keep track of types -- and probably will >> whether end up implementing 4 separate write_dm types or not (for the >> reasons you describe). >> > > Thank you, George. Could you please elaborate more about the idea of > "p2t"? So the p2m table is a partially-sparse structure designed to map pfns to mfns. At the bottom is a 64-bit entry that contains certain bits specified by the hardware (mfn number, permissions, other features). There are at the moment extra bits that aren't used, and in these bits we store information about the pfn that Xen wants to know -- among other things, the 'type' of the gpfn. But as Andy pointed out, Intel are adding new features which take up more of these bits; and at the same time, Xen has more features for which using a p2m type is the most obvious solution. So the idea would be to have a separate structure, similar to the p2m table, which wouldn't be used by the hardware -- it would only be used by Xen to map pfn to type (or whatever other information it wanted about the gpfn). This does mean duplicating all the non-leaf nodes, as well as doing two sparse-tree-walks rather than just one when looking up gpfn information. -George
On Thu, Feb 4, 2016 at 9:38 AM, Yu, Zhang <yu.c.zhang@linux.intel.com> wrote: > On 2/4/2016 5:28 PM, Paul Durrant wrote: >> I assume this means that the emulator can 'unshadow' GTTs (I guess on an >> LRU basis) so that it can shadow new ones when the limit has been exhausted? >> If so, how bad is performance likely to be if we live with a lower limit >> and take the hit of unshadowing if the guest GTTs become heavily fragmented? >> > Thank you, Paul. > > Well, I was told the emulator have approaches to delay the shadowing of > the GTT till future GPU commands are submitted. By now, I'm not sure > about the performance penalties if the limit is set too low. Although > we are confident 8K is a secure limit, it seems still too high to be > accepted. We will perform more experiments with this new approach to > find a balance between the lowest limit and the XenGT performance. Just to check some of my assumptions: I assume that unlike memory accesses, your GPU hardware cannot 'recover' from faults in the GTTs. That is, for memory, you can take a page fault, fix up the pagetables, and then re-execute the original instruction; but so far I haven't heard of any devices being able to seamlessly re-execute a transaction after a fault. Is my understanding correct? If that is the case, then for every top-level value (whatever the equivalent of the CR3), you need to be able to shadow the entire GTT tree below it, yes? You can't use a trick that the memory shadow pagetables can use, of unshadowing parts of the tree and reshadowing them. So as long as the currently-in-use GTT tree contains no more than $LIMIT ranges, you can unshadow and reshadow; this will be slow, but strictly speaking correct. What do you do if the guest driver switches to a GTT such that the entire tree takes up more than $LIMIT entries? -George
On Thu, 2016-02-04 at 10:49 +0000, George Dunlap wrote: > On Thu, Feb 4, 2016 at 8:51 AM, Yu, Zhang <yu.c.zhang@linux.intel.com> > wrote: > > > Going forward, we probably will, at some point, need to implement a > > > parallel "p2t" structure to keep track of types -- and probably will > > > whether end up implementing 4 separate write_dm types or not (for the > > > reasons you describe). > > > > > > > Thank you, George. Could you please elaborate more about the idea of > > "p2t"? > > So the p2m table is a partially-sparse structure designed to map pfns > to mfns.  At the bottom is a 64-bit entry that contains certain bits > specified by the hardware (mfn number, permissions, other features). > There are at the moment extra bits that aren't use, and in these bits > we store information about the pfn that Xen wants to know -- among > other things, the 'type' of the gpfn. > > But as Andy pointed out, Intel are adding new features which take up > more of these bits; and at the same time, Xen has more features for > which using a p2m type is the most obvious solution.  So the idea > would be to have a separate structure, similar to the p2m table, but > wouldn't be used by the hardware -- it would only be used by Xen to > map pfn to type (or whatever other information it wanted about the > gpfn).  This does mean duplicating all the non-leaf nodes, as well as > doing two sparse-tree-walks rather  than just one when looking up gpfn > information. FWIW ARM already has such a structure to support xenaccess, since ARM had far fewer available bits to start with. It is not quite the same as above since it is only populated with pages for which xenaccess is in use rather than all pages (and is only consulted if we have reason to believe the page might be subject to xenaccess). FWIW it uses Xen's radix tree stuff. Ian.
On Thu, 2016-02-04 at 11:08 +0000, Ian Campbell wrote: > On Thu, 2016-02-04 at 10:49 +0000, George Dunlap wrote: > > On Thu, Feb 4, 2016 at 8:51 AM, Yu, Zhang <yu.c.zhang@linux.intel.com> > > wrote: > > > > Going forward, we probably will, at some point, need to implement a > > > > parallel "p2t" structure to keep track of types -- and probably > > > > will > > > > whether end up implementing 4 separate write_dm types or not (for > > > > the > > > > reasons you describe). > > > > > > > > > > Thank you, George. Could you please elaborate more about the idea of > > > "p2t"? > > > > So the p2m table is a partially-sparse structure designed to map pfns > > to mfns.  At the bottom is a 64-bit entry that contains certain bits > > specified by the hardware (mfn number, permissions, other features). > > There are at the moment extra bits that aren't use, and in these bits > > we store information about the pfn that Xen wants to know -- among > > other things, the 'type' of the gpfn. > > > > But as Andy pointed out, Intel are adding new features which take up > > more of these bits; and at the same time, Xen has more features for > > which using a p2m type is the most obvious solution.  So the idea > > would be to have a separate structure, similar to the p2m table, but > > wouldn't be used by the hardware -- it would only be used by Xen to > > map pfn to type (or whatever other information it wanted about the > > gpfn).  This does mean duplicating all the non-leaf nodes, as well as > > doing two sparse-tree-walks rather  than just one when looking up gpfn > > information. > > FWIW ARM already has such a structure to support xenaccess, since ARM had > far fewer available bits to start with. > > It is not quite the same as above since it is only populated with pages for > which xenaccess is in use rather than all pages (and is only consulted if > we have reason to believe the page might be subject to xenaccess). FWIW it > uses Xen's radix tree stuff. 
A second FWIW in case it is useful, but Linux on 32-bit ARM (older ones with 32-bit PTEs) allocates every PT page as a pair of pages, so it gets 32 software defined bits for every h/w visible pte at pte+4K. If you wanted to avoid the order-1 allocations which that implies then I suppose you could chain to the "pteext" page from the struct page of the h/w visible page. Ian.
Jan Beulich writes ("Re: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter max_wp_ram_ranges."): > On 04.02.16 at 10:38, <yu.c.zhang@linux.intel.com> wrote: > > So another question is, if value of this limit really matters, will a > > lower one be more acceptable(the current 256 being not enough)? > > If you've carefully read George's replies, [...] Thanks to George for the very clear explanation, and also to him for an illuminating in-person discussion. It is disturbing that as a result of me as a tools maintainer asking questions about what seems to me to be a troublesome user-visible control setting in libxl, we are now apparently revisiting lower layers of the hypervisor design, which have already been committed. While I find George's line of argument convincing, neither I nor George are maintainers of the relevant hypervisor code. I am not going to insist that anything in the hypervisor is done differently and am not trying to use my tools maintainer position to that end. Clearly there has been a failure of our workflow to consider and review everything properly together. But given where we are now, I think that this discussion about hypervisor internals is probably a distraction. Let me pose again some questions that I still don't have clear answers to: * Is it possible for libxl to somehow tell from the rest of the configuration that this larger limit should be applied ? AFAICT there is nothing in libxl directly involving vgpu. How can libxl be used to create a guest with vgpu enabled ? I had thought that this was done merely with the existing PCI passthrough configuration, but it now seems that somehow a second device model would have to be started. libxl doesn't have code to do that. * In the configurations where a larger number is needed, what larger limit is appropriate ? How should it be calculated ? AFAICT from the discussion, 8192 is a reasonable bet. Is everyone happy with it? Ian.
PS: Earlier I asked: * How do we know that this does not itself give an opportunity for hypervisor resource exhaustion attacks by guests ? (Note: if it _does_ give such an opportunity, this should be mentioned more clearly in the documentation.) * If we are talking about mmio ranges for ioreq servers, why do guests which do not use this feature have the ability to create them at all ? I now understand that these mmio ranges are created by the device model. Of course the device model needs to be able to create mmio ranges for the guest. And since they consume hypervisor resources, the number of these must be limited (device models not necessarily being trusted).
> -----Original Message----- > From: Ian Jackson [mailto:Ian.Jackson@eu.citrix.com] > Sent: 04 February 2016 13:34 > To: Jan Beulich > Cc: Zhang Yu; Andrew Cooper; George Dunlap; Ian Campbell; Paul Durrant; > Stefano Stabellini; Wei Liu; Kevin Tian; zhiyuan.lv@intel.com; xen- > devel@lists.xen.org; Keir (Xen.org) > Subject: Re: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter > max_wp_ram_ranges. > > Jan Beulich writes ("Re: [Xen-devel] [PATCH v3 3/3] tools: introduce > parameter max_wp_ram_ranges."): > > On 04.02.16 at 10:38, <yu.c.zhang@linux.intel.com> wrote: > > > So another question is, if value of this limit really matters, will a > > > lower one be more acceptable(the current 256 being not enough)? > > > > If you've carefully read George's replies, [...] > > Thanks to George for the very clear explanation, and also to him for > an illuminating in-person discussion. > > It is disturbing that as a result of me as a tools maintainer asking > questions about what seems to me to be a troublesome a user-visible > control setting in libxl, we are now apparently revisiting lower > layers of the hypervisor design, which have already been committed. > > While I find George's line of argument convincing, neither I nor > George are maintainers of the relevant hypervisor code. I am not > going to insist that anything in the hypervisor is done different and > am not trying to use my tools maintainer position to that end. > > Clearly there has been a failure of our workflow to consider and > review everything properly together. But given where we are now, I > think that this discussion about hypervisor internals is probably a > distraction. > > > Let pose again some questions that I still don't have clear answers > to: > > * Is it possible for libxl to somehow tell from the rest of the > configuration that this larger limit should be applied ? > > AFAICT there is nothing in libxl directly involving vgpu. How can > libxl be used to create a guest with vgpu enabled ? 
I had thought > that this was done merely with the existing PCI passthrough > configuration, but it now seems that somehow a second device model > would have to be started. libxl doesn't have code to do that. > AIUI if the setting of the increased limit is tied to provisioning a gvt-g instance for a VM then I don't think there needs to be extra information in the VM config. This seems like the most sensible thing to do. > * In the configurations where a larger number is needed, what larger > limit is appropriate ? How should it be calculated ? > > AFAICT from the discussion, 8192 is a reasonable bet. Is everyone > happy with it. > > Ian. > > PS: Earlier I asked: > > * How do we know that this does not itself give an opportunity for > hypervisor resource exhaustion attacks by guests ? (Note: if it > _does_ give such an opportunity, this should be mentioned more > clearly in the documentation.) > > * If we are talking about mmio ranges for ioreq servers, why do > guests which do not use this feature have the ability to create > them at all ? > > I now understand that these mmio ranges are created by the device > model. Of course the device model needs to be able to create mmio > ranges for the guest. And since they consume hypervisor resources, > the number of these must be limited (device models not necessarily > being trusted). ...but I think there is still an open question as to whether the toolstack is allowed to set that limit for a VM or not. IMO the toolstack should be allowed to set that limit when creating a domain. Paul
>>> On 04.02.16 at 14:33, <Ian.Jackson@eu.citrix.com> wrote: > Jan Beulich writes ("Re: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter > max_wp_ram_ranges."): >> On 04.02.16 at 10:38, <yu.c.zhang@linux.intel.com> wrote: >> > So another question is, if value of this limit really matters, will a >> > lower one be more acceptable(the current 256 being not enough)? >> >> If you've carefully read George's replies, [...] > > Thanks to George for the very clear explanation, and also to him for > an illuminating in-person discussion. > > It is disturbing that as a result of me as a tools maintainer asking > questions about what seems to me to be a troublesome a user-visible > control setting in libxl, we are now apparently revisiting lower > layers of the hypervisor design, which have already been committed. > > While I find George's line of argument convincing, neither I nor > George are maintainers of the relevant hypervisor code. I am not > going to insist that anything in the hypervisor is done different and > am not trying to use my tools maintainer position to that end. > > Clearly there has been a failure of our workflow to consider and > review everything properly together. But given where we are now, I > think that this discussion about hypervisor internals is probably a > distraction. While I recall George having made that alternative suggestion, both Yu and Paul having reservations against it made me not insist on that alternative. Instead I've been trying to limit some of the bad effects that the variant originally proposed brought with it. Clearly, with the more detailed reply George has now given (involving areas he is the maintainer for), I should have been more demanding towards the exploration of that alternative. That's clearly unfortunate, and I apologize for that, but such things happen. As to one of the patches already having been committed - I'm not worried about that at all. We can always revert, that's why the thing is called "unstable".
Everything below here I think was meant to be mainly directed at Yu instead of me. > Let pose again some questions that I still don't have clear answers > to: > > * Is it possible for libxl to somehow tell from the rest of the > configuration that this larger limit should be applied ? > > AFAICT there is nothing in libxl directly involving vgpu. How can > libxl be used to create a guest with vgpu enabled ? I had thought > that this was done merely with the existing PCI passthrough > configuration, but it now seems that somehow a second device model > would have to be started. libxl doesn't have code to do that. > > * In the configurations where a larger number is needed, what larger > limit is appropriate ? How should it be calculated ? > > AFAICT from the discussion, 8192 is a reasonable bet. Is everyone > happy with it. > > Ian. > > PS: Earlier I asked: > > * How do we know that this does not itself give an opportunity for > hypervisor resource exhaustion attacks by guests ? (Note: if it > _does_ give such an opportunity, this should be mentioned more > clearly in the documentation.) > > * If we are talking about mmio ranges for ioreq servers, why do > guests which do not use this feature have the ability to create > them at all ? > > I now understand that these mmio ranges are created by the device > model. Of course the device model needs to be able to create mmio > ranges for the guest. And since they consume hypervisor resources, > the number of these must be limited (device models not necessarily > being trusted). Which is why I have been so hesitant to accept these patches. Jan
>>> On 04.02.16 at 14:47, <Paul.Durrant@citrix.com> wrote: >> From: Ian Jackson [mailto:Ian.Jackson@eu.citrix.com] >> Sent: 04 February 2016 13:34 >> * Is it possible for libxl to somehow tell from the rest of the >> configuration that this larger limit should be applied ? >> >> AFAICT there is nothing in libxl directly involving vgpu. How can >> libxl be used to create a guest with vgpu enabled ? I had thought >> that this was done merely with the existing PCI passthrough >> configuration, but it now seems that somehow a second device model >> would have to be started. libxl doesn't have code to do that. >> > > AIUI if the setting of the increased limit is tied to provisioning a gvt-g > instance for a VM then I don't there needs to be extra information in the VM > config. These seems like the most sensible thing to do. I don't understand this: For one, it's still unclear to me on what basis it would be known that a given VM is a "gvt-g instance". And even if that's indeed derivable from something, the uncertainty about a workable upper bound on the number of WP ranges would still seem to demand the value to be specifiable separately... >> I now understand that these mmio ranges are created by the device >> model. Of course the device model needs to be able to create mmio >> ranges for the guest. And since they consume hypervisor resources, >> the number of these must be limited (device models not necessarily >> being trusted). > > ...but I think there is still an open question as to whether the toolstack > is allowed to set that limit for a VM or not. IMO the toolstack should be > allowed to set that limit when creating a domain. ... as you indicate here. Jan
> -----Original Message----- > From: Jan Beulich [mailto:JBeulich@suse.com] > Sent: 04 February 2016 14:13 > To: Paul Durrant > Cc: Andrew Cooper; George Dunlap; Ian Campbell; Ian Jackson; Stefano > Stabellini; Wei Liu; Kevin Tian; zhiyuan.lv@intel.com; Zhang Yu; xen- > devel@lists.xen.org; Keir (Xen.org) > Subject: RE: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter > max_wp_ram_ranges. > > >>> On 04.02.16 at 14:47, <Paul.Durrant@citrix.com> wrote: > >> From: Ian Jackson [mailto:Ian.Jackson@eu.citrix.com] > >> Sent: 04 February 2016 13:34 > >> * Is it possible for libxl to somehow tell from the rest of the > >> configuration that this larger limit should be applied ? > >> > >> AFAICT there is nothing in libxl directly involving vgpu. How can > >> libxl be used to create a guest with vgpu enabled ? I had thought > >> that this was done merely with the existing PCI passthrough > >> configuration, but it now seems that somehow a second device model > >> would have to be started. libxl doesn't have code to do that. > >> > > > > AIUI if the setting of the increased limit is tied to provisioning a gvt-g > > instance for a VM then I don't there needs to be extra information in the > VM > > config. These seems like the most sensible thing to do. > > I don't understand this: For one, it's still unclear to me on what basis > it would be known that a given VM is a "gvt-g instance". And even if > that's indeed derivable from something, the uncertainty about a > workable upper bound on the number of WP ranges would still seem > to demand the value to be specifiable separately... There are patches in the XenGT xen repo which add extra parameters into the VM config to allow libxl to provision a gvt-g instance (of which there are a finite number per GPU) for a VM. The increased limit could be applied when doing so and it may be feasible to determine (maybe from the version of the GPU h/w) what a reasonable limit is. 
Paul > > >> I now understand that these mmio ranges are created by the device > >> model. Of course the device model needs to be able to create mmio > >> ranges for the guest. And since they consume hypervisor resources, > >> the number of these must be limited (device models not necessarily > >> being trusted). > > > > ...but I think there is still an open question as to whether the toolstack > > is allowed to set that limit for a VM or not. IMO the toolstack should be > > allowed to set that limit when creating a domain. > > ... as you indicate here. > > Jan
Paul Durrant writes ("RE: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter max_wp_ram_ranges."): > There are patches in the XenGT xen repo which add extra parameters > into the VM config to allow libxl to provision a gvt-g instance (of > which there are a finite number per GPU) for a VM. The increased > limit could be applied when doing so and it may be feasible to > determine (maybe from the version of the GPU h/w) what a reasonable > limit is. Right. So in that case there would be no need for this user-visible knob (along with its sadly not very helpful documentation) to be exposed in the libxl stable API. Ian.
> -----Original Message----- > From: Ian Jackson [mailto:Ian.Jackson@eu.citrix.com] > Sent: 04 February 2016 15:06 > To: Paul Durrant > Cc: Jan Beulich; Andrew Cooper; George Dunlap; Ian Campbell; Stefano > Stabellini; Wei Liu; Kevin Tian; zhiyuan.lv@intel.com; Zhang Yu; xen- > devel@lists.xen.org; Keir (Xen.org) > Subject: RE: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter > max_wp_ram_ranges. > > Paul Durrant writes ("RE: [Xen-devel] [PATCH v3 3/3] tools: introduce > parameter max_wp_ram_ranges."): > > There are patches in the XenGT xen repo which add extra parameters > > into the VM config to allow libxl to provision a gvt-g instance (of > > which there are a finite number per GPU) for a VM. The increased > > limit could be applied when doing so and it may be feasible to > > determine (maybe from the version of the GPU h/w) what a reasonable > > limit is. > > Right. So in that case there would be no need for this user-visible > knob (along with its sadly not very helpful documentation) to be > exposed in the libxl stable API. > No, I really don't think that should be necessary. Libxl merely needs to configure a limit in the hypervisor appropriate to the gvt-g instance that is provisioned. Paul > Ian.
On 04/02/16 14:08, Jan Beulich wrote: >>>> On 04.02.16 at 14:33, <Ian.Jackson@eu.citrix.com> wrote: >> Jan Beulich writes ("Re: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter >> max_wp_ram_ranges."): >>> On 04.02.16 at 10:38, <yu.c.zhang@linux.intel.com> wrote: >>>> So another question is, if value of this limit really matters, will a >>>> lower one be more acceptable(the current 256 being not enough)? >>> >>> If you've carefully read George's replies, [...] >> >> Thanks to George for the very clear explanation, and also to him for >> an illuminating in-person discussion. >> >> It is disturbing that as a result of me as a tools maintainer asking >> questions about what seems to me to be a troublesome a user-visible >> control setting in libxl, we are now apparently revisiting lower >> layers of the hypervisor design, which have already been committed. >> >> While I find George's line of argument convincing, neither I nor >> George are maintainers of the relevant hypervisor code. I am not >> going to insist that anything in the hypervisor is done different and >> am not trying to use my tools maintainer position to that end. >> >> Clearly there has been a failure of our workflow to consider and >> review everything properly together. But given where we are now, I >> think that this discussion about hypervisor internals is probably a >> distraction. > > While I recall George having made that alternative suggestion, > both Yu and Paul having reservations against it made me not > insist on that alternative. Instead I've been trying to limit some > of the bad effects that the variant originally proposed brought > with it. Clearly, with the more detailed reply George has now > given (involving areas where he is the maintainer for), I should > have been more demanding towards the exploration of that > alternative. That's clearly unfortunate, and I apologize for that, > but such things happen. 
>
> As to one of the patches already having been committed - I'm not
> worried about that at all. We can always revert, that's why the
> thing is called "unstable".

It looks like I should have been more careful to catch up on the current
state of things before I started arguing again -- please accept my
apologies.

I see that patch 2/3 addresses the gpfn/io question in the commit
message by saying, "Previously, a new hypercall or subop was suggested
to map write-protected pages into ioreq server. However, it turned out
handler of this new hypercall would be almost the same with the existing
pair - HVMOP_[un]map_io_range_to_ioreq_server, and there's already a
type parameter in this hypercall. So no new hypercall defined, only a
new type is introduced."

And I see that 2/3 internally separates the WP_RAM type into a separate
rangeset, whose size can be adjusted separately.

This addresses my complaint about the interface using gpfns rather than
MMIO ranges (somewhat, anyway). Sorry for not acknowledging this at
first.

The question of whether to use RB-tree rangesets, radix trees (as
apparently ARM memaccess does), or p2m types is an internal
implementation question. I think p2m types is long-term the best way to
go, but it won't hurt to have the current implementation checked in, as
long as it doesn't have any impact on the stable interface.

At the moment, as far as I can tell, there's no way for libxl to even
run a version of qemu with XenGT enabled, so there's no real need for
libxl to be involved.

The purpose of having the limit would putatively be to prevent a guest
being able to trigger an exhaustion of hypervisor memory by inducing the
device model to mark an arbitrary number of ranges as mmio_dm.

Two angles on this.
First, assuming that limiting the number of ranges is what we want: I'm
not really a fan of using HVM_PARAMs for this, but as long as it's not
considered a public interface (i.e., it could go away or disappear and
everything would Just Work), then I wouldn't object.

Although I would ask: would it instead be suitable for now to just set
the default limit for WP_RAM to 8192 in the hypervisor, since we do
expect it to be tracking gpfn ranges rather than IO regions? And if we
determine in the future that more ranges are necessary, to then do the
work of moving it to using p2m types (or exposing a knob to adjust it)?

But (and this is the other angle): is simply marking a numerical limit
sufficient to avoid memory exhaustion? Is there a danger that, after
creating several guests such that Xen was running very low on memory,
a guest would (purposely or not) cause memory to be exhausted sometime
further after boot, causing a system-wide DoS (or just a general lack
of stability)?

In the shadow / hap memory case, the memory is pre-allocated up front,
which makes sure that nothing a guest does can cause Xen to run out of
memory once it's booted. Without pre-allocating it, it's still possible
that the admin might start up enough VMs that exhaustion is *possible*,
but won't be triggered until later (when the guest starts using more
GTTs).

Although in fact this really points to the need for a general overhaul
of how memory allocation on behalf of a domain is handled; that's a
bigger chunk of work.

But in any case, it seems to me that we can entirely avoid the question
of how many ranges might ever be necessary by starting with a fixed
limit in the hypervisor, and then moving to a p2m-type based
implementation if and when that becomes unsatisfactory.

-George
Hi George, On Thu, Feb 04, 2016 at 11:06:33AM +0000, George Dunlap wrote: > On Thu, Feb 4, 2016 at 9:38 AM, Yu, Zhang <yu.c.zhang@linux.intel.com> wrote: > > On 2/4/2016 5:28 PM, Paul Durrant wrote: > >> I assume this means that the emulator can 'unshadow' GTTs (I guess on an > >> LRU basis) so that it can shadow new ones when the limit has been exhausted? > >> If so, how bad is performance likely to be if we live with a lower limit > >> and take the hit of unshadowing if the guest GTTs become heavily fragmented? > >> > > Thank you, Paul. > > > > Well, I was told the emulator have approaches to delay the shadowing of > > the GTT till future GPU commands are submitted. By now, I'm not sure > > about the performance penalties if the limit is set too low. Although > > we are confident 8K is a secure limit, it seems still too high to be > > accepted. We will perform more experiments with this new approach to > > find a balance between the lowest limit and the XenGT performance. > > Just to check some of my assumptions: > > I assume that unlike memory accesses, your GPU hardware cannot > 'recover' from faults in the GTTs. That is, for memory, you can take a > page fault, fix up the pagetables, and then re-execute the original > instruction; but so far I haven't heard of any devices being able to > seamlessly re-execute a transaction after a fault. Is my > understanding correct? That's correct. At least for now, we do not have GPU page fault. > > If that is the case, then for every top-level value (whatever the > equivalent of the CR3), you need to be able to shadow the entire GTT > tree below it, yes? You can't use a trick that the memory shadow That's right also. > pagetables can use, of unshadowing parts of the tree and reshadowing > them. > > So as long as the currently-in-use GTT tree contains no more than > $LIMIT ranges, you can unshadow and reshadow; this will be slow, but > strictly speaking correct. 
>
> What do you do if the guest driver switches to a GTT such that the
> entire tree takes up more than $LIMIT entries?

GPU has some special properties different from CPU, which make things
easier. The GPU page table is constructed by the CPU and used by GPU
workloads. The GPU workload itself will not change the page table.
Meanwhile, GPU workload submission, in a virtualized environment, is
controlled by our device model. Then we can reshadow the whole table
every time before we submit a workload. That can reduce the total
number required for write protection, but with a performance impact,
because the GPU has to have more idle time waiting for the CPU. Hope
the info helps. Thanks!

Regards,
-Zhiyuan

>
> -George
> From: Ian Campbell [mailto:ian.campbell@citrix.com]
> Sent: Thursday, February 04, 2016 6:06 PM
>
> On Wed, 2016-02-03 at 17:41 +0000, George Dunlap wrote:
> > But of course, since they aren't actually ranges but just gpfns,
> > they're scattered randomly throughout the guest physical address
> > space.
>
> (Possibly) stupid question:
>
> Since, AIUI, the in-guest GPU driver is XenGT aware could it not allocate a
> contiguous range of pages at start of day to use as GPU PTs? Or even just N
> contiguous regions, e.g. I think the "8K" refers to pages, which is 16 2M
> allocations, which is a far more manageable number of ranges to track than
> 8192 individual pages.
>

We add XenGT awareness to the guest driver only at the minimum required
level (e.g. to handle address space ballooning due to graphics memory
partitioning), which only impacts hardware-specific initialization code
(think of a vgpu as a new SKU). However, we're hesitant to touch other,
more general driver code, such as the allocation policy you mentioned
earlier. Changing that part specifically for one SKU needs a very
strong reason to convince the driver maintainer. And we cannot assume
the above allocation policy can always be met.

Thanks
Kevin
> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Thursday, February 04, 2016 10:13 PM
>
> >>> On 04.02.16 at 14:47, <Paul.Durrant@citrix.com> wrote:
> >> From: Ian Jackson [mailto:Ian.Jackson@eu.citrix.com]
> >> Sent: 04 February 2016 13:34
> >> * Is it possible for libxl to somehow tell from the rest of the
> >>   configuration that this larger limit should be applied ?
> >>
> >>   AFAICT there is nothing in libxl directly involving vgpu. How can
> >>   libxl be used to create a guest with vgpu enabled ? I had thought
> >>   that this was done merely with the existing PCI passthrough
> >>   configuration, but it now seems that somehow a second device model
> >>   would have to be started. libxl doesn't have code to do that.
> >>
> >
> > AIUI if the setting of the increased limit is tied to provisioning a gvt-g
> > instance for a VM then I don't think there needs to be extra information
> > in the VM config. This seems like the most sensible thing to do.
>
> I don't understand this: For one, it's still unclear to me on what basis
> it would be known that a given VM is a "gvt-g instance". And even if
> that's indeed derivable from something, the uncertainty about a
> workable upper bound on the number of WP ranges would still seem
> to demand the value to be specifiable separately...
>

We'll introduce a different parameter, e.g. 'gvt-g', separate from the
existing passthrough one. So, for this question at least, the toolstack
can know whether a VM is provisioned with a vgpu based on that
parameter in the config file.

Thanks
Kevin
> From: Lv, Zhiyuan > Sent: Friday, February 05, 2016 10:01 AM > > Hi George, > > On Thu, Feb 04, 2016 at 11:06:33AM +0000, George Dunlap wrote: > > On Thu, Feb 4, 2016 at 9:38 AM, Yu, Zhang <yu.c.zhang@linux.intel.com> wrote: > > > On 2/4/2016 5:28 PM, Paul Durrant wrote: > > >> I assume this means that the emulator can 'unshadow' GTTs (I guess on an > > >> LRU basis) so that it can shadow new ones when the limit has been exhausted? > > >> If so, how bad is performance likely to be if we live with a lower limit > > >> and take the hit of unshadowing if the guest GTTs become heavily fragmented? > > >> > > > Thank you, Paul. > > > > > > Well, I was told the emulator have approaches to delay the shadowing of > > > the GTT till future GPU commands are submitted. By now, I'm not sure > > > about the performance penalties if the limit is set too low. Although > > > we are confident 8K is a secure limit, it seems still too high to be > > > accepted. We will perform more experiments with this new approach to > > > find a balance between the lowest limit and the XenGT performance. > > > > Just to check some of my assumptions: > > > > I assume that unlike memory accesses, your GPU hardware cannot > > 'recover' from faults in the GTTs. That is, for memory, you can take a > > page fault, fix up the pagetables, and then re-execute the original > > instruction; but so far I haven't heard of any devices being able to > > seamlessly re-execute a transaction after a fault. Is my > > understanding correct? > > That's correct. At least for now, we do not have GPU page fault. > > > > > If that is the case, then for every top-level value (whatever the > > equivalent of the CR3), you need to be able to shadow the entire GTT > > tree below it, yes? You can't use a trick that the memory shadow > > That's right also. > > > pagetables can use, of unshadowing parts of the tree and reshadowing > > them. 
> >
> > So as long as the currently-in-use GTT tree contains no more than
> > $LIMIT ranges, you can unshadow and reshadow; this will be slow, but
> > strictly speaking correct.
> >
> > What do you do if the guest driver switches to a GTT such that the
> > entire tree takes up more than $LIMIT entries?
>
> GPU has some special properties different from CPU, which make things
> easier. The GPU page table is constructed by the CPU and used by GPU
> workloads. The GPU workload itself will not change the page table.
> Meanwhile, GPU workload submission, in a virtualized environment, is
> controlled by our device model. Then we can reshadow the whole table
> every time before we submit a workload. That can reduce the total
> number required for write protection, but with a performance impact,
> because the GPU has to have more idle time waiting for the CPU. Hope
> the info helps. Thanks!
>

Put another way: it is fully under mediation when a GPU page table
(GTT) will be referenced by the GPU, so there is plenty of room to
optimize the existing shadowing (which always shadows all recognized
GPU page tables), e.g. by shadowing only the active one when a VM is
scheduled in. It's a performance matter, not a correctness issue.

This is why Yu asked earlier whether we can just set a default limit
which is good for the majority of use cases, while extending our device
model to drop/recreate some shadow tables when the limit is hit. I
think this matches how today's CPU shadow page tables are implemented,
which also have a limit on how many shadow pages are allowed per-VM.

Thanks
Kevin
> From: Paul Durrant [mailto:Paul.Durrant@citrix.com] > Sent: Thursday, February 04, 2016 11:51 PM > > > -----Original Message----- > > From: Ian Jackson [mailto:Ian.Jackson@eu.citrix.com] > > Sent: 04 February 2016 15:06 > > To: Paul Durrant > > Cc: Jan Beulich; Andrew Cooper; George Dunlap; Ian Campbell; Stefano > > Stabellini; Wei Liu; Kevin Tian; zhiyuan.lv@intel.com; Zhang Yu; xen- > > devel@lists.xen.org; Keir (Xen.org) > > Subject: RE: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter > > max_wp_ram_ranges. > > > > Paul Durrant writes ("RE: [Xen-devel] [PATCH v3 3/3] tools: introduce > > parameter max_wp_ram_ranges."): > > > There are patches in the XenGT xen repo which add extra parameters > > > into the VM config to allow libxl to provision a gvt-g instance (of > > > which there are a finite number per GPU) for a VM. The increased > > > limit could be applied when doing so and it may be feasible to > > > determine (maybe from the version of the GPU h/w) what a reasonable > > > limit is. > > > > Right. So in that case there would be no need for this user-visible > > knob (along with its sadly not very helpful documentation) to be > > exposed in the libxl stable API. > > > > No, I really don't think that should be necessary. Libxl merely needs to configure a limit in > the hypervisor appropriate to the gvt-g instance that is provisioned. > Agree. Thanks Kevin
> From: George Dunlap [mailto:george.dunlap@citrix.com] > Sent: Friday, February 05, 2016 1:12 AM > > On 04/02/16 14:08, Jan Beulich wrote: > >>>> On 04.02.16 at 14:33, <Ian.Jackson@eu.citrix.com> wrote: > >> Jan Beulich writes ("Re: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter > >> max_wp_ram_ranges."): > >>> On 04.02.16 at 10:38, <yu.c.zhang@linux.intel.com> wrote: > >>>> So another question is, if value of this limit really matters, will a > >>>> lower one be more acceptable(the current 256 being not enough)? > >>> > >>> If you've carefully read George's replies, [...] > >> > >> Thanks to George for the very clear explanation, and also to him for > >> an illuminating in-person discussion. > >> > >> It is disturbing that as a result of me as a tools maintainer asking > >> questions about what seems to me to be a troublesome a user-visible > >> control setting in libxl, we are now apparently revisiting lower > >> layers of the hypervisor design, which have already been committed. > >> > >> While I find George's line of argument convincing, neither I nor > >> George are maintainers of the relevant hypervisor code. I am not > >> going to insist that anything in the hypervisor is done different and > >> am not trying to use my tools maintainer position to that end. > >> > >> Clearly there has been a failure of our workflow to consider and > >> review everything properly together. But given where we are now, I > >> think that this discussion about hypervisor internals is probably a > >> distraction. > > > > While I recall George having made that alternative suggestion, > > both Yu and Paul having reservations against it made me not > > insist on that alternative. Instead I've been trying to limit some > > of the bad effects that the variant originally proposed brought > > with it. 
Clearly, with the more detailed reply George has now > > given (involving areas where he is the maintainer for), I should > > have been more demanding towards the exploration of that > > alternative. That's clearly unfortunate, and I apologize for that, > > but such things happen. > > > > As to one of the patches already having for committed - I'm not > > worried about that at all. We can always revert, that's why the > > thing is called "unstable". > > It looks like I should have been more careful to catch up on the current > state of things before I started arguing again -- please accept my > apologies. Thanks George for your careful thinking. > > I see that patch 2/3 addresses the gpfn/io question in the commit > message by saying, "Previously, a new hypercall or subop was suggested > to map write-protected pages into ioreq server. However, it turned out > handler of this new hypercall would be almost the same with the existing > pair - HVMOP_[un]map_io_range_to_ioreq_server, and there's already a > type parameter in this hypercall. So no new hypercall defined, only a > new type is introduced." > > And I see that 2/3 internally separates the WP_RAM type into a separate > rangeset, whose size can be adjusted separately. > > This addresses my complaint about the interface using gpfns rather than > MMIO ranges as an interface (somewhat anyway). Sorry for not > acknowledging this at first. > > The question of the internal implementation -- whether to use RB tree > rangesets, or radix trees (as apparently ARM memaccess does) or p2m > types -- is an internal implementation question. I think p2m types is > long-term the best way to go, but it won't hurt to have the current > implementation checked in, as long as it doesn't have any impacts on the > stable interface. I'm still trying to understand your suggestion vs. this one. Today we already have a p2m_mmio_write_dm type. It's there already, and any write fault hitting that type will be delivered to ioreq server. 
Then the next open question is how an ioreq server could know whether
it should handle this request or not, which is why some tracking
structures (either RB or radix trees) are created to maintain that
specific information. It's under the assumption that multiple ioreq
servers co-exist, so a loop over all ioreq servers is required to
identify the right target. And multiple ioreq servers are a real case
in XenGT, because our vGPU device model is in the kernel, as part of
the Intel i915 graphics driver. So at least two ioreq servers already
exist, with one routing to XenGT in Dom0 kernel space and the other to
the default Qemu in Dom0 user space.

In your long-term approach with p2m types, it looks like you are
proposing encoding the ioreq server ID in the p2m type directly (e.g.
in 4 bits), which would eliminate the need for tracking on the ioreq
server side, so the whole security concern is gone. And there is no
limit at all. But because available p2m bits are limited, as Andrew
pointed out, it might only be reasonable to implement this approach
when a new p2t structure is added, which is why we consider it a
long-term approach. Please correct me if the above understanding is
wrong.

> At the moment, as far as I can tell, there's no way for libxl to even
> run a version of qemu with XenGT enabled, so there's no real need for
> libxl to be involved.

There's no way today because we haven't upstreamed all the toolstack
changes yet, but we should still discuss the requirement as we've been
doing in this thread, right? Or do you mean something different?

> The purpose of having the limit would putatively be to prevent a guest
> being able to trigger an exhaustion of hypervisor memory by inducing the
> device model to mark an arbitrary number of ranges as mmio_dm.
>
> Two angles on this.
>
> First, assuming that limiting the number of ranges is what we want: I'm
> not really a fan of using HVM_PARAMs for this, but as long as it's not
> considered a public interface (i.e., it could go away or disappear and
> everything would Just Work), then I wouldn't object.
>
> Although I would ask: would it instead be suitable for now to just set
> the default limit for WP_RAM to 8192 in the hypervisor, since we do
> expect it to be tracking gpfn ranges rather than IO regions? And if we
> determine in the future that more ranges are necessary, to then do the
> work of moving it to using p2m types (or exposing a knob to adjust it)?
>
> But (and this is the other angle): is simply marking a numerical limit
> sufficient to avoid memory exhaustion? Is there a danger that, after
> creating several guests such that Xen was running very low on memory,
> a guest would (purposely or not) cause memory to be exhausted sometime
> further after boot, causing a system-wide DoS (or just a general lack
> of stability)?
>
> In the shadow / hap memory case, the memory is pre-allocated up front,
> which makes sure that nothing a guest does can cause Xen to run out of
> memory once it's booted. Without pre-allocating it, it's still possible
> that the admin might start up enough VMs that exhaustion is *possible*,
> but won't be triggered until later (when the guest starts using more GTTs).

That's a valid concern, though I believe one that is not cleanly
addressed in many places. :-)

> Although in fact this really points to the need for a general overhaul
> of how memory allocation on behalf of a domain is handled; that's a
> bigger chunk of work.
>
> But in any case, it seems to me that we can entirely avoid the question
> of how many ranges might ever be necessary by starting with a fixed
> limit in the hypervisor, and then moving to a p2m-type based
> implementation if and when that becomes unsatisfactory.
>

I agree with this approach. Let's see whether we can get consensus
from others so as to reach a conclusion.

Actually, another quick thought; please help check whether it makes
sense or is just too tricky. Could we have an intermediate option, on
the way toward the p2m-type based one, by assuming only one ioreq
server can handle write_dm related faults?
If that's the case:
- we don't need to add more bits to the p2m type;
- XenGT can register as the default ioreq server for write_dm;
- when a write_dm fault is triggered, we loop over all ioreq servers to
  find the default one and then deliver through that path without any
  further check;
- we may not need to change the ioreq server interface at all: assume
  the ioreq server which receives the 1st write-protection request from
  a device model is the default one.

It is not very clean, but the code change is minimal. It's also easily
extensible to support multiple ioreq servers with write_dm requirements
in the future. But I haven't thought through all the corner cases
carefully. Comments welcome. :-)

Thanks
Kevin
>>> On 04.02.16 at 18:12, <george.dunlap@citrix.com> wrote:
> Two angles on this.
>
> First, assuming that limiting the number of ranges is what we want: I'm
> not really a fan of using HVM_PARAMs for this, but as long as it's not
> considered a public interface (i.e., it could go away or disappear and
> everything would Just Work), then I wouldn't object.

The parameter setting interface generally is guest exposed, it's just
that many parameters can't be set by guests for themselves. Of course
the new #define could be put inside a Xen/tools conditional.

> Although I would ask: would it instead be suitable for now to just set
> the default limit for WP_RAM to 8192 in the hypervisor, since we do
> expect it to be tracking gpfn ranges rather than IO regions? And if we
> determine in the future that more ranges are necessary, to then do the
> work of moving it to using p2m types (or exposing a knob to adjust it)?

That's what collides with disaggregation (as pointed out before): Any
tool stack component could then create wp-ram pages in guests it
controls, no matter whether those guests actually have a need for such.
I continue to think that if we indeed go this route, then the host
admin should be required to explicitly state that the risks are
acceptable (by allowing only certain guests to actually create [many]
such pages).

> But (and this is the other angle): is simply marking a numerical limit
> sufficient to avoid memory exhaustion? Is there a danger that, after
> creating several guests such that Xen was running very low on memory,
> a guest would (purposely or not) cause memory to be exhausted sometime
> further after boot, causing a system-wide DoS (or just a general lack
> of stability)?

The guest itself can't, but tool stack components other than fully
privileged ones could, and that's still something we need to avoid.
> In the shadow / hap memory case, the memory is pre-allocated up front, > which makes sure that nothing a guest does can cause Xen to run out of > memory once it's booted. Without pre-allocating it, it's still possible > that the admin might start up enough VMs that exhaustion is *possible*, > but won't be triggered until later (when the guest starts using more GTTs). > > Although in fact this really points to the need for a general overhaul > in how memory allocation on behalf of a domain is handled in general; > that's a bigger chunk of work. Well, right now there's pretty little allocation happening at run time afaict, i.e. having a couple of pages available will generally keep things going smoothly. (Once upon a time it was that there were - iirc - no active runtime allocations at all, i.e. such had occurred only in the context of things like domctls, where failure could be dealt with by ballooning a few pages out of some domain and retrying. I'm afraid that's not the case anymore nowadays, and even if it was would collide with disaggregation, as a tool stack component legitimately issuing a domctl may not have the privileges to balloon other than the domain it controls, in particular not Dom0.) > But in any case, it seems to me that we can entirely avoid the question > of how many ranges might ever be necessary by starting with a fixed > limit in the hypervisor, and then moving to a p2m-type based > implementation if and when that becomes unsatisfactory. With all of the above I think we should rather explore the p2m-type based approach, in particular the suggestion Kevin has made to direct all p2m_mmio_write_dm (a name which appears to have been badly chosen, considering that we're talking about RAM pages here) write accesses to the default ioreq server (utilizing that upstream qemu doesn't itself register as such). Jan
>>> On 05.02.16 at 04:44, <kevin.tian@intel.com> wrote: > This is why Yu mentioned earlier whether we can just set a default > limit which is good for majority of use cases, while extending our > device mode to drop/recreate some shadow tables upon the limitation > is hit. I think this matches how today's CPU shadow page table is > implemented, which also has a limitation of how many shadow pages > are allowed per-VM. Except that un-shadowing some victim page in order to shadow one to make forward progress can there be done in the page fault handler, i.e. we don't need to up front determine the set of pages to be shadowed for a certain operation to complete. Jan
On 2/4/2016 7:06 PM, George Dunlap wrote: > On Thu, Feb 4, 2016 at 9:38 AM, Yu, Zhang <yu.c.zhang@linux.intel.com> wrote: >> On 2/4/2016 5:28 PM, Paul Durrant wrote: >>> I assume this means that the emulator can 'unshadow' GTTs (I guess on an >>> LRU basis) so that it can shadow new ones when the limit has been exhausted? >>> If so, how bad is performance likely to be if we live with a lower limit >>> and take the hit of unshadowing if the guest GTTs become heavily fragmented? >>> >> Thank you, Paul. >> >> Well, I was told the emulator has approaches to delay the shadowing of >> the GTT till future GPU commands are submitted. By now, I'm not sure >> about the performance penalties if the limit is set too low. Although >> we are confident 8K is a secure limit, it still seems too high to be >> accepted. We will perform more experiments with this new approach to >> find a balance between the lowest limit and the XenGT performance. > > Just to check some of my assumptions: > > I assume that unlike memory accesses, your GPU hardware cannot > 'recover' from faults in the GTTs. That is, for memory, you can take a > page fault, fix up the pagetables, and then re-execute the original > instruction; but so far I haven't heard of any devices being able to > seamlessly re-execute a transaction after a fault. Is my > understanding correct? > Yes
Like memory virtualization, IIUC, besides write-protecting the guest page tables, we can also track updates to them when cr3 is written or when a tlb flush occurs. We can consider optimizing our GPU device model to achieve a similar goal, e.g. when a root pointer (like cr3) to the page table is written and when a set of commands is submitted (both situations are triggered by MMIO operations). But in consideration of performance, we may probably still need to wp all the page tables when they are first created. It requires a lot of optimization work on the device model side to find a balance between a minimal set of wp-ed gpfns and reasonable performance. We'd like to have a try. :) Yu
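The "unshadow when the limit is exhausted" behaviour Paul asks about above can be modelled as a small bounded shadow pool with eviction. Everything below is a hypothetical sketch: the real shadowing lives in the GVT-g device model, would track gfn ranges rather than single gfns, and would evict on an LRU basis rather than this crude FIFO.

```c
#include <assert.h>

#define SHADOW_LIMIT 4 /* stand-in for the contested 256/8K range limit */

static int shadowed[SHADOW_LIMIT];  /* gfns of write-protected PT pages */
static unsigned int nr_shadowed;
static unsigned int nr_evictions;

/*
 * Write-protect (shadow) one guest page-table page; once the limit is
 * reached, drop the oldest shadow first to make room.
 */
void shadow_gtt_page(int gfn)
{
    unsigned int i;

    for ( i = 0; i < nr_shadowed; i++ )
        if ( shadowed[i] == gfn )
            return;                          /* already shadowed */

    if ( nr_shadowed == SHADOW_LIMIT )
    {
        for ( i = 1; i < nr_shadowed; i++ )  /* evict slot 0 (oldest) */
            shadowed[i - 1] = shadowed[i];
        nr_shadowed--;
        nr_evictions++;
    }
    shadowed[nr_shadowed++] = gfn;
}

unsigned int shadow_count(void)   { return nr_shadowed; }
unsigned int eviction_count(void) { return nr_evictions; }
```

Every eviction forces a later re-shadow of the dropped page, which is the performance hit Paul and Yu are weighing against a lower limit.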
On 2/5/2016 1:12 AM, George Dunlap wrote: > On 04/02/16 14:08, Jan Beulich wrote: >>>>> On 04.02.16 at 14:33, <Ian.Jackson@eu.citrix.com> wrote: >>> Jan Beulich writes ("Re: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter >>> max_wp_ram_ranges."): >>>> On 04.02.16 at 10:38, <yu.c.zhang@linux.intel.com> wrote: >>>>> So another question is, if value of this limit really matters, will a >>>>> lower one be more acceptable(the current 256 being not enough)? >>>> >>>> If you've carefully read George's replies, [...] >>> >>> Thanks to George for the very clear explanation, and also to him for >>> an illuminating in-person discussion. >>> >>> It is disturbing that as a result of me as a tools maintainer asking >>> questions about what seems to me to be a troublesome user-visible >>> control setting in libxl, we are now apparently revisiting lower >>> layers of the hypervisor design, which have already been committed. >>> >>> While I find George's line of argument convincing, neither I nor >>> George are maintainers of the relevant hypervisor code. I am not >>> going to insist that anything in the hypervisor is done differently and >>> am not trying to use my tools maintainer position to that end. >>> >>> Clearly there has been a failure of our workflow to consider and >>> review everything properly together. But given where we are now, I >>> think that this discussion about hypervisor internals is probably a >>> distraction. >> >> While I recall George having made that alternative suggestion, >> both Yu and Paul having reservations against it made me not >> insist on that alternative. Instead I've been trying to limit some >> of the bad effects that the variant originally proposed brought >> with it. Clearly, with the more detailed reply George has now >> given (involving areas he is the maintainer for), I should >> have been more demanding towards the exploration of that >> alternative. 
That's clearly unfortunate, and I apologize for that, >> but such things happen. >> >> As to one of the patches already having been committed - I'm not >> worried about that at all. We can always revert, that's why the >> thing is called "unstable". > > It looks like I should have been more careful to catch up on the current > state of things before I started arguing again -- please accept my > apologies. > In fact, I need to say thank you all for your patience and suggestions. I'm thrilled to see XenGT is receiving so much attention. :) > I see that patch 2/3 addresses the gpfn/io question in the commit > message by saying, "Previously, a new hypercall or subop was suggested > to map write-protected pages into ioreq server. However, it turned out > handler of this new hypercall would be almost the same with the existing > pair - HVMOP_[un]map_io_range_to_ioreq_server, and there's already a > type parameter in this hypercall. So no new hypercall defined, only a > new type is introduced." > > And I see that 2/3 internally separates the WP_RAM type into a separate > rangeset, whose size can be adjusted separately. > > This addresses my complaint about the interface using gpfns rather than > MMIO ranges as an interface (somewhat anyway). Sorry for not > acknowledging this at first. > > The question of the internal implementation -- whether to use RB tree > rangesets, or radix trees (as apparently ARM memaccess does) or p2m > types -- is an internal implementation question. I think p2m types is > long-term the best way to go, but it won't hurt to have the current > implementation checked in, as long as it doesn't have any impacts on the > stable interface. > > At the moment, as far as I can tell, there's no way for libxl to even > run a version of qemu with XenGT enabled, so there's no real need for > libxl to be involved. > I agree. 
> The purpose of having the limit would putatively be to prevent a guest > being able to trigger an exhaustion of hypervisor memory by inducing the > device model to mark an arbitrary number of ranges as mmio_dm. > > Two angles on this. > > First, assuming that limiting the number of ranges is what we want: I'm > not really a fan of using HVM_PARAMs for this, but as long as it's not > considered a public interface (i.e., it could go away or disappear and > everything would Just Work), then I wouldn't object. > > Although I would ask: would it instead be suitable for now to just set > the default limit for WP_RAM to 8196 in the hypervisor, since we do > expect it to be tracking gpfn ranges rather than IO regions? And if we That is what we suggested in v9. But Jan proposed we leave this option to the admin. And to some extent, I can understand his concern. > determine in the future that more ranges are necessary, to then do the > work of moving it to using p2m types (or exposing a knob to adjust it)? > > But (and this is the other angle): is simply marking a numerical limit > sufficient to avoid memory exhaustion? Is there a danger that after > creating several guests, such that Xen was now running very low on > memory, that a guest would (purposely or not) cause memory to be > exhausted sometime further after boot, causing a system-wide DoS (or > just general lack of stability)? > This worry sounds reasonable. So from this point of view, I guess the value of this limit does matter. Setting it to 256 will not cause any trouble, but setting it to a huge one would make this limitation useless. Previously, I had thought 8K is an acceptable one for XenGT to run smoothly, and for the Xen heap not to be over-consumed, but it seems this value is not small enough to be acceptable. :) Yu
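Mechanically, the numerical limit under debate is just an accounting cap on the per-ioreq-server rangeset (an rb-tree based structure in Xen). This sketch models only that counting behaviour, with an invented function name:

```c
#include <assert.h>
#include <errno.h>

struct rangeset {
    unsigned int nr_ranges;  /* ranges currently tracked */
    unsigned int max_ranges; /* e.g. 256 for MMIO, 8K proposed for wp-ram */
};

/*
 * Account one more tracked range; refuse once the cap is reached so a
 * buggy or hostile device model cannot keep growing Xen's heap usage.
 */
int rangeset_account_add(struct rangeset *r)
{
    if ( r->nr_ranges >= r->max_ranges )
        return -ENOSPC;
    r->nr_ranges++;
    return 0;
}
```

Whether max_ranges should be 256, 8K, or admin-tunable is exactly the policy question of this thread; the check itself is trivial.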
On 2/5/2016 12:18 PM, Tian, Kevin wrote: >> From: George Dunlap [mailto:george.dunlap@citrix.com] >> Sent: Friday, February 05, 2016 1:12 AM >> >> On 04/02/16 14:08, Jan Beulich wrote: >>>>>> On 04.02.16 at 14:33, <Ian.Jackson@eu.citrix.com> wrote: >>>> Jan Beulich writes ("Re: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter >>>> max_wp_ram_ranges."): >>>>> On 04.02.16 at 10:38, <yu.c.zhang@linux.intel.com> wrote: >>>>>> So another question is, if value of this limit really matters, will a >>>>>> lower one be more acceptable(the current 256 being not enough)? >>>>> >>>>> If you've carefully read George's replies, [...] >>>> >>>> Thanks to George for the very clear explanation, and also to him for >>>> an illuminating in-person discussion. >>>> >>>> It is disturbing that as a result of me as a tools maintainer asking >>>> questions about what seems to me to be a troublesome a user-visible >>>> control setting in libxl, we are now apparently revisiting lower >>>> layers of the hypervisor design, which have already been committed. >>>> >>>> While I find George's line of argument convincing, neither I nor >>>> George are maintainers of the relevant hypervisor code. I am not >>>> going to insist that anything in the hypervisor is done different and >>>> am not trying to use my tools maintainer position to that end. >>>> >>>> Clearly there has been a failure of our workflow to consider and >>>> review everything properly together. But given where we are now, I >>>> think that this discussion about hypervisor internals is probably a >>>> distraction. >>> >>> While I recall George having made that alternative suggestion, >>> both Yu and Paul having reservations against it made me not >>> insist on that alternative. Instead I've been trying to limit some >>> of the bad effects that the variant originally proposed brought >>> with it. 
Clearly, with the more detailed reply George has now >>> given (involving areas where he is the maintainer for), I should >>> have been more demanding towards the exploration of that >>> alternative. That's clearly unfortunate, and I apologize for that, >>> but such things happen. >>> >>> As to one of the patches already having for committed - I'm not >>> worried about that at all. We can always revert, that's why the >>> thing is called "unstable". >> >> It looks like I should have been more careful to catch up on the current >> state of things before I started arguing again -- please accept my >> apologies. > > Thanks George for your careful thinking. > >> >> I see that patch 2/3 addresses the gpfn/io question in the commit >> message by saying, "Previously, a new hypercall or subop was suggested >> to map write-protected pages into ioreq server. However, it turned out >> handler of this new hypercall would be almost the same with the existing >> pair - HVMOP_[un]map_io_range_to_ioreq_server, and there's already a >> type parameter in this hypercall. So no new hypercall defined, only a >> new type is introduced." >> >> And I see that 2/3 internally separates the WP_RAM type into a separate >> rangeset, whose size can be adjusted separately. >> >> This addresses my complaint about the interface using gpfns rather than >> MMIO ranges as an interface (somewhat anyway). Sorry for not >> acknowledging this at first. >> >> The question of the internal implementation -- whether to use RB tree >> rangesets, or radix trees (as apparently ARM memaccess does) or p2m >> types -- is an internal implementation question. I think p2m types is >> long-term the best way to go, but it won't hurt to have the current >> implementation checked in, as long as it doesn't have any impacts on the >> stable interface. > > I'm still trying to understand your suggestion vs. this one. Today we > already have a p2m_mmio_write_dm type. 
It's there already, and any > write fault hitting that type will be delivered to an ioreq server. Then the next > open question is how an ioreq server could know whether it should handle this > request or not, which is why some tracking structures (either RB/radix) > are created to maintain that specific information. It's under the assumption > that multiple ioreq servers co-exist, so a loop check on all ioreq servers > is required to identify the right target. And multiple ioreq servers are > a real case in XenGT, because our vGPU device model is in the kernel, as > part of the Intel i915 graphics driver. So at least two ioreq servers already > exist, with one routing to XenGT in Dom0 kernel space and the other > to the default Qemu in Dom0 user space. > > In your long-term approach with p2m types, it looks like you are proposing > encoding the ioreq server ID in the p2m type directly (e.g. 4 bits), which then > eliminates the need for tracking on the ioreq server side, so the whole > security concern is gone. And no limitation at all. Because available > p2m bits are limited, as Andrew pointed out, it might be reasonable > to implement this approach when a new p2t structure is added, which > is why we consider it as a long-term approach. > > Please correct me if the above understanding is wrong. > >> >> At the moment, as far as I can tell, there's no way for libxl to even >> run a version of qemu with XenGT enabled, so there's no real need for >> libxl to be involved. > > no way because we haven't upstreamed all toolstack changes yet, but > we should still discuss the requirement as we've been doing in this > thread, right? or do you mean something different? > >> >> The purpose of having the limit would putatively be to prevent a guest >> being able to trigger an exhaustion of hypervisor memory by inducing the >> device model to mark an arbitrary number of ranges as mmio_dm. >> >> Two angles on this. 
>> >> First, assuming that limiting the number of ranges is what we want: I'm >> not really a fan of using HVM_PARAMs for this, but as long as it's not >> considered a public interface (i.e., it could go away or disappear and >> everything would Just Work), then I wouldn't object. >> >> Although I would ask: would it instead be suitable for now to just set >> the default limit for WP_RAM to 8196 in the hypervisor, since we do >> expect it to be tracking gpfn ranges rather than IO regions? And if we >> determine in the future that more ranges are necessary, to then do the >> work of moving it to using p2m types (or exposing a knob to adjust it)? >> >> But (and this the other angle): is simply marking a numerical limit >> sufficient to avoid memory exhaustion? Is there a danger that after >> creating several guests, such that Xen was now running very low on >> memory, that a guest would (purposely or not) cause memory to be >> exhausted sometime further after boot, causing a system-wide DoS (or >> just general lack of stability)? >> >> In the shadow / hap memory case, the memory is pre-allocated up front, >> which makes sure that nothing a guest does can cause Xen to run out of >> memory once it's booted. Without pre-allocating it, it's still possible >> that the admin might start up enough VMs that exhaustion is *possible*, >> but won't be triggered until later (when the guest starts using more GTTs). > > that's a valid concern though I believe not cleanly addressed in many places. :-) > >> >> Although in fact this really points to the need for a general overhaul >> in how memory allocation on behalf of a domain is handled in general; >> that's a bigger chunk of work. >> >> But in any case, it seems to me that we can entirely avoid the question >> of how many ranges might ever be necessary by starting with a fixed >> limit in the hypervisor, and then moving to a p2m-type based >> implementation if and when that becomes unsatisfactory. 
>> > > I agree with this approach. Let's see whether we can get consensus from > others to make a conclusion. > > Actually another quick thought. Please help check whether it makes sense > or is just more tricky. > > Could we have an intermediate option toward the p2m-type based option, by > assuming only one ioreq server can handle write_dm related faults? If that's > the case: > > - we don't need to add more bits in p2m type; > - XenGT can register as the default ioreq server for write_dm; > - when a write-dm fault is triggered, we'll loop all ioreq servers to find the > default one and then deliver through that path w/o further check; > - we may not change ioreq server interface at all. Assume the ioreq server > which receives the 1st write-protection request from the device model is the > default one; Thank you, Kevin. With this suggestion, we do not need to worry about the limits in the rangeset, and therefore no performance worries about traversing the ranges. One disadvantage is that we need to add a field like wp_ioreq_server in struct hvm_domain, yet there's already a default_ioreq_server for qemu. So I'd like to hear Paul's opinion about this. By now, we all agree that this new max_wp_ram_ranges in the tool stack is not suitable. And to give a summary of the choices we have: 1> George's suggestion: "starting with a fixed limit in the hypervisor, and then moving to a p2m-type based implementation if and when that becomes unsatisfactory". My understanding is this looks like the v9 implementation. 2> If hard-coding the limit at 8K is acceptable, we can use a xengt flag in the hvm configuration, which leaves this choice to the admin without exposing too much hypervisor implementation detail. 3> Kevin's suggestion: leverage the p2m type, which no longer needs the underlying rangeset. 4> I'll continue to do some path-finding on the device model side, to see if, after optimization, this limit can be set to a lower value. Anyway, thank you all for your attention and suggestions. 
This weekend is the Chinese Spring Festival, so my mail replies will be slow in the next 2 weeks. Wish you all good luck in the new year! :) Yu
> -----Original Message----- > From: Jan Beulich [mailto:JBeulich@suse.com] > Sent: 05 February 2016 08:33 > To: George Dunlap > Cc: Andrew Cooper; Ian Campbell; Paul Durrant; Stefano Stabellini; Wei Liu; > Ian Jackson; Kevin Tian; zhiyuan.lv@intel.com; Zhang Yu; xen- > devel@lists.xen.org; Keir (Xen.org) > Subject: Re: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter > max_wp_ram_ranges. > > >>> On 04.02.16 at 18:12, <george.dunlap@citrix.com> wrote: > > Two angles on this. > > > > First, assuming that limiting the number of ranges is what we want: I'm > > not really a fan of using HVM_PARAMs for this, but as long as it's not > > considered a public interface (i.e., it could go away or disappear and > > everything would Just Work), then I wouldn't object. > > The parameter setting interface generally is guest exposed, it's just > that many parameters can't be set by guests for themselves. Of > course the new #define could be put inside a Xen/tools conditional. > > > Although I would ask: would it instead be suitable for now to just set > > the default limit for WP_RAM to 8196 in the hypervisor, since we do > > expect it to be tracking gpfn ranges rather than IO regions? And if we > > determine in the future that more ranges are necessary, to then do the > > work of moving it to using p2m types (or exposing a knob to adjust it)? > > That's what collides with disaggregation (as pointed out before): > Any tool stack component could then create wp-ram pages in > guests it controls, no matter whether those guests actually have > a need for such. I continue to think that if we indeed got this > route, then the host admin should be required to explicitly state > that the risks are acceptable (by allowing only certain guests to > actually create [many] of such pages). > > > But (and this the other angle): is simply marking a numerical limit > > sufficient to avoid memory exhaustion? 
Is there a danger that after > > creating several guests, such that Xen was now running very low on > > memory, that a guest would (purposely or not) cause memory to be > > exhausted sometime further after boot, causing a system-wide DoS (or > > just general lack of stability)? > > The guest itself can't, but other than fully privileged tool stack > components could, and that's still something we need to avoid. > > > In the shadow / hap memory case, the memory is pre-allocated up front, > > which makes sure that nothing a guest does can cause Xen to run out of > > memory once it's booted. Without pre-allocating it, it's still possible > > that the admin might start up enough VMs that exhaustion is *possible*, > > but won't be triggered until later (when the guest starts using more GTTs). > > > > Although in fact this really points to the need for a general overhaul > > in how memory allocation on behalf of a domain is handled in general; > > that's a bigger chunk of work. > > Well, right now there's pretty little allocation happening at run time > afaict, i.e. having a couple of pages available will generally keep > things going smoothly. (Once upon a time it was that there were - > iirc - no active runtime allocations at all, i.e. such had occurred only > in the context of things like domctls, where failure could be dealt > with by ballooning a few pages out of some domain and retrying. > I'm afraid that's not the case anymore nowadays, and even if it > was would collide with disaggregation, as a tool stack component > legitimately issuing a domctl may not have the privileges to > balloon other than the domain it controls, in particular not Dom0.) > > > But in any case, it seems to me that we can entirely avoid the question > > of how many ranges might ever be necessary by starting with a fixed > > limit in the hypervisor, and then moving to a p2m-type based > > implementation if and when that becomes unsatisfactory. 
> > With all of the above I think we should rather explore the p2m-type > based approach, in particular the suggestion Kevin has made to > direct all p2m_mmio_write_dm (a name which appears to have been > badly chosen, considering that we're talking about RAM pages > here) write accesses to the default ioreq server (utilizing that > upstream qemu doesn't itself register as such). > Utilizing the default server is a backwards step. GVT-g would have to use the old HVM_PARAM mechanism to cause its emulator to become default. I think a more appropriate mechanism would be for p2m_mmio_write_dm to become something like 'p2m_ioreq_server_write' and then have a hypercall to allow it to be mapped to a particular ioreq server. Obviously only one could claim it but, with a p2t, the bit could be re-purposed to simply mean 'go look in the p2t' for more information and then the p2t could be structured to allow emulations to be steered to one of many ioreq servers (for read and/or write emulation). Paul > Jan
>>> On 05.02.16 at 10:24, <Paul.Durrant@citrix.com> wrote: > Utilizing the default server is a backwards step. GVT-g would have to use the > old HVM_PARAM mechanism to cause it's emulator to become default. I think a > more appropriate mechanism would be p2m_mmio_write_dm to become something > like 'p2m_ioreq_server_write' and then have a hypercall to allow it to be > mapped to a particular ioreq server. > Obviously only one could claim it but, with a p2t, the bit could be > re-purposed to simply mean 'go look in the p2t' for more information and > then the p2t could be structured to allow emulations to be steered to one of > many ioreq servers (for read and/or write emulation). Sounds reasonable. Jan
On Fri, Feb 5, 2016 at 3:44 AM, Tian, Kevin <kevin.tian@intel.com> wrote: >> > So as long as the currently-in-use GTT tree contains no more than >> > $LIMIT ranges, you can unshadow and reshadow; this will be slow, but >> > strictly speaking correct. >> > >> > What do you do if the guest driver switches to a GTT such that the >> > entire tree takes up more than $LIMIT entries? >> >> GPU has some special properties different from CPU, which make things >> easier. The GPU page table is constructed by CPU and used by GPU >> workloads. GPU workload itself will not change the page table. >> Meanwhile, GPU workload submission, in virtualized environment, is >> controlled by our device model. Then we can reshadow the whole table >> every time before we submit workload. That can reduce the total number >> required for write protection but with performance impact, because GPU >> has to have more idle time waiting for CPU. Hope the info helps. >> Thanks! >> > > Putting it another way, it's fully under mediation when a GPU page table > (GTT) will be referenced by the GPU, so there's plenty of room to > optimize existing shadowing (always shadowing all recognized GPU page > tables), e.g. shadowing only the active one when a VM is scheduled in. It's > a performance matter but no correctness issue. > > This is why Yu mentioned earlier whether we can just set a default > limit which is good for majority of use cases, while extending our > device model to drop/recreate some shadow tables when the limitation > is hit. I think this matches how today's CPU shadow page table is > implemented, which also has a limitation of how many shadow pages > are allowed per-VM. I don't think you've understood my question (or maybe I still don't understand the situation properly). 
(This is often just referred to as 'cr3', since it's the value you write into the cr3 register on x86 processors.) I'm assuming that the structure is similar for your GPU translation tables -- that a single GTT is effectively a "tree" sort of like a process address space for an OS. And it sounds like what you're saying is: suppose we have 10 different GTTs (i.e., an entire tree / gpu thread), and each one requires 1024 ranges to shadow. In that case, a limit of 8192 ranges means we can only keep 8 of the 10 actually shadowed at any one time. This is not optimal, since it will occasionally mean unshadowing an entire GTT and re-shadowing another one, but it will work, because we can always make sure that the currently-active GTT is shadowed. My question is, suppose a single GTT / gpu thread / tree has 9000 ranges. It would be trivial for an attacker to break into the operating system and *construct* such a tree, but it's entirely possible that due to a combination of memory fragmentation and very large usage, the normal driver might accidentally create such a GTT. In that case, the device model will not be able to write-protect all the pages in the single GTT, and thus will not be able to correctly track changes to the currently-active GTT. What does your device model do in that case? -George
On Fri, Feb 5, 2016 at 9:24 AM, Paul Durrant <Paul.Durrant@citrix.com> wrote: > Utilizing the default server is a backwards step. GVT-g would have to use the old HVM_PARAM mechanism to cause it's emulator to become default. I think a more appropriate mechanism would be p2m_mmio_write_dm to become something like 'p2m_ioreq_server_write' and then have a hypercall to allow it to be mapped to a particular ioreq server. > Obviously only one could claim it but, with a p2t, the bit could be re-purposed to simply mean 'go look in the p2t' for more information and then the p2t could be structured to allow emulations to be steered to one of many ioreq servers (for read and/or write emulation). Right; I had in mind that Xen would allow at any given time a max of N ioreq servers to register for mmio_write_dm ranges, first-come first-served; with 'N' being '1' to begin with. If a second ioreq server requested mmio_write_dm functionality, it would get -EBUSY. This would allow their current setup (one qemu dm which doesn't do mmio_write_dm, one xengt dm which does) to work without needing to worry any more about how many pages might need to be tracked (either for efficiency or correctness). We could then extend this to some larger number (4 seems pretty reasonable to me) either by adding an extra 3 types, or by some other method (such as the one Paul suggests). -George
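George's first-come first-served scheme, with N = 1, amounts to something like the following. The interface is purely hypothetical, sketched here for illustration; it is not an existing Xen hypercall.

```c
#include <assert.h>
#include <errno.h>

static int write_dm_claimant = -1; /* ioreq server id owning the type, or -1 */

/*
 * First-come first-served claim of the mmio_write_dm type: the first
 * ioreq server to ask gets it, later claimants are refused.
 */
int claim_write_dm(int server_id)
{
    if ( write_dm_claimant >= 0 && write_dm_claimant != server_id )
        return -EBUSY;
    write_dm_claimant = server_id;
    return 0;
}

void release_write_dm(int server_id)
{
    if ( write_dm_claimant == server_id )
        write_dm_claimant = -1;
}
```

With the claimant recorded once, fault delivery needs no per-page tracking at all, which is the point of the proposal; extending N beyond 1 would need extra types or a p2t as Paul suggests.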
> -----Original Message----- > From: dunlapg@gmail.com [mailto:dunlapg@gmail.com] On Behalf Of > George Dunlap > Sent: 05 February 2016 11:14 > To: Paul Durrant > Cc: Jan Beulich; George Dunlap; Kevin Tian; Wei Liu; Ian Campbell; Andrew > Cooper; Zhang Yu; xen-devel@lists.xen.org; Stefano Stabellini; > zhiyuan.lv@intel.com; Ian Jackson; Keir (Xen.org) > Subject: Re: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter > max_wp_ram_ranges. > > On Fri, Feb 5, 2016 at 9:24 AM, Paul Durrant <Paul.Durrant@citrix.com> > wrote: > > Utilizing the default server is a backwards step. GVT-g would have to use > the old HVM_PARAM mechanism to cause it's emulator to become default. I > think a more appropriate mechanism would be p2m_mmio_write_dm to > become something like 'p2m_ioreq_server_write' and then have a hypercall > to allow it to be mapped to a particular ioreq server. > > Obviously only one could claim it but, with a p2t, the bit could be re- > purposed to simply mean 'go look in the p2t' for more information and then > the p2t could be structured to allow emulations to be steered to one of many > ioreq servers (for read and/or write emulation). > > Right; I had in mind that Xen would allow at any given time a max of N > ioreq servers to register for mmio_write_dm ranges, first-come > first-served; with 'N' being '1' to begin with. If a second ioreq > server requested mmio_write_dm functionality, it would get -EBUSY. > This would allow their current setup (one qemu dm which doesn't do > mmio_write_dm, one xengt dm which does) to work without needing to > worry any more about how many pages might need to be tracked (either > for efficiency or correctness). > > We could then extend this to some larger number (4 seems pretty > reasonable to me) either by adding an extra 3 types, or by some other > method (such as the one Paul suggests). I think it would be best to do away with the 'write dm' name though. 
I would like to see it be possible to steer reads+writes, as well as writes (and maybe just reads?) to a particular ioreq server based on type information. So maybe we just call the existing type 'p2m_ioreq_server' and then, in the absence of a p2t, hardcode this to go to whichever emulator makes the new TBD hypercall. I think we need a proper design at this point. Given that it's Chinese New Year maybe I'll have a stab in Yu's absence. Paul > > -George
Hi George, On Fri, Feb 05, 2016 at 11:05:39AM +0000, George Dunlap wrote: > On Fri, Feb 5, 2016 at 3:44 AM, Tian, Kevin <kevin.tian@intel.com> wrote: > >> > So as long as the currently-in-use GTT tree contains no more than > >> > $LIMIT ranges, you can unshadow and reshadow; this will be slow, but > >> > strictly speaking correct. > >> > > >> > What do you do if the guest driver switches to a GTT such that the > >> > entire tree takes up more than $LIMIT entries? > >> > >> GPU has some special properties different from CPU, which make things > >> easier. The GPU page table is constructed by CPU and used by GPU > >> workloads. GPU workload itself will not change the page table. > >> Meanwhile, GPU workload submission, in virtualized environment, is > >> controled by our device model. Then we can reshadow the whole table > >> every time before we submit workload. That can reduce the total number > >> required for write protection but with performance impact, because GPU > >> has to have more idle time waiting for CPU. Hope the info helps. > >> Thanks! > >> > > > > Putting in another way, it's fully under mediation when a GPU page table > > (GTT) will be referenced by the GPU, so there're plenty of room to > > optimize existing shadowing (always shadowing all recognized GPU page > > tables), e.g. shadowing only active one when a VM is scheduled in. It's > > a performance matter but no correctness issue. > > > > This is why Yu mentioned earlier whether we can just set a default > > limit which is good for majority of use cases, while extending our > > device mode to drop/recreate some shadow tables upon the limitation > > is hit. I think this matches how today's CPU shadow page table is > > implemented, which also has a limitation of how many shadow pages > > are allowed per-VM. > > I don't think you've understood my question (or maybe I still don't > understood the situation properly). 
> > So in memory pagetables, there's a "tree" that contains a single > top-level page, which points to other pages, which defines one address > space (usually corresponding to one process or thread). (This is > often just refered to as 'cr3', since it's the value you write into > the cr3 register on x86 processors.) I'm assuming that the structure > is similar for your GPU translation tables -- that a single GTT is > effectively a "tree" sort of like a process address space for an OS. > > And it sounds like what you're saying is: suppose we have 10 different > GTTs (i.e., an entire tree / gpu thread), and each one require 1024 > ranges to shadow. In that case, a limit of 8192 ranges means we can > only keep 8 of the ten actually shadowed at any one time. This is not > optimal, since it will occasionally mean unshadowing an entire GTT and > re-shadowing another one, but it will work, because we can always make > sure that the currently-active GTT is shadowed. > > My question is, suppose a single GTT / gpu thread / tree has 9000 > ranges. It would be trivial for an attacker to break into the > operating system and *construct* such a tree, but it's entirely > possible that due to a combination of memory fragmentation and very > large usage, the normal driver might accidentally create such a GTT. > In that case, the device model will not be able to write-protect all > the pages in the single GTT, and thus will not be able to correctly > track changes to the currently-active GTT. What does your device > model do in that case? We can live with the partially write protected tree. That is because GPU's workload execution is controlled by the device model. We still have chance to update the shadow page table before we submit workload to GPU. The impact is performance not correctness. Thanks! -Zhiyuan > > -George
On Fri, Feb 5, 2016 at 3:13 PM, Zhiyuan Lv <zhiyuan.lv@intel.com> wrote: >> My question is, suppose a single GTT / gpu thread / tree has 9000 >> ranges. It would be trivial for an attacker to break into the >> operating system and *construct* such a tree, but it's entirely >> possible that due to a combination of memory fragmentation and very >> large usage, the normal driver might accidentally create such a GTT. >> In that case, the device model will not be able to write-protect all >> the pages in the single GTT, and thus will not be able to correctly >> track changes to the currently-active GTT. What does your device >> model do in that case? > > We can live with the partially write protected tree. That is because > GPU's workload execution is controlled by the device model. We still > have chance to update the shadow page table before we submit workload > to GPU. The impact is performance not correctness. Thanks! Right -- so it's actually never a hard limit. That's good to know. -George
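[Editor's note] To make the "performance, not correctness" point of this exchange concrete, here is a minimal sketch of a bounded write-protection budget with re-validation at workload submission. All names and structures are illustrative; this is not Xen or GVT-g code, just a model of the behaviour Zhiyuan describes.

```c
/* Model of the discussion above: write-protection ranges are bounded,
 * so a large GTT may be only partially shadowed; because the device
 * model controls workload submission, the unprotected remainder is
 * simply re-read before each submission. A performance cost, never a
 * correctness failure. All names here are illustrative. */
#include <assert.h>
#include <stddef.h>

#define MAX_WP_RANGES 8192          /* per-domain limit under discussion */

struct gtt {
    size_t nr_pages;                /* pages backing this GTT tree */
    size_t nr_protected;            /* pages we managed to write-protect */
};

/* Write-protect as much of the tree as the range budget allows. */
static void shadow_gtt(struct gtt *g, size_t *ranges_in_use)
{
    while (g->nr_protected < g->nr_pages && *ranges_in_use < MAX_WP_RANGES) {
        g->nr_protected++;
        (*ranges_in_use)++;
    }
}

/* Pages we could not trap may have changed behind our back, so they
 * must be re-read (re-shadowed) here, before the GPU runs; protected
 * pages were tracked and need no re-read. Returns the number of pages
 * that had to be re-validated for this submission. */
static size_t submit_workload(const struct gtt *g)
{
    return g->nr_pages - g->nr_protected;
}
```

Under this model, George's 9000-range GTT against a nearly exhausted budget is shadowed only partially, and `submit_workload()` reports the re-read cost paid on every submission instead of failing.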
> From: Paul Durrant [mailto:Paul.Durrant@citrix.com] > Sent: Friday, February 05, 2016 7:24 PM > > > -----Original Message----- > > From: dunlapg@gmail.com [mailto:dunlapg@gmail.com] On Behalf Of > > George Dunlap > > Sent: 05 February 2016 11:14 > > To: Paul Durrant > > Cc: Jan Beulich; George Dunlap; Kevin Tian; Wei Liu; Ian Campbell; Andrew > > Cooper; Zhang Yu; xen-devel@lists.xen.org; Stefano Stabellini; > > zhiyuan.lv@intel.com; Ian Jackson; Keir (Xen.org) > > Subject: Re: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter > > max_wp_ram_ranges. > > > > On Fri, Feb 5, 2016 at 9:24 AM, Paul Durrant <Paul.Durrant@citrix.com> > > wrote: > > > Utilizing the default server is a backwards step. GVT-g would have to use > > the old HVM_PARAM mechanism to cause it's emulator to become default. I > > think a more appropriate mechanism would be p2m_mmio_write_dm to > > become something like 'p2m_ioreq_server_write' and then have a hypercall > > to allow it to be mapped to a particular ioreq server. > > > Obviously only one could claim it but, with a p2t, the bit could be re- > > purposed to simply mean 'go look in the p2t' for more information and then > > the p2t could be structured to allow emulations to be steered to one of many > > ioreq servers (for read and/or write emulation). > > > > Right; I had in mind that Xen would allow at any given time a max of N > > ioreq servers to register for mmio_write_dm ranges, first-come > > first-served; with 'N' being '1' to begin with. If a second ioreq > > server requested mmio_write_dm functionality, it would get -EBUSY. > > This would allow their current setup (one qemu dm which doesn't do > > mmio_write_dm, one xengt dm which does) to work without needing to > > worry any more about how many pages might need to be tracked (either > > for efficiency or correctness). 
> > > > We could then extend this to some larger number (4 seems pretty > > reasonable to me) either by adding an extra 3 types, or by some other > > method (such as the one Paul suggests). > > I think it would be best to do away with the 'write dm' name though. I would like to see it > be possible to steer reads+writes, as well as writes (and maybe just reads?) to a particular > ioreq server based on type information. So maybe we just call the existing type > 'p2m_ioreq_server' and then, in the absence of a p2t, hardcode this to go to whichever > emulator makes the new TBD hypercall. > I think we need a proper design at this point. Given that it's Chinese New Year maybe I'll > have a stab in Yu's absence. > Hi, Paul, what about your progress on this? My feeling is that we do not need a new hypercall to explicitly claim whether an ioreq server wants to handle write requests. It can be implicitly marked upon whether a specific page is requested for write-protection through a specific ioreq channel, and then that ioreq server will claim the attribute automatically. Thanks Kevin
> -----Original Message----- > From: Tian, Kevin [mailto:kevin.tian@intel.com] > Sent: 16 February 2016 07:23 > To: Paul Durrant; George Dunlap > Cc: Jan Beulich; George Dunlap; Wei Liu; Ian Campbell; Andrew Cooper; > Zhang Yu; xen-devel@lists.xen.org; Stefano Stabellini; Lv, Zhiyuan; Ian > Jackson; Keir (Xen.org) > Subject: RE: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter > max_wp_ram_ranges. > > > From: Paul Durrant [mailto:Paul.Durrant@citrix.com] > > Sent: Friday, February 05, 2016 7:24 PM > > > > > -----Original Message----- > > > From: dunlapg@gmail.com [mailto:dunlapg@gmail.com] On Behalf Of > > > George Dunlap > > > Sent: 05 February 2016 11:14 > > > To: Paul Durrant > > > Cc: Jan Beulich; George Dunlap; Kevin Tian; Wei Liu; Ian Campbell; > Andrew > > > Cooper; Zhang Yu; xen-devel@lists.xen.org; Stefano Stabellini; > > > zhiyuan.lv@intel.com; Ian Jackson; Keir (Xen.org) > > > Subject: Re: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter > > > max_wp_ram_ranges. > > > > > > On Fri, Feb 5, 2016 at 9:24 AM, Paul Durrant <Paul.Durrant@citrix.com> > > > wrote: > > > > Utilizing the default server is a backwards step. GVT-g would have to > use > > > the old HVM_PARAM mechanism to cause it's emulator to become > default. I > > > think a more appropriate mechanism would be p2m_mmio_write_dm to > > > become something like 'p2m_ioreq_server_write' and then have a > hypercall > > > to allow it to be mapped to a particular ioreq server. > > > > Obviously only one could claim it but, with a p2t, the bit could be re- > > > purposed to simply mean 'go look in the p2t' for more information and > then > > > the p2t could be structured to allow emulations to be steered to one of > many > > > ioreq servers (for read and/or write emulation). > > > > > > Right; I had in mind that Xen would allow at any given time a max of N > > > ioreq servers to register for mmio_write_dm ranges, first-come > > > first-served; with 'N' being '1' to begin with. 
If a second ioreq > > > server requested mmio_write_dm functionality, it would get -EBUSY. > > > This would allow their current setup (one qemu dm which doesn't do > > > mmio_write_dm, one xengt dm which does) to work without needing to > > > worry any more about how many pages might need to be tracked (either > > > for efficiency or correctness). > > > > > > We could then extend this to some larger number (4 seems pretty > > > reasonable to me) either by adding an extra 3 types, or by some other > > > method (such as the one Paul suggests). > > > > I think it would be best to do away with the 'write dm' name though. I > would like to see it > > be possible to steer reads+writes, as well as writes (and maybe just reads?) > to a particular > > ioreq server based on type information. So maybe we just call the existing > type > > 'p2m_ioreq_server' and then, in the absence of a p2t, hardcode this to go > to whichever > > emulator makes the new TBD hypercall. > > I think we need a proper design at this point. Given that it's Chinese New > Year maybe I'll > > have a stab in Yu's absence. > > > > Hi, Paul, what about your progress on this? > > My feeling is that we do not need a new hypercall to explicitly claim > whether a ioreq server wants to handle write requests. It can be > implicitly marked upon whether a specific page is requested for > write-protection through a specific ioreq channel, and then that > ioreq server will claim the attribute automatically. Hi Kevin, Is there a hypercall to do that? Maybe I'm missing something but I was under the impression that the only way to set write protection was via an HVMOP_set_mem_type and that does not carry an ioreq server id. I'm afraid I have made little progress due to the distractions of trying to get some patches into Linux but my thoughts are around replacing the HVM_mmio_write_dm with something like HVM_emulate_0 (i.e. 
the zero-th example of a type that requires emulation, to be followed by others in future) and then add a hypercall along the lines of HVMOP_map_mem_type_to_ioreq_server which will take an ioreq server id, a type and flags saying whether it wishes to handle reads and/or writes to that type. Thoughts (anyone)? Paul > > Thanks > Kevin
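[Editor's note] For concreteness, the interface Paul proposes here might look roughly like the following. The structure layout, flag names, and the first-come first-served claim rule are all guesses assembled from this thread; HVMOP_map_mem_type_to_ioreq_server does not exist in Xen at this point.

```c
/* Hypothetical sketch of the HVMOP_map_mem_type_to_ioreq_server call
 * discussed above; every name and value here is illustrative, not ABI. */
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define HVMMEM_ioreq_server 6               /* illustrative type value */
#define XEN_HVMOP_IOREQ_MEM_ACCESS_READ  (1u << 0)
#define XEN_HVMOP_IOREQ_MEM_ACCESS_WRITE (1u << 1)

struct xen_hvm_map_mem_type_to_ioreq_server {
    uint16_t id;       /* ioreq server id making the claim */
    uint16_t type;     /* memory type to claim, e.g. HVMMEM_ioreq_server */
    uint32_t flags;    /* which accesses to steer to the server */
};

/* First-come first-served, as in George's N=1 scheme: the second
 * server to claim the same type gets -EBUSY. */
static int claimed_id = -1;

static int map_mem_type_to_ioreq_server(
    const struct xen_hvm_map_mem_type_to_ioreq_server *op)
{
    if (op->type != HVMMEM_ioreq_server)
        return -EINVAL;
    if (claimed_id >= 0 && claimed_id != op->id)
        return -EBUSY;
    claimed_id = op->id;
    return 0;
}
```

The scheme can later grow beyond one server by adding more claimable types, without changing the hypercall shape.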
>>> On 16.02.16 at 09:50, <Paul.Durrant@citrix.com> wrote: >> -----Original Message----- >> From: Tian, Kevin [mailto:kevin.tian@intel.com] >> Sent: 16 February 2016 07:23 >> To: Paul Durrant; George Dunlap >> Cc: Jan Beulich; George Dunlap; Wei Liu; Ian Campbell; Andrew Cooper; >> Zhang Yu; xen-devel@lists.xen.org; Stefano Stabellini; Lv, Zhiyuan; Ian >> Jackson; Keir (Xen.org) >> Subject: RE: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter >> max_wp_ram_ranges. >> >> > From: Paul Durrant [mailto:Paul.Durrant@citrix.com] >> > Sent: Friday, February 05, 2016 7:24 PM >> > >> > > -----Original Message----- >> > > From: dunlapg@gmail.com [mailto:dunlapg@gmail.com] On Behalf Of >> > > George Dunlap >> > > Sent: 05 February 2016 11:14 >> > > To: Paul Durrant >> > > Cc: Jan Beulich; George Dunlap; Kevin Tian; Wei Liu; Ian Campbell; >> Andrew >> > > Cooper; Zhang Yu; xen-devel@lists.xen.org; Stefano Stabellini; >> > > zhiyuan.lv@intel.com; Ian Jackson; Keir (Xen.org) >> > > Subject: Re: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter >> > > max_wp_ram_ranges. >> > > >> > > On Fri, Feb 5, 2016 at 9:24 AM, Paul Durrant <Paul.Durrant@citrix.com> >> > > wrote: >> > > > Utilizing the default server is a backwards step. GVT-g would have to >> use >> > > the old HVM_PARAM mechanism to cause it's emulator to become >> default. I >> > > think a more appropriate mechanism would be p2m_mmio_write_dm to >> > > become something like 'p2m_ioreq_server_write' and then have a >> hypercall >> > > to allow it to be mapped to a particular ioreq server. >> > > > Obviously only one could claim it but, with a p2t, the bit could be re- >> > > purposed to simply mean 'go look in the p2t' for more information and >> then >> > > the p2t could be structured to allow emulations to be steered to one of >> many >> > > ioreq servers (for read and/or write emulation). 
>> > > >> > > Right; I had in mind that Xen would allow at any given time a max of N >> > > ioreq servers to register for mmio_write_dm ranges, first-come >> > > first-served; with 'N' being '1' to begin with. If a second ioreq >> > > server requested mmio_write_dm functionality, it would get -EBUSY. >> > > This would allow their current setup (one qemu dm which doesn't do >> > > mmio_write_dm, one xengt dm which does) to work without needing to >> > > worry any more about how many pages might need to be tracked (either >> > > for efficiency or correctness). >> > > >> > > We could then extend this to some larger number (4 seems pretty >> > > reasonable to me) either by adding an extra 3 types, or by some other >> > > method (such as the one Paul suggests). >> > >> > I think it would be best to do away with the 'write dm' name though. I >> would like to see it >> > be possible to steer reads+writes, as well as writes (and maybe just > reads?) >> to a particular >> > ioreq server based on type information. So maybe we just call the existing >> type >> > 'p2m_ioreq_server' and then, in the absence of a p2t, hardcode this to go >> to whichever >> > emulator makes the new TBD hypercall. >> > I think we need a proper design at this point. Given that it's Chinese New >> Year maybe I'll >> > have a stab in Yu's absence. >> > >> >> Hi, Paul, what about your progress on this? >> >> My feeling is that we do not need a new hypercall to explicitly claim >> whether a ioreq server wants to handle write requests. It can be >> implicitly marked upon whether a specific page is requested for >> write-protection through a specific ioreq channel, and then that >> ioreq server will claim the attribute automatically. > > Hi Kevin, > > Is there a hypercall to do that? Maybe I'm missing something but I was under > the impression that the only way to set write protection was via an > HVMOP_set_mem_type and that does not carry an ioreq server id. 
> > I'm afraid I have made little progress due to the distractions of trying get > some patches into Linux but my thoughts are around replacing the > HVM_mmio_write_dm with something like HVM_emulate_0 (i.e. the zero-th example > of a type that requires emulation, to be followed by others in future) and > then add a hypercall along the lines of HVMOP_map_mem_type_to_ioreq_server > which will take an ioerq server id, a type and flags saying whether it wishes > to handle reads and/or writes to that type. > > Thoughts (anyone)? I think as a general idea also allowing reads to be intercepted is nice, but would incur quite a few changes which we don't currently have a user for. Hence I'd suggest making the public interface ready for that without actually implementing hypervisor support. Jan
> -----Original Message----- > From: Jan Beulich [mailto:JBeulich@suse.com] > Sent: 16 February 2016 10:34 > To: Paul Durrant > Cc: Andrew Cooper; George Dunlap; Ian Campbell; Ian Jackson; Stefano > Stabellini; Wei Liu; Kevin Tian; Zhiyuan Lv; Zhang Yu; xen-devel@lists.xen.org; > George Dunlap; Keir (Xen.org) > Subject: RE: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter > max_wp_ram_ranges. > > >>> On 16.02.16 at 09:50, <Paul.Durrant@citrix.com> wrote: > >> -----Original Message----- > >> From: Tian, Kevin [mailto:kevin.tian@intel.com] > >> Sent: 16 February 2016 07:23 > >> To: Paul Durrant; George Dunlap > >> Cc: Jan Beulich; George Dunlap; Wei Liu; Ian Campbell; Andrew Cooper; > >> Zhang Yu; xen-devel@lists.xen.org; Stefano Stabellini; Lv, Zhiyuan; Ian > >> Jackson; Keir (Xen.org) > >> Subject: RE: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter > >> max_wp_ram_ranges. > >> > >> > From: Paul Durrant [mailto:Paul.Durrant@citrix.com] > >> > Sent: Friday, February 05, 2016 7:24 PM > >> > > >> > > -----Original Message----- > >> > > From: dunlapg@gmail.com [mailto:dunlapg@gmail.com] On Behalf Of > >> > > George Dunlap > >> > > Sent: 05 February 2016 11:14 > >> > > To: Paul Durrant > >> > > Cc: Jan Beulich; George Dunlap; Kevin Tian; Wei Liu; Ian Campbell; > >> Andrew > >> > > Cooper; Zhang Yu; xen-devel@lists.xen.org; Stefano Stabellini; > >> > > zhiyuan.lv@intel.com; Ian Jackson; Keir (Xen.org) > >> > > Subject: Re: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter > >> > > max_wp_ram_ranges. > >> > > > >> > > On Fri, Feb 5, 2016 at 9:24 AM, Paul Durrant > <Paul.Durrant@citrix.com> > >> > > wrote: > >> > > > Utilizing the default server is a backwards step. GVT-g would have to > >> use > >> > > the old HVM_PARAM mechanism to cause it's emulator to become > >> default. 
I > >> > > think a more appropriate mechanism would be p2m_mmio_write_dm > to > >> > > become something like 'p2m_ioreq_server_write' and then have a > >> hypercall > >> > > to allow it to be mapped to a particular ioreq server. > >> > > > Obviously only one could claim it but, with a p2t, the bit could be re- > >> > > purposed to simply mean 'go look in the p2t' for more information and > >> then > >> > > the p2t could be structured to allow emulations to be steered to one > of > >> many > >> > > ioreq servers (for read and/or write emulation). > >> > > > >> > > Right; I had in mind that Xen would allow at any given time a max of N > >> > > ioreq servers to register for mmio_write_dm ranges, first-come > >> > > first-served; with 'N' being '1' to begin with. If a second ioreq > >> > > server requested mmio_write_dm functionality, it would get -EBUSY. > >> > > This would allow their current setup (one qemu dm which doesn't do > >> > > mmio_write_dm, one xengt dm which does) to work without needing > to > >> > > worry any more about how many pages might need to be tracked > (either > >> > > for efficiency or correctness). > >> > > > >> > > We could then extend this to some larger number (4 seems pretty > >> > > reasonable to me) either by adding an extra 3 types, or by some other > >> > > method (such as the one Paul suggests). > >> > > >> > I think it would be best to do away with the 'write dm' name though. I > >> would like to see it > >> > be possible to steer reads+writes, as well as writes (and maybe just > > reads?) > >> to a particular > >> > ioreq server based on type information. So maybe we just call the > existing > >> type > >> > 'p2m_ioreq_server' and then, in the absence of a p2t, hardcode this to > go > >> to whichever > >> > emulator makes the new TBD hypercall. > >> > I think we need a proper design at this point. Given that it's Chinese > New > >> Year maybe I'll > >> > have a stab in Yu's absence. 
> >> > > >> > >> Hi, Paul, what about your progress on this? > >> > >> My feeling is that we do not need a new hypercall to explicitly claim > >> whether a ioreq server wants to handle write requests. It can be > >> implicitly marked upon whether a specific page is requested for > >> write-protection through a specific ioreq channel, and then that > >> ioreq server will claim the attribute automatically. > > > > Hi Kevin, > > > > Is there a hypercall to do that? Maybe I'm missing something but I was > under > > the impression that the only way to set write protection was via an > > HVMOP_set_mem_type and that does not carry an ioreq server id. > > > > I'm afraid I have made little progress due to the distractions of trying get > > some patches into Linux but my thoughts are around replacing the > > HVM_mmio_write_dm with something like HVM_emulate_0 (i.e. the zero- > th example > > of a type that requires emulation, to be followed by others in future) and > > then add a hypercall along the lines of > HVMOP_map_mem_type_to_ioreq_server > > which will take an ioerq server id, a type and flags saying whether it wishes > > to handle reads and/or writes to that type. > > > > Thoughts (anyone)? > > I think as a general idea also allowing reads to be intercepted is > nice, but would incur quite a few changes which we don't currently > have a user for. Hence I'd suggest making the public interface > ready for that without actually implementing hypervisor support. > Well, we need some form of hypervisor support to replace what's already there. I'd envisaged that setting HVM_emulate_0 type on a page would mean nothing until an ioreq server claims it (i.e. it stays as r/w RAM) but when the ioreq server makes the claim the EPT is changed according to whether reads and/or writes are wanted and then the fault handler steers transactions to the (single at the moment) ioreq server. I'll need to code up a PoC to make sure I'm not barking up the wrong tree though. Paul > Jan
> From: Paul Durrant [mailto:Paul.Durrant@citrix.com] > Sent: Tuesday, February 16, 2016 7:11 PM > > > -----Original Message----- > > From: Jan Beulich [mailto:JBeulich@suse.com] > > Sent: 16 February 2016 10:34 > > To: Paul Durrant > > Cc: Andrew Cooper; George Dunlap; Ian Campbell; Ian Jackson; Stefano > > Stabellini; Wei Liu; Kevin Tian; Zhiyuan Lv; Zhang Yu; xen-devel@lists.xen.org; > > George Dunlap; Keir (Xen.org) > > Subject: RE: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter > > max_wp_ram_ranges. > > > > >>> On 16.02.16 at 09:50, <Paul.Durrant@citrix.com> wrote: > > >> -----Original Message----- > > >> From: Tian, Kevin [mailto:kevin.tian@intel.com] > > >> Sent: 16 February 2016 07:23 > > >> To: Paul Durrant; George Dunlap > > >> Cc: Jan Beulich; George Dunlap; Wei Liu; Ian Campbell; Andrew Cooper; > > >> Zhang Yu; xen-devel@lists.xen.org; Stefano Stabellini; Lv, Zhiyuan; Ian > > >> Jackson; Keir (Xen.org) > > >> Subject: RE: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter > > >> max_wp_ram_ranges. > > >> > > >> > From: Paul Durrant [mailto:Paul.Durrant@citrix.com] > > >> > Sent: Friday, February 05, 2016 7:24 PM > > >> > > > >> > > -----Original Message----- > > >> > > From: dunlapg@gmail.com [mailto:dunlapg@gmail.com] On Behalf Of > > >> > > George Dunlap > > >> > > Sent: 05 February 2016 11:14 > > >> > > To: Paul Durrant > > >> > > Cc: Jan Beulich; George Dunlap; Kevin Tian; Wei Liu; Ian Campbell; > > >> Andrew > > >> > > Cooper; Zhang Yu; xen-devel@lists.xen.org; Stefano Stabellini; > > >> > > zhiyuan.lv@intel.com; Ian Jackson; Keir (Xen.org) > > >> > > Subject: Re: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter > > >> > > max_wp_ram_ranges. > > >> > > > > >> > > On Fri, Feb 5, 2016 at 9:24 AM, Paul Durrant > > <Paul.Durrant@citrix.com> > > >> > > wrote: > > >> > > > Utilizing the default server is a backwards step. GVT-g would have to > > >> use > > >> > > the old HVM_PARAM mechanism to cause it's emulator to become > > >> default. 
I > > >> > > think a more appropriate mechanism would be p2m_mmio_write_dm > > to > > >> > > become something like 'p2m_ioreq_server_write' and then have a > > >> hypercall > > >> > > to allow it to be mapped to a particular ioreq server. > > >> > > > Obviously only one could claim it but, with a p2t, the bit could be re- > > >> > > purposed to simply mean 'go look in the p2t' for more information and > > >> then > > >> > > the p2t could be structured to allow emulations to be steered to one > > of > > >> many > > >> > > ioreq servers (for read and/or write emulation). > > >> > > > > >> > > Right; I had in mind that Xen would allow at any given time a max of N > > >> > > ioreq servers to register for mmio_write_dm ranges, first-come > > >> > > first-served; with 'N' being '1' to begin with. If a second ioreq > > >> > > server requested mmio_write_dm functionality, it would get -EBUSY. > > >> > > This would allow their current setup (one qemu dm which doesn't do > > >> > > mmio_write_dm, one xengt dm which does) to work without needing > > to > > >> > > worry any more about how many pages might need to be tracked > > (either > > >> > > for efficiency or correctness). > > >> > > > > >> > > We could then extend this to some larger number (4 seems pretty > > >> > > reasonable to me) either by adding an extra 3 types, or by some other > > >> > > method (such as the one Paul suggests). > > >> > > > >> > I think it would be best to do away with the 'write dm' name though. I > > >> would like to see it > > >> > be possible to steer reads+writes, as well as writes (and maybe just > > > reads?) > > >> to a particular > > >> > ioreq server based on type information. So maybe we just call the > > existing > > >> type > > >> > 'p2m_ioreq_server' and then, in the absence of a p2t, hardcode this to > > go > > >> to whichever > > >> > emulator makes the new TBD hypercall. > > >> > I think we need a proper design at this point. 
Given that it's Chinese > > New > > >> Year maybe I'll > > >> > have a stab in Yu's absence. > > >> > > > >> > > >> Hi, Paul, what about your progress on this? > > >> > > >> My feeling is that we do not need a new hypercall to explicitly claim > > >> whether a ioreq server wants to handle write requests. It can be > > >> implicitly marked upon whether a specific page is requested for > > >> write-protection through a specific ioreq channel, and then that > > >> ioreq server will claim the attribute automatically. > > > > > > Hi Kevin, > > > > > > Is there a hypercall to do that? Maybe I'm missing something but I was > > under > > > the impression that the only way to set write protection was via an > > > HVMOP_set_mem_type and that does not carry an ioreq server id. Thanks for clarification. I've got a bit messed on current state before. :-) > > > > > > I'm afraid I have made little progress due to the distractions of trying get > > > some patches into Linux but my thoughts are around replacing the > > > HVM_mmio_write_dm with something like HVM_emulate_0 (i.e. the zero- > > th example > > > of a type that requires emulation, to be followed by others in future) and > > > then add a hypercall along the lines of > > HVMOP_map_mem_type_to_ioreq_server > > > which will take an ioerq server id, a type and flags saying whether it wishes > > > to handle reads and/or writes to that type. > > > > > > Thoughts (anyone)? > > > > I think as a general idea also allowing reads to be intercepted is > > nice, but would incur quite a few changes which we don't currently > > have a user for. Hence I'd suggest making the public interface > > ready for that without actually implementing hypervisor support. > > > > Well, we need some form a hypervisor support to replace what's already there. 
> > I'd envisaged that setting HVM_emulate_0 type on a page would mean nothing until an for "mean nothing" what is the default policy then if the guest happens to access it before any ioreq server claims it? > ioreq server claims it (i.e. it stays as r/w RAM) but when the ioreq server makes the claim > the EPT is changed according to whether reads and/or writes are wanted and then the fault > handler steers transactions to the (single at the moment) ioreq server. I'll need to code up > a PoC to make sure I'm not barking up the wrong tree though. > Curious: any reason why we must have a HVM_emulate_0 placeholder first, and why we can't allow an ioreq server to claim any existing type? Thinking about XenGT usage, I cannot envisage when a page should be set to HVM_emulate_0 first. The write-protection operation is dynamic according to guest page table operations, upon which we'll directly jump to the claim phase... btw does this design consider the case where multiple ioreq servers may claim the same page? For example, different usages may both want to capture write requests on the same set of pages (say XenGT selectively write-protects a subset of pages due to shadow GTT, while another agent wants to monitor all guest writes to any guest memory page). Thanks Kevin
> -----Original Message----- [snip] > > > > > > > > I'm afraid I have made little progress due to the distractions of trying > get > > > > some patches into Linux but my thoughts are around replacing the > > > > HVM_mmio_write_dm with something like HVM_emulate_0 (i.e. the > zero- > > > th example > > > > of a type that requires emulation, to be followed by others in future) > and > > > > then add a hypercall along the lines of > > > HVMOP_map_mem_type_to_ioreq_server > > > > which will take an ioerq server id, a type and flags saying whether it > wishes > > > > to handle reads and/or writes to that type. > > > > > > > > Thoughts (anyone)? > > > > > > I think as a general idea also allowing reads to be intercepted is > > > nice, but would incur quite a few changes which we don't currently > > > have a user for. Hence I'd suggest making the public interface > > > ready for that without actually implementing hypervisor support. > > > > > > > > Well, we need some form a hypervisor support to replace what's already > there. > > > > I'd envisaged that setting HVM_emulate_0 type on a page would mean > nothing until an > > for "mean nothing" what is the default policy then if guest happens to access > it > before any ioreq server claims it? > My thoughts were that, since no specific emulation has yet been requested (because no ioreq server has yet claimed it), the default policy is to treat it as r/w RAM as I said below. This is because I think the only legal type transitions should be from HVMMEM_ram_rw to HVMMEM_emulate_0 and back again. > > ioreq server claims it (i.e. it stays as r/w RAM) but when the ioreq server > makes the claim > > the EPT is changed according to whether reads and/or writes are wanted > and then the fault > > handler steers transactions to the (single at the moment) ioreq server. I'll > need to code up > a PoC to make sure I'm not barking up the wrong tree though. 
> > > > Curious any reason why we must have a HVM_emulate_0 placeholder > first and why we can't allow ioreq server to claim on any existing type? Which type were you thinking of? Remember that the ioreq server would be claiming *all* pages of that type. > Thinking about XenGT usage, I cannot envisage when a page should > be set to HVM_emulate_0 first. The write-protection operation is dynamic > according to guest page table operations, upon which we'll directly jump > to claim phase... I don't imagine that things would happen that way round in the common case. For XenGT I'd expect the ioreq server to immediately claim HVMMEM_emulate_0 and then set that type on any page that it wants to trap accesses on (which means that I am assuming that the same emulation - i.e. write accesses only - is desired for all pages... but I think that is indeed the case). > > btw does this design consider the case where multiple ioreq servers > may claim on same page? Yes it does and there are currently insufficient page types to allow any more than a single ioreq server to claim a type. My plan is that, in future, we can add a p2t mapping table to allow for more types and then introduce HVMMEM_ioreq_1, HVMMEM_ioreq_2, etc. > For example, different usages may both > want to capture write requests on the same set of pages (say XenGT > selectively write-protects a subset of pages due to shadow GTT, while > another agent wants to monitor all guest writes to any guest memory > page). Monitoring is a different thing altogether. Emulation is costly and not something you'd want to use for that purpose. If you want to monitor writes then log-dirty already exists for that purpose. > > Thanks > Kevin I hope my explanation helped. I think things will be clearer once I've had chance to actually put together a design doc. and hack up a PoC (probably only for EPT at first). Paul
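[Editor's note] The semantics Paul describes in this exchange (HVMMEM_emulate_0 behaving as plain r/w RAM until an ioreq server claims the type, write faults then steered to that server, and HVMMEM_ram_rw to HVMMEM_emulate_0 the only legal transitions) can be sketched as follows. All identifiers are placeholders for a design that had not yet been written.

```c
/* Sketch of the claim/steering semantics discussed above; the enum
 * values, owner variable, and function names are all hypothetical. */
#include <assert.h>
#include <stdbool.h>

enum mem_type    { HVMMEM_ram_rw, HVMMEM_emulate_0 };
enum write_action { WRITE_TO_RAM, FORWARD_TO_IOREQ_SERVER };

static int emulate_0_owner = -1;    /* ioreq server that claimed the type */

/* Only ram_rw <-> emulate_0 transitions are legal, per the agreement
 * between Paul and Jan above. */
static bool set_mem_type(enum mem_type *cur, enum mem_type to)
{
    if ((*cur == HVMMEM_ram_rw && to == HVMMEM_emulate_0) ||
        (*cur == HVMMEM_emulate_0 && to == HVMMEM_ram_rw) ||
        *cur == to) {
        *cur = to;
        return true;
    }
    return false;
}

/* Until an ioreq server claims the type, emulate_0 pages behave as
 * ordinary RAM; afterwards, write faults are steered to the claimant. */
static enum write_action handle_write_fault(enum mem_type t)
{
    if (t == HVMMEM_emulate_0 && emulate_0_owner >= 0)
        return FORWARD_TO_IOREQ_SERVER;
    return WRITE_TO_RAM;
}
```

This also answers Kevin's "what is the default policy" question mechanically: before any claim, `handle_write_fault()` falls through to the RAM path.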
>>> On 17.02.16 at 09:58, <Paul.Durrant@citrix.com> wrote: >> > I'd envisaged that setting HVM_emulate_0 type on a page would mean >> nothing until an >> >> for "mean nothing" what is the default policy then if guest happens to access >> it >> before any ioreq server claims it? >> > > My thoughts were that, since no specific emulation has yet been requested > (because no ioreq server has yet claimed it) that the default policy is to > treat it as r/w RAM as I said below. This is because I think the only legal > type transitions should be from HVMMEM_ram_rw to HVMMEM_emulate_0 and back > again. +1 Jan
> From: Paul Durrant [mailto:Paul.Durrant@citrix.com] > Sent: Wednesday, February 17, 2016 4:58 PM > > > > > btw does this design consider the case where multiple ioreq servers > > may claim on same page? > > Yes it does and there are currently insufficient page types to allow any more than a single > ioreq server to claim a type. My plan is that, in future, we can add a p2t mapping table to > allow for more types and then introduce HVMMEM_ioreq_1, HVMMEM_ioreq_2, etc. so these new types actually represent ioreq server ID, right? If yes I can then understand your earlier explanations. > > > For example, different usages may both > > want to capture write requests on the same set of pages (say XenGT > > selectively write-protects a subset of pages due to shadow GTT, while > > another agent wants to monitor all guest writes to any guest memory > > page). > > Monitoring is a different thing altogether. Emulation is costly and not something you'd want > to use for that purpose. If you want to monitor writes then log-dirty already exists for that > purpose. Agree. > > > > > Thanks > > Kevin > > I hope my explanation helped. I think things will be clearer once I've had chance to actually > put together a design doc. and hack up a PoC (probably only for EPT at first). > Thanks for the help. Let's see whether we can have some solution ready for 4.7. :-) Thanks Kevin
> -----Original Message----- > From: Tian, Kevin [mailto:kevin.tian@intel.com] > Sent: 17 February 2016 09:58 > To: Paul Durrant; Jan Beulich > Cc: Andrew Cooper; George Dunlap; Ian Campbell; Ian Jackson; Stefano > Stabellini; Wei Liu; Lv, Zhiyuan; Zhang Yu; xen-devel@lists.xen.org; George > Dunlap; Keir (Xen.org) > Subject: RE: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter > max_wp_ram_ranges. > > > From: Paul Durrant [mailto:Paul.Durrant@citrix.com] > > Sent: Wednesday, February 17, 2016 4:58 PM > > > > > > > > btw does this design consider the case where multiple ioreq servers > > > may claim on same page? > > > > Yes it does and there are currently insufficient page types to allow any > more than a single > > ioreq server to claim a type. My plan is that, in future, we can add a p2t > mapping table to > > allow for more types and then introduce HVMMEM_ioreq_1, > HVMMEM_ioreq_2, etc. > > so these new types actually represent ioreq server ID, right? If yes I can then > understand your earlier explanations. Well, not quite. Each of these types can be steered to exactly one ioreq server but there will be a new hypercall to control the mapping. The design doc will make it clear. Paul > > > > > > For example, different usages may both > > > want to capture write requests on the same set of pages (say XenGT > > > selectively write-protects a subset of pages due to shadow GTT, while > > > another agent wants to monitor all guest writes to any guest memory > > > page). > > > > Monitoring is a different thing altogether. Emulation is costly and not > something you'd want > > to use for that purpose. If you want to monitor writes then log-dirty > already exists for that > > purpose. > > Agree. > > > > > > > > > Thanks > > > Kevin > > > > I hope my explanation helped. I think things will be clearer once I've had > chance to actually > > put together a design doc. and hack up a PoC (probably only for EPT at > first). > > > > Thanks for the help. 
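Paul's point — each HVMMEM_ioreq_<n> type can be steered to exactly one ioreq server, with a new hypercall controlling the mapping — amounts to a small type-to-server table. The data layout, sizes, and ids below are hypothetical (the real interface was left to the design doc):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical sketch: each HVMMEM_ioreq_<n> type maps to at most one
 * ioreq server, and a control operation updates the mapping. */
#define NR_IOREQ_TYPES 2   /* e.g. HVMMEM_ioreq_1, HVMMEM_ioreq_2 */
#define NO_SERVER      0   /* 0 = type not claimed by any server */

static uint16_t type_to_server[NR_IOREQ_TYPES];

/* Claim a type for a server; fails if another server already owns it. */
static bool map_type_to_server(unsigned int type, uint16_t server_id)
{
    if (type >= NR_IOREQ_TYPES || type_to_server[type] != NO_SERVER)
        return false;
    type_to_server[type] = server_id;
    return true;
}

/* Steering decision on a trapped access: O(1), no range lookup. */
static uint16_t steer(unsigned int type)
{
    return type < NR_IOREQ_TYPES ? type_to_server[type] : NO_SERVER;
}
```

Note how the claim step enforces the "at most one server per type" rule that distinguishes this from a generic server-id lookup.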
> Let's see whether we can have some solution ready for
> 4.7. :-)
>
> Thanks
> Kevin
>>> On 17.02.16 at 10:58, <kevin.tian@intel.com> wrote: > Thanks for the help. Let's see whether we can have some solution ready for > 4.7. :-) Since we now seem to all agree that a different approach is going to be taken, I think we indeed should revert f5a32c5b8e ("x86/HVM: differentiate IO/mem resources tracked by ioreq server"). Please voice objections to the plan pretty soon. Jan
> -----Original Message----- > From: Jan Beulich [mailto:JBeulich@suse.com] > Sent: 17 February 2016 10:22 > To: Paul Durrant; Kevin Tian; Zhang Yu > Cc: Andrew Cooper; George Dunlap; Ian Campbell; Ian Jackson; Stefano > Stabellini; Wei Liu; Zhiyuan Lv; xen-devel@lists.xen.org; George Dunlap; Keir > (Xen.org) > Subject: RE: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter > max_wp_ram_ranges. > > >>> On 17.02.16 at 10:58, <kevin.tian@intel.com> wrote: > > Thanks for the help. Let's see whether we can have some solution ready > for > > 4.7. :-) > > Since we now seem to all agree that a different approach is going to > be taken, I think we indeed should revert f5a32c5b8e ("x86/HVM: > differentiate IO/mem resources tracked by ioreq server"). Please > voice objections to the plan pretty soon. > I'm happy with reversion at this point. Paul
> From: Paul Durrant [mailto:Paul.Durrant@citrix.com] > Sent: Wednesday, February 17, 2016 6:24 PM > > > -----Original Message----- > > From: Jan Beulich [mailto:JBeulich@suse.com] > > Sent: 17 February 2016 10:22 > > To: Paul Durrant; Kevin Tian; Zhang Yu > > Cc: Andrew Cooper; George Dunlap; Ian Campbell; Ian Jackson; Stefano > > Stabellini; Wei Liu; Zhiyuan Lv; xen-devel@lists.xen.org; George Dunlap; Keir > > (Xen.org) > > Subject: RE: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter > > max_wp_ram_ranges. > > > > >>> On 17.02.16 at 10:58, <kevin.tian@intel.com> wrote: > > > Thanks for the help. Let's see whether we can have some solution ready > > for > > > 4.7. :-) > > > > Since we now seem to all agree that a different approach is going to > > be taken, I think we indeed should revert f5a32c5b8e ("x86/HVM: > > differentiate IO/mem resources tracked by ioreq server"). Please > > voice objections to the plan pretty soon. > > > > I'm happy with reversion at this point. > Agree. Thanks Kevin
On 17/02/16 10:22, Jan Beulich wrote: >>>> On 17.02.16 at 10:58, <kevin.tian@intel.com> wrote: >> Thanks for the help. Let's see whether we can have some solution ready for >> 4.7. :-) > > Since we now seem to all agree that a different approach is going to > be taken, I think we indeed should revert f5a32c5b8e ("x86/HVM: > differentiate IO/mem resources tracked by ioreq server"). Please > voice objections to the plan pretty soon. FWIW, after this discussion, I don't have an objection to the basic interface in this series as it is, since it addresses my request that it be memory-based, and it could be switched to using p2m types behind-the-scenes -- with the exception of the knob to control how many ranges are allowed (since it exposes the internal implementation). I wouldn't object to the current implementation going in as it was in v9 (apparently), and then changing the type stuff around behind the scenes later as an optimization. I also don't think it would be terribly difficult to change the implementation as it is to just use write_dm for a single IOREQ server. We can rename it ioreq_server and expand it later. Sorry if this wasn't clear from my comments before. A new interface that's been carefully thought through would of course be nice too. -George
> -----Original Message----- > From: George Dunlap [mailto:george.dunlap@citrix.com] > Sent: 17 February 2016 11:02 > To: Jan Beulich; Paul Durrant; Kevin Tian; Zhang Yu > Cc: Andrew Cooper; Ian Campbell; Ian Jackson; Stefano Stabellini; Wei Liu; > Zhiyuan Lv; xen-devel@lists.xen.org; George Dunlap; Keir (Xen.org) > Subject: Re: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter > max_wp_ram_ranges. > > On 17/02/16 10:22, Jan Beulich wrote: > >>>> On 17.02.16 at 10:58, <kevin.tian@intel.com> wrote: > >> Thanks for the help. Let's see whether we can have some solution ready > for > >> 4.7. :-) > > > > Since we now seem to all agree that a different approach is going to > > be taken, I think we indeed should revert f5a32c5b8e ("x86/HVM: > > differentiate IO/mem resources tracked by ioreq server"). Please > > voice objections to the plan pretty soon. > > FWIW, after this discussion, I don't have an objection to the basic > interface in this series as it is, since it addresses my request that it > be memory-based, and it could be switched to using p2m types > behind-the-scenes -- with the exception of the knob to control how many > ranges are allowed (since it exposes the internal implementation). > > I wouldn't object to the current implementation going in as it was in v9 > (apparently), and then changing the type stuff around behind the scenes > later as an optimization. I also don't think it would be terribly > difficult to change the implementation as it is to just use write_dm for > a single IOREQ server. We can rename it ioreq_server and expand it > later. Sorry if this wasn't clear from my comments before. > The problem is that, since you objected to wp_mem being treated the same as mmio, we had to introduce a new range type to the map/unmap hypercalls and that will become redundant if we steer by type. If we could have just increased the size of the rangeset for mmio and used that *for now* then weeks of work could have been saved going down this dead end. 
Paul > A new interface that's been carefully thought through would of course be > nice too. > > -George
On Wed, Feb 17, 2016 at 11:12 AM, Paul Durrant <Paul.Durrant@citrix.com> wrote: >> -----Original Message----- >> From: George Dunlap [mailto:george.dunlap@citrix.com] >> Sent: 17 February 2016 11:02 >> To: Jan Beulich; Paul Durrant; Kevin Tian; Zhang Yu >> Cc: Andrew Cooper; Ian Campbell; Ian Jackson; Stefano Stabellini; Wei Liu; >> Zhiyuan Lv; xen-devel@lists.xen.org; George Dunlap; Keir (Xen.org) >> Subject: Re: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter >> max_wp_ram_ranges. >> >> On 17/02/16 10:22, Jan Beulich wrote: >> >>>> On 17.02.16 at 10:58, <kevin.tian@intel.com> wrote: >> >> Thanks for the help. Let's see whether we can have some solution ready >> for >> >> 4.7. :-) >> > >> > Since we now seem to all agree that a different approach is going to >> > be taken, I think we indeed should revert f5a32c5b8e ("x86/HVM: >> > differentiate IO/mem resources tracked by ioreq server"). Please >> > voice objections to the plan pretty soon. >> >> FWIW, after this discussion, I don't have an objection to the basic >> interface in this series as it is, since it addresses my request that it >> be memory-based, and it could be switched to using p2m types >> behind-the-scenes -- with the exception of the knob to control how many >> ranges are allowed (since it exposes the internal implementation). >> >> I wouldn't object to the current implementation going in as it was in v9 >> (apparently), and then changing the type stuff around behind the scenes >> later as an optimization. I also don't think it would be terribly >> difficult to change the implementation as it is to just use write_dm for >> a single IOREQ server. We can rename it ioreq_server and expand it >> later. Sorry if this wasn't clear from my comments before. >> > > The problem is that, since you objected to wp_mem being treated the same as mmio, we had to introduce a new range type to the map/unmap hypercalls and that will become redundant if we steer by type. 
> If we could have just increased the size of the rangeset for mmio and used that *for now* then weeks of work could have been saved going down this dead end.

So first of all, let me say that I handled things with respect to this
thread rather poorly in a lot of ways, and you're right to be
frustrated. All I can say is, I'm sorry and I'll try to do better.

I'm not sure I understand what you're saying with this line: "We had
to introduce a new range type to the map/unmap hypercalls that will
become redundant if we steer by type".

The interface you have here simply says, "Please allow the guest to
read from these gpfns, but give me (an ioreq server) a notification if
the guest attempts to write to them." One way to implement this is
the way you have done -- by taking the existing (basically unused)
write_dm p2m type, and layering the existing MMIO rangesets on top of
it.

Another way you could have (and AFAICS still can) implemented
this interface is by allowing ioreq servers asking for write
intercepts on gpfns to claim individual p2m types set aside for that
purpose -- starting with the existing write_dm, which would only allow
one such server to register, and later perhaps extending to allow
more.

In what way would the second implementation make the new type for the
map/unmap hypercalls redundant?

 -George
> -----Original Message----- > From: dunlapg@gmail.com [mailto:dunlapg@gmail.com] On Behalf Of > George Dunlap > Sent: 22 February 2016 15:56 > To: Paul Durrant > Cc: George Dunlap; Jan Beulich; Kevin Tian; Zhang Yu; Wei Liu; Ian Campbell; > Andrew Cooper; xen-devel@lists.xen.org; Stefano Stabellini; Zhiyuan Lv; Ian > Jackson; Keir (Xen.org) > Subject: Re: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter > max_wp_ram_ranges. > > On Wed, Feb 17, 2016 at 11:12 AM, Paul Durrant <Paul.Durrant@citrix.com> > wrote: > >> -----Original Message----- > >> From: George Dunlap [mailto:george.dunlap@citrix.com] > >> Sent: 17 February 2016 11:02 > >> To: Jan Beulich; Paul Durrant; Kevin Tian; Zhang Yu > >> Cc: Andrew Cooper; Ian Campbell; Ian Jackson; Stefano Stabellini; Wei Liu; > >> Zhiyuan Lv; xen-devel@lists.xen.org; George Dunlap; Keir (Xen.org) > >> Subject: Re: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter > >> max_wp_ram_ranges. > >> > >> On 17/02/16 10:22, Jan Beulich wrote: > >> >>>> On 17.02.16 at 10:58, <kevin.tian@intel.com> wrote: > >> >> Thanks for the help. Let's see whether we can have some solution > ready > >> for > >> >> 4.7. :-) > >> > > >> > Since we now seem to all agree that a different approach is going to > >> > be taken, I think we indeed should revert f5a32c5b8e ("x86/HVM: > >> > differentiate IO/mem resources tracked by ioreq server"). Please > >> > voice objections to the plan pretty soon. > >> > >> FWIW, after this discussion, I don't have an objection to the basic > >> interface in this series as it is, since it addresses my request that it > >> be memory-based, and it could be switched to using p2m types > >> behind-the-scenes -- with the exception of the knob to control how many > >> ranges are allowed (since it exposes the internal implementation). 
> >>
> >> I wouldn't object to the current implementation going in as it was in v9
> >> (apparently), and then changing the type stuff around behind the scenes
> >> later as an optimization. I also don't think it would be terribly
> >> difficult to change the implementation as it is to just use write_dm for
> >> a single IOREQ server. We can rename it ioreq_server and expand it
> >> later. Sorry if this wasn't clear from my comments before.
> >>
> >
> > The problem is that, since you objected to wp_mem being treated the
> same as mmio, we had to introduce a new range type to the map/unmap
> hypercalls and that will become redundant if we steer by type. If we could
> have just increased the size of the rangeset for mmio and used that *for
> now* then weeks of work could have been saved going down this dead end.
>
> So first of all, let me say that I handled things respect to this
> thread rather poorly in a lot of ways, and you're right to be
> frustrated. All I can say is, I'm sorry and I'll try to do better.
>
> I'm not sure I understand what you're saying with this line: "We had
> to introduce a new range type to the map/unmap hypercalls that will
> become redundant if we steer by type".
>

What I mean is that if we are going to do away with the concept of 'wp mem' and replace it with 'special type for steering to ioreq server' then we only need range types of pci, portio and mmio as we had before.

> The interface you have here simply says, "Please allow the guest to
> read from these gpfns, but give me (an ioreq server) a notification if
> the guest attempts to write to them." One way to implement this is
> the way you have done -- by taking the existing (basically unused)
> write_dm p2m type, and layering the existing MMIO rangesets on top of
> it.

That type was introduced for this purpose; it was not there before. 
> Another way you could have (and AFAICS still can) implemented > this interface is by allowing ioreq servers asking for write > intercepts on gpfns to claim individual p2m types set aside for that > purpose -- starting with the existing write_dm, which would only allow > one such server to register, and later perhaps extending to allow > more. > > In what way would the second implementation make the new type for the > map/unmap hypercalls redundant? > Because that type is going to go away, now that it has no use. Paul > -George
On 22/02/16 16:02, Paul Durrant wrote: >> -----Original Message----- >> From: dunlapg@gmail.com [mailto:dunlapg@gmail.com] On Behalf Of >> George Dunlap >> Sent: 22 February 2016 15:56 >> To: Paul Durrant >> Cc: George Dunlap; Jan Beulich; Kevin Tian; Zhang Yu; Wei Liu; Ian Campbell; >> Andrew Cooper; xen-devel@lists.xen.org; Stefano Stabellini; Zhiyuan Lv; Ian >> Jackson; Keir (Xen.org) >> Subject: Re: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter >> max_wp_ram_ranges. >> >> On Wed, Feb 17, 2016 at 11:12 AM, Paul Durrant <Paul.Durrant@citrix.com> >> wrote: >>>> -----Original Message----- >>>> From: George Dunlap [mailto:george.dunlap@citrix.com] >>>> Sent: 17 February 2016 11:02 >>>> To: Jan Beulich; Paul Durrant; Kevin Tian; Zhang Yu >>>> Cc: Andrew Cooper; Ian Campbell; Ian Jackson; Stefano Stabellini; Wei Liu; >>>> Zhiyuan Lv; xen-devel@lists.xen.org; George Dunlap; Keir (Xen.org) >>>> Subject: Re: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter >>>> max_wp_ram_ranges. >>>> >>>> On 17/02/16 10:22, Jan Beulich wrote: >>>>>>>> On 17.02.16 at 10:58, <kevin.tian@intel.com> wrote: >>>>>> Thanks for the help. Let's see whether we can have some solution >> ready >>>> for >>>>>> 4.7. :-) >>>>> >>>>> Since we now seem to all agree that a different approach is going to >>>>> be taken, I think we indeed should revert f5a32c5b8e ("x86/HVM: >>>>> differentiate IO/mem resources tracked by ioreq server"). Please >>>>> voice objections to the plan pretty soon. >>>> >>>> FWIW, after this discussion, I don't have an objection to the basic >>>> interface in this series as it is, since it addresses my request that it >>>> be memory-based, and it could be switched to using p2m types >>>> behind-the-scenes -- with the exception of the knob to control how many >>>> ranges are allowed (since it exposes the internal implementation). 
>>>> >>>> I wouldn't object to the current implementation going in as it was in v9 >>>> (apparently), and then changing the type stuff around behind the scenes >>>> later as an optimization. I also don't think it would be terribly >>>> difficult to change the implementation as it is to just use write_dm for >>>> a single IOREQ server. We can rename it ioreq_server and expand it >>>> later. Sorry if this wasn't clear from my comments before. >>>> >>> >>> The problem is that, since you objected to wp_mem being treated the >> same as mmio, we had to introduce a new range type to the map/unmap >> hypercalls and that will become redundant if we steer by type. If we could >> have just increased the size of the rangeset for mmio and used that *for >> now* then weeks of work could have been saved going down this dead end. >> >> So first of all, let me say that I handled things respect to this >> thread rather poorly in a lot of ways, and you're right to be >> frustrated. All I can say is, I'm sorry and I'll try to do better. >> >> I'm not sure I understand what you're saying with this line: "We had >> to introduce a new range type to the map/unmap hypercalls that will >> become redundant if we steer by type". >> > > What I means is that if we are going to do away with the concept of 'wp mem' and replace it with 'special type for steering to ioreq server' then we only need range types of pci, portio and mmio as we had before. I'm still confused. To me, there are two distinct things here: the interface and the implementation. From an interface perspective, ioreq servers need to be able to ask Xen to intercept writes to specific memory regions. I insisted that this interface be specifically designed for memory, not calling them MMIO regions (since it's actually memory and not MMIO regions); and I said that the explosion in number of required rangesets demonstrated that rangesets were a bad fit to implement this interface. 
What you did in an earlier version of this series (correct me if I'm
wrong) is to make a separate hypercall for memory, but still keep using
the same internal implementation (i.e., still having a write_dm p2m type
and using rangesets to determine which ioreq server to give the
notification to). But since the interface for memory looks exactly the
same as the interface for MMIO, at some point this morphed back to "use
xen_hvm_io_range but with a different range type (i.e., WP_MEM)".

From an *interface* perspective that makes sense, because in both cases
you want to be able to specify a domain, an ioreq server, and a range of
physical addresses. I don't have any objection to the change you made
to hvm_op.h in this version of the series.

But from an *implementation* perspective, you're back to the very thing
I objected to -- having a large and potentially unbounded number of
rangesets, and baking that implementation into the interface by
requiring the administrator to specify the maximum number of rangesets
with a HVM_PARAM.

My suggestion was that you retain the same interface that you exposed in
hvm_op.h (adding WP_MEM), but instead of using rangesets to determine
which ioreq server to send the notification to, use p2m types instead.
The interface to qemu can be exactly the same as it is now, but you
won't need to increase the number of rangesets. In fact, if you only
allow one ioreq server to register for WP_MEM at a time, then you don't
even need to change the p2m code.

One way or another, you still want the ioreq server to be able to
intercept writes to guest memory, so I don't understand in what way "we
are going to do away with the concept of 'wp mem'". The only question
is whether it's going to go through xen_hvm_io_range or some other
hypercall -- and I don't particularly care which one it ends up being, as
long as it doesn't require setting this magic potentially unbounded
parameter.

 -George
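George's scalability concern above can be made concrete with a toy calculation (illustrative only — this is not Xen's rangeset implementation): when the write-protected gfns are scattered, as shadowed PPGTT pages typically are, adjacent ranges cannot coalesce, so a rangeset needs roughly one node per page, whereas a per-page p2m type needs no per-range allocation at all.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model: count how many contiguous ranges a sorted list of gfns
 * collapses into.  Fully scattered pages => one range node per page. */
static unsigned int ranges_needed(const uint64_t *gfns, unsigned int n)
{
    unsigned int ranges = 0;

    for (unsigned int i = 0; i < n; i++)
        if (i == 0 || gfns[i] != gfns[i - 1] + 1)
            ranges++;   /* gap before this gfn: a new range node */

    return ranges;
}
```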
> -----Original Message----- > From: George Dunlap [mailto:george.dunlap@citrix.com] > Sent: 22 February 2016 16:46 > To: Paul Durrant > Cc: Jan Beulich; Kevin Tian; Zhang Yu; Wei Liu; Ian Campbell; Andrew Cooper; > xen-devel@lists.xen.org; Stefano Stabellini; Zhiyuan Lv; Ian Jackson; Keir > (Xen.org) > Subject: Re: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter > max_wp_ram_ranges. > > On 22/02/16 16:02, Paul Durrant wrote: > >> -----Original Message----- > >> From: dunlapg@gmail.com [mailto:dunlapg@gmail.com] On Behalf Of > >> George Dunlap > >> Sent: 22 February 2016 15:56 > >> To: Paul Durrant > >> Cc: George Dunlap; Jan Beulich; Kevin Tian; Zhang Yu; Wei Liu; Ian > Campbell; > >> Andrew Cooper; xen-devel@lists.xen.org; Stefano Stabellini; Zhiyuan Lv; > Ian > >> Jackson; Keir (Xen.org) > >> Subject: Re: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter > >> max_wp_ram_ranges. > >> > >> On Wed, Feb 17, 2016 at 11:12 AM, Paul Durrant > <Paul.Durrant@citrix.com> > >> wrote: > >>>> -----Original Message----- > >>>> From: George Dunlap [mailto:george.dunlap@citrix.com] > >>>> Sent: 17 February 2016 11:02 > >>>> To: Jan Beulich; Paul Durrant; Kevin Tian; Zhang Yu > >>>> Cc: Andrew Cooper; Ian Campbell; Ian Jackson; Stefano Stabellini; Wei > Liu; > >>>> Zhiyuan Lv; xen-devel@lists.xen.org; George Dunlap; Keir (Xen.org) > >>>> Subject: Re: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter > >>>> max_wp_ram_ranges. > >>>> > >>>> On 17/02/16 10:22, Jan Beulich wrote: > >>>>>>>> On 17.02.16 at 10:58, <kevin.tian@intel.com> wrote: > >>>>>> Thanks for the help. Let's see whether we can have some solution > >> ready > >>>> for > >>>>>> 4.7. :-) > >>>>> > >>>>> Since we now seem to all agree that a different approach is going to > >>>>> be taken, I think we indeed should revert f5a32c5b8e ("x86/HVM: > >>>>> differentiate IO/mem resources tracked by ioreq server"). Please > >>>>> voice objections to the plan pretty soon. 
> >>>> > >>>> FWIW, after this discussion, I don't have an objection to the basic > >>>> interface in this series as it is, since it addresses my request that it > >>>> be memory-based, and it could be switched to using p2m types > >>>> behind-the-scenes -- with the exception of the knob to control how > many > >>>> ranges are allowed (since it exposes the internal implementation). > >>>> > >>>> I wouldn't object to the current implementation going in as it was in v9 > >>>> (apparently), and then changing the type stuff around behind the > scenes > >>>> later as an optimization. I also don't think it would be terribly > >>>> difficult to change the implementation as it is to just use write_dm for > >>>> a single IOREQ server. We can rename it ioreq_server and expand it > >>>> later. Sorry if this wasn't clear from my comments before. > >>>> > >>> > >>> The problem is that, since you objected to wp_mem being treated the > >> same as mmio, we had to introduce a new range type to the map/unmap > >> hypercalls and that will become redundant if we steer by type. If we could > >> have just increased the size of the rangeset for mmio and used that *for > >> now* then weeks of work could have been saved going down this dead > end. > >> > >> So first of all, let me say that I handled things respect to this > >> thread rather poorly in a lot of ways, and you're right to be > >> frustrated. All I can say is, I'm sorry and I'll try to do better. > >> > >> I'm not sure I understand what you're saying with this line: "We had > >> to introduce a new range type to the map/unmap hypercalls that will > >> become redundant if we steer by type". > >> > > > > What I means is that if we are going to do away with the concept of 'wp > mem' and replace it with 'special type for steering to ioreq server' then we > only need range types of pci, portio and mmio as we had before. > > I'm still confused. > So am I. I thought I was being quite clear. 
> To me, there are two distinct things here: the interface and the > implementation. > Indeed. > From an interface perspective, ioreq servers need to be able to ask Xen > to intercept writes to specific memory regions. No. You insisted on making MMIO distinct from 'memory'. We need to be able to intercept MMIO, port IO and PCI access to specific regions. > I insisted that this > interface be specifically designed for memory, not calling them MMIO > regions (since it's actually memory and not MMIO regions); and I said > that the explosion in number of required rangesets demonstrated that > rangesets were a bad fit to implement this interface. > The intended overloading of MMIO for this purpose was a pragmatic choice at the time, however it appears pragmatism has no place here. > What you did in an earlier version of this series (correct me if I'm > wrong) is to make a separate hypercall for memory, but still keep using > the same internal implementation (i.e., still having a write_dm p2m type > and using rangesets to determine which ioreq server to give the > notification to). But since the interface for memory looks exactly the > same as the interface for MMIO, at some point this morphed back to "use > xen_hvm_io_range but with a different range type (i.e., WP_MEM)". > Yes, and that's now pointless since we're going to use purely p2m types for sending memory accesses to ioreq servers. > From an *interface* perspective that makes sense, because in both cases > you want to be able to specify a domain, an ioreq server, and a range of > physical addresses. I don't have any objection to the change you made > to hvm_op.h in this version of the series. > No, if we are intercepting 'memory' purely by p2m type then there is no need for the additional range type. 
> But from an *implementation* perspective, you're back to the very thing > I objected to -- having a large and potentially unbounded number of > rangesets, and baking that implementation into the interface by > requiring the administrator to specify the maxiumum number of rangesets > with a HVM_PARAM. > > My suggestion was that you retain the same interface that you exposed in > hvm_op.h (adding WP_MEM), but instead of using rangesets to determine > which ioreq server to send the notification to, use p2m types instead. Ok. But ultimately we want multiple p2m types for this purpose and the ability to send accesses to a specific type to a specific ioreq server. > The interface to qemu can be exactly the same as it is now, but you > won't need to increase the number of rangesets. In fact, if you only > allow one ioreq server to register for WP_MEM at a time, then you don't > even need to change the p2m code. > Who said anything about QEMU? It's not relevant here. Functionally maybe not at this time *but* that limit of a single ioreq server should not be baked into the interface forever more. > One way or another, you still want the ioreq server to be able to > intercept writes to guest memory, so I don't understand in what way "we > are going to do away with the concept of 'wp mem'". > As I have said previously on this thread, 'wp mem' should be replaced by a 'steer to ioreq server' type. That fulfils the purpose and makes it clear what the type is for. Then additional 'steer to a different ioreq server' types can be added later when we have sufficient type-space. Paul > The only question is whether it's going to go through xen_hvm_io_range > or some other hypecall -- and I don't particularly care which one it > ends up being, as long as it doesn't require setting this magic > potentially unbounded parameter. > > -George
On 22/02/16 17:01, Paul Durrant wrote: >> What you did in an earlier version of this series (correct me if I'm >> wrong) is to make a separate hypercall for memory, but still keep using >> the same internal implementation (i.e., still having a write_dm p2m type >> and using rangesets to determine which ioreq server to give the >> notification to). But since the interface for memory looks exactly the >> same as the interface for MMIO, at some point this morphed back to "use >> xen_hvm_io_range but with a different range type (i.e., WP_MEM)". >> > > Yes, and that's now pointless since we're going to use purely p2m types for sending memory accesses to ioreq servers. > >> From an *interface* perspective that makes sense, because in both cases >> you want to be able to specify a domain, an ioreq server, and a range of >> physical addresses. I don't have any objection to the change you made >> to hvm_op.h in this version of the series. >> > > No, if we are intercepting 'memory' purely by p2m type then there is no need for the additional range type. So here seems to be the crux of our disagreement. I don't understand why you think that the WP_MEM interface described in patch 2 of this series can't be implemented using p2m types rather than rangesets. -George
> -----Original Message----- > From: George Dunlap [mailto:george.dunlap@citrix.com] > Sent: 22 February 2016 17:23 > To: Paul Durrant > Cc: Jan Beulich; Kevin Tian; Zhang Yu; Wei Liu; Ian Campbell; Andrew Cooper; > xen-devel@lists.xen.org; Stefano Stabellini; Zhiyuan Lv; Ian Jackson; Keir > (Xen.org) > Subject: Re: [Xen-devel] [PATCH v3 3/3] tools: introduce parameter > max_wp_ram_ranges. > > On 22/02/16 17:01, Paul Durrant wrote: > >> What you did in an earlier version of this series (correct me if I'm > >> wrong) is to make a separate hypercall for memory, but still keep using > >> the same internal implementation (i.e., still having a write_dm p2m type > >> and using rangesets to determine which ioreq server to give the > >> notification to). But since the interface for memory looks exactly the > >> same as the interface for MMIO, at some point this morphed back to "use > >> xen_hvm_io_range but with a different range type (i.e., WP_MEM)". > >> > > > > Yes, and that's now pointless since we're going to use purely p2m types for > sending memory accesses to ioreq servers. > > > >> From an *interface* perspective that makes sense, because in both cases > >> you want to be able to specify a domain, an ioreq server, and a range of > >> physical addresses. I don't have any objection to the change you made > >> to hvm_op.h in this version of the series. > >> > > > > No, if we are intercepting 'memory' purely by p2m type then there is no > need for the additional range type. > > So here seems to be the crux of our disagreement. > > I don't understand why you think that the WP_MEM interface described in > patch 2 of this series can't be implemented using p2m types rather than > rangesets. > Huh? I just said that's exactly how I intend to implement it: "if we are intercepting 'memory' purely by p2m type then there is no need for the additional range type" I.e. we will not intercept memory accesses by range, we will only intercept them by type. Paul > -George
diff --git a/docs/man/xl.cfg.pod.5 b/docs/man/xl.cfg.pod.5
index 8899f75..c294fd3 100644
--- a/docs/man/xl.cfg.pod.5
+++ b/docs/man/xl.cfg.pod.5
@@ -962,6 +962,32 @@
 FIFO-based event channel ABI support up to 131,071 event channels.
 Other guests are limited to 4095 (64-bit x86 and ARM) or 1023 (32-bit x86).
 
+=item B<max_wp_ram_ranges=N>
+
+Limit the maximum write-protected ram ranges that can be tracked
+inside one ioreq server rangeset.
+
+Ioreq server uses a group of rangesets to track the I/O or memory
+resources to be emulated. The default limit of ranges that one rangeset
+can allocate is set to a small value, due to the fact that these
+ranges are allocated in xen heap. Yet for the write-protected ram
+ranges, there are circumstances under which the upper limit inside
+one rangeset should exceed the default one. E.g. in Intel GVT-g,
+when tracking the PPGTT (per-process graphic translation tables) on
+Intel Broadwell platforms, the number of page tables concerned will
+be of several thousand.
+
+For Intel GVT-g on Broadwell platforms, 8192 is a suggested value for
+this parameter in most cases. But users who set this item explicitly
+are also supposed to know the specific scenarios that necessitate
+this configuration. Especially when this parameter is used:
+1> for virtual devices other than vGPU in GVT-g;
+2> for GVT-g, there also might be some extreme cases, e.g. too many
+graphic related applications in one VM, which create a great deal of
+per-process graphic translation tables;
+3> for GVT-g, future cpu platforms which provide even more per-process
+graphic translation tables.
+
 =back
 
 =head2 Paravirtualised (PV) Guest Specific Options
diff --git a/tools/libxl/libxl.h b/tools/libxl/libxl.h
index fa87f53..18828c5 100644
--- a/tools/libxl/libxl.h
+++ b/tools/libxl/libxl.h
@@ -136,6 +136,11 @@
 #define LIBXL_HAVE_BUILDINFO_EVENT_CHANNELS 1
 
 /*
+ * libxl_domain_build_info has the u.hvm.max_wp_ram_ranges field.
+ */
+#define LIBXL_HAVE_BUILDINFO_HVM_MAX_WP_RAM_RANGES 1
+
+/*
  * libxl_domain_build_info has the u.hvm.ms_vm_genid field.
  */
 #define LIBXL_HAVE_BUILDINFO_HVM_MS_VM_GENID 1
diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c
index 2269998..54173cb 100644
--- a/tools/libxl/libxl_dom.c
+++ b/tools/libxl/libxl_dom.c
@@ -288,6 +288,9 @@ static void hvm_set_conf_params(xc_interface *handle, uint32_t domid,
                     libxl_defbool_val(info->u.hvm.nested_hvm));
     xc_hvm_param_set(handle, domid, HVM_PARAM_ALTP2M,
                     libxl_defbool_val(info->u.hvm.altp2m));
+    if (info->u.hvm.max_wp_ram_ranges > 0)
+        xc_hvm_param_set(handle, domid, HVM_PARAM_MAX_WP_RAM_RANGES,
+                         info->u.hvm.max_wp_ram_ranges);
 }
 
 int libxl__build_pre(libxl__gc *gc, uint32_t domid,
diff --git a/tools/libxl/libxl_types.idl b/tools/libxl/libxl_types.idl
index 9ad7eba..9185014 100644
--- a/tools/libxl/libxl_types.idl
+++ b/tools/libxl/libxl_types.idl
@@ -518,6 +518,7 @@ libxl_domain_build_info = Struct("domain_build_info",[
                                        ("serial_list", libxl_string_list),
                                        ("rdm", libxl_rdm_reserve),
                                        ("rdm_mem_boundary_memkb", MemKB),
+                                       ("max_wp_ram_ranges", uint64),
                                        ])),
                 ("pv", Struct(None, [("kernel", string),
                                      ("slack_memkb", MemKB),
diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c
index 25507c7..0c19dee 100644
--- a/tools/libxl/xl_cmdimpl.c
+++ b/tools/libxl/xl_cmdimpl.c
@@ -35,6 +35,7 @@
 #include <inttypes.h>
 #include <limits.h>
 #include <xen/hvm/e820.h>
+#include <xenctrl.h>
 
 #include "libxl.h"
 #include "libxl_utils.h"
@@ -1626,6 +1627,22 @@ static void parse_config_data(const char *config_source,
 
         if (!xlu_cfg_get_long (config, "rdm_mem_boundary", &l, 0))
             b_info->u.hvm.rdm_mem_boundary_memkb = l * 1024;
+
+        if (!xlu_cfg_get_long (config, "max_wp_ram_ranges", &l, 0)) {
+            uint64_t nr_pages = (b_info->max_memkb << 10) >> XC_PAGE_SHIFT;
+
+            /* Due to rangeset's ability to combine continuous ranges, this
+             * parameter shall not be configured with values greater than half
+             * of the number of VM's page frames. It also shall not exceed 4G,
+             * because of the limitation from the rangeset side. */
+            if (l > (nr_pages / 2) || l > UINT32_MAX) {
+                fprintf(stderr, "ERROR: Invalid value for \"max_wp_ram_ranges\". "
+                        "Shall not exceed %" PRIu64 " or 4G.\n", nr_pages / 2);
+                exit(1);
+            }
+            b_info->u.hvm.max_wp_ram_ranges = l;
+        }
+
         break;
     case LIBXL_DOMAIN_TYPE_PV:
     {
diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index e0d998f..7339833 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -940,6 +940,8 @@ static int hvm_ioreq_server_alloc_rangesets(struct hvm_ioreq_server *s,
 {
     unsigned int i;
     int rc;
+    unsigned int max_wp_ram_ranges =
+        s->domain->arch.hvm_domain.params[HVM_PARAM_MAX_WP_RAM_RANGES];
 
     if ( is_default )
         goto done;
@@ -962,7 +964,10 @@ static int hvm_ioreq_server_alloc_rangesets(struct hvm_ioreq_server *s,
         if ( !s->range[i] )
             goto fail;
 
-        rangeset_limit(s->range[i], MAX_NR_IO_RANGES);
+        rangeset_limit(s->range[i],
+                       ((i == HVMOP_IO_RANGE_WP_MEM) ?
+                        max_wp_ram_ranges :
+                        MAX_NR_IO_RANGES));
     }
 
  done:
diff --git a/xen/include/public/hvm/params.h b/xen/include/public/hvm/params.h
index 81f9451..ab3b11d 100644
--- a/xen/include/public/hvm/params.h
+++ b/xen/include/public/hvm/params.h
@@ -210,6 +210,9 @@
 /* Boolean: Enable altp2m */
 #define HVM_PARAM_ALTP2M    35
 
-#define HVM_NR_PARAMS       36
+/* Max write-protected ram ranges to be tracked in one ioreq server rangeset */
+#define HVM_PARAM_MAX_WP_RAM_RANGES 36
+
+#define HVM_NR_PARAMS       37
 
 #endif /* __XEN_PUBLIC_HVM_PARAMS_H__ */
A new parameter - max_wp_ram_ranges is added to set the upper limit
of write-protected ram ranges to be tracked inside one ioreq server
rangeset.

Ioreq server uses a group of rangesets to track the I/O or memory
resources to be emulated. The default limit of ranges that one
rangeset can allocate is set to a small value, due to the fact that
these ranges are allocated in xen heap. Yet for the write-protected
ram ranges, there are circumstances under which the upper limit
inside one rangeset should exceed the default one. E.g. in XenGT,
when tracking the PPGTTs (per-process graphic translation tables)
on Intel BDW platforms, the number of page tables concerned will be
of several thousand.

For XenGT running on Intel BDW platforms, 8192 is a suggested value
for this parameter in most cases. But users who set this item
explicitly are also supposed to know the specific scenarios that
necessitate this configuration. Especially when this parameter is
used:
1> for virtual devices other than vGPUs in XenGT;
2> for XenGT, there also might be some extreme cases, e.g. too many
graphic related applications in one VM, which create a great deal
of per-process graphic translation tables;
3> for XenGT, future cpu platforms which provide even more
per-process graphic translation tables.

Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
---
 docs/man/xl.cfg.pod.5           | 26 ++++++++++++++++++++++++++
 tools/libxl/libxl.h             |  5 +++++
 tools/libxl/libxl_dom.c         |  3 +++
 tools/libxl/libxl_types.idl     |  1 +
 tools/libxl/xl_cmdimpl.c        | 17 +++++++++++++++++
 xen/arch/x86/hvm/hvm.c          |  7 ++++++-
 xen/include/public/hvm/params.h |  5 ++++-
 7 files changed, 62 insertions(+), 2 deletions(-)
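For reference, a guest configuration using the proposed parameter might look like the fragment below. This is an illustrative xl.cfg sketch under the assumption that the patch is applied; the memory size and the 8192 value are just the examples suggested in the commit message, not requirements:

```
# Illustrative xl guest config fragment for a GVT-g (XenGT) HVM guest
builder = "hvm"
memory  = 4096          # 4 GiB => ~1M page frames, so 8192 << nr_pages/2

# Raise the WP_MEM rangeset limit so PPGTT tracking on BDW does not
# exhaust the default per-rangeset budget. Values above nr_pages/2 or
# 4G are rejected by parse_config_data().
max_wp_ram_ranges = 8192
```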