mbox series

[for-4.17,v1,0/2] xenctrl.ml: improve scalability of domain_getinfolist

Message ID cover.1667324874.git.edvin.torok@citrix.com (mailing list archive)
Headers show
Series xenctrl.ml: improve scalability of domain_getinfolist | expand

Message

Edwin Török Nov. 1, 2022, 5:59 p.m. UTC
Pau has performed some performance tests by booting 1000 mirage
unikernels test VMs and shutting them down.
We've noticed on the flamegraphs that a lot of time is spent in Xenctrl
`domain_getinfolist`, 17.7% of overall time
(This needs to be multiplied by 16 because Dom0 100% usage = 16 vCPUs)
In particular time is spent in camlXenctrl___getlist_339
as can be seen from this flamegraph, and it also creates a very deep
call stack:
https://cdn.jsdelivr.net/gh/edwintorok/xen@xenctrl-coverletter/docs/tmp/perf-merge-boot.svg?x=948.9&y=2213

After some algorithmic improvements to the code now the function barely
shows up at all on a flamegraph, taking only 0.02%.
The function is called camlXenctrl___getlist_343, but that is just due
to the changed arguments, still the same function:
https://cdn.jsdelivr.net/gh/edwintorok/xen@xenctrl-coverletter/docs/tmp/perf-xen-boot-1150.svg?x=1188.0&y=1941&s=infolist

It was calling the Xen hypercall ~500*1000 times for 1000 VMs, and
instead it is now calling it "only" 1000 times.

I would suggest to try to take this in 4.17 given the massive
improvement in scalability (number of VMs on a Xen host).

There are further improvements possible here, but they'll be in xenopsd
(part of XAPI) to avoid calling domain_getinfolist and just use
domain_getinfo: the only reason it needs use infolist is that it does
the lookup by VM UUID and not by domid, but it could have a small cache
of UUID->domid mappings and then call just domain_getinfo (or get the
mapping from xenstore if not in the cache), but it looks like that
improvement is not even needed if this function barely registers on a
flamegraph now.

P.S.: the mirage test VM is a very old PV version, at some point we'll
repeat the test with a Solo5 based PVH one.

Edwin Török (2):
  xenctrl.ml: make domain_getinfolist tail recursive
  xenctrl: use larger chunksize in domain_getinfolist

 tools/ocaml/libs/xc/xenctrl.ml | 25 ++++++++++++++++++-------
 1 file changed, 18 insertions(+), 7 deletions(-)

Comments

Christian Lindig Nov. 2, 2022, 9:11 a.m. UTC | #1
> On 1 Nov 2022, at 17:59, Edwin Török <edvin.torok@citrix.com> wrote:
> 
> 
> Edwin Török (2):
>  xenctrl.ml: make domain_getinfolist tail recursive
>  xenctrl: use larger chunksize in domain_getinfolist
> 
> tools/ocaml/libs/xc/xenctrl.ml | 25 ++++++++++++++++++-------
> 1 file changed, 18 insertions(+), 7 deletions(-)

Acked-by: Christian Lindig <christian.lindig@citrix.com>


> It was calling the Xen domainfolist hypercall N/2 times.
> Optimize this such that it is called at most 2 times during normal use.
> 
> Implement a tail recursive `rev_concat` equivalent to `concat |> rev`,
> and use it instead of calling `@` multiple times.

Are there any assurances about the order in elements returned by domain_getinfolist? I understand that the change maintains the current behaviour but are we even required to maintain that order? Because otherwise we could return the reverse list and save more work.

— C
Edwin Török Nov. 2, 2022, 9:31 a.m. UTC | #2
> On 2 Nov 2022, at 09:11, Christian Lindig <christian.lindig@citrix.com> wrote:
> 
> 
> 
>> On 1 Nov 2022, at 17:59, Edwin Török <edvin.torok@citrix.com> wrote:
>> 
>> 
>> Edwin Török (2):
>> xenctrl.ml: make domain_getinfolist tail recursive
>> xenctrl: use larger chunksize in domain_getinfolist
>> 
>> tools/ocaml/libs/xc/xenctrl.ml | 25 ++++++++++++++++++-------
>> 1 file changed, 18 insertions(+), 7 deletions(-)
> 
> Acked-by: Christian Lindig <christian.lindig@citrix.com>
> 
> 
>> It was calling the Xen domainfolist hypercall N/2 times.
>> Optimize this such that it is called at most 2 times during normal use.
>> 
>> Implement a tail recursive `rev_concat` equivalent to `concat |> rev`,
>> and use it instead of calling `@` multiple times.
> 
> Are there any assurances about the order in elements returned by domain_getinfolist? I understand that the change maintains the current behaviour but are we even required to maintain that order? Because otherwise we could return the reverse list and save more work.


Maintaining the current (reversed) order in the C stubs is useful to be able to extract the largest domid so we know where to continue.
However we don't necessarily have to maintain the order on the higher level function available in the .mli, it just happens that `rev_concat` gives us the results in the order that we want.

Although perhaps using an array rather a list here might be better, and we could move resizing/reallocing that array into the C stub, and have just a nicer (and perhaps faster) interface in the C/OCaml API,
by producing less garbage than lists (a single Array.t allocation rather than N list elements).
(Although I don't usually like pushing complexity like that into C stubs, for now I'm happy with the performance of the code as is).

There are more issues in these C stubs in xenctrl, I'll be pouring over the code and try to send a few more patches.

Best regards,
--Edwin
Edwin Török Nov. 2, 2022, 9:37 a.m. UTC | #3
> On 2 Nov 2022, at 09:11, Christian Lindig <christian.lindig@citrix.com> wrote:
> 
> 
> 
>> On 1 Nov 2022, at 17:59, Edwin Török <edvin.torok@citrix.com> wrote:
>> 
>> 
>> Edwin Török (2):
>> xenctrl.ml: make domain_getinfolist tail recursive
>> xenctrl: use larger chunksize in domain_getinfolist
>> 
>> tools/ocaml/libs/xc/xenctrl.ml | 25 ++++++++++++++++++-------
>> 1 file changed, 18 insertions(+), 7 deletions(-)
> 
> Acked-by: Christian Lindig <christian.lindig@citrix.com>
> 
> 
>> It was calling the Xen domainfolist hypercall N/2 times.
>> Optimize this such that it is called at most 2 times during normal use.
>> 
>> Implement a tail recursive `rev_concat` equivalent to `concat |> rev`,
>> and use it instead of calling `@` multiple times.
> 
> Are there any assurances about the order in elements returned by domain_getinfolist? I understand that the change maintains the current behaviour but are we even required to maintain that order? Because otherwise we could return the reverse list and save more work.


After some discussion with Andrew Cooper apparently the xenctrl API is broken and cannot be used safely as is, so I'll be rewriting this patch.
Domids can be assigned randomly, or toolstacks can set a custom domid, so it is not guaranteed that new domids always show up as higher numbers,
and the only safe way to use infolist is to either request 1 domid, or request them all (domids are 15-bit, so 32768).
Anything else is prone to race conditions and might give you an inconsistent snapshot.

This is probably better fixed by changing the prototype of the C function in xenctrl to not take a count to force updating all callers
(the callers in the 'xl' toolstack are just as buggy as the one in this OCaml binding, and probably better to fix the API in xenctrl than working around the race condition in each caller).

Stay tuned for more patches...

Best regards,
--Edwin
Andrew Cooper Nov. 2, 2022, 5:26 p.m. UTC | #4
Hi,

To be clear, what is presented here are clear improvements in the status
quo, and qualify for inclusion on their own merits.  And definitely
should be considered.


That said, this is a swamp with future problems, and one rather
fundamental one in Xen which I is not going to be easy to fix.

1) (simple), there are a bunch of stubs, including
stub_xc_domain_getinfo() which don't use
caml_{enter,leave}_blocking_section() when they should.

2) stub_xc_domain_getinfo() reuses xc_domain_getinfolist() meaning that
it uses XEN_SYSCTL_getdomaininfolist rather than
XEN_DOMCTL_getdomaininfo, which is a problem because...

3) While both of these hypercalls have crazy APIs leading to loads of
broken callers, at least the DOMCTL has a fastpath for when you specify
a valid domid.  The SYSCTL (and DOMCTL slowpath) is an O(N) search of
all domains starting from d0 to find the first domain with an id >= the
input request.

The DOMCTL slowpath is useless.  Not a single caller (I've ever
encountered) wants that behaviour, whereas I've needed to bugfix caller
which didn't have an "&& info.domid == domid" to have one, to get the
semantics they wanted.  Cleaning this up will be a good improvement.

4) The (adjusted) algorithm in patch 1 still loops until it gets a
result with no entries.  Meaning that it's still going to make one
hypercall too many, searching the entire domlist, to return no data.  In
principle you could optimise to stop at any hypercall which returns
fewer than the requested number of domains, except...

5) ... if you ever use more than a single hypercall, Xen has dropped and
re-acquired the domlist read lock, meaning that you can't use the result
anyway.  Causality couldn't be broken when domains were allocated
sequentially, but we have a random allocation mode now so you can
observe things out of order.

The only safe action is to ask for all 32k domains in a single go, and
use a single hypercall.

~Andrew