mbox series

[v1,0/7] Implement support for external IPT monitoring

Message ID 1548605014.8764792.1592320576239.JavaMail.zimbra@cert.pl (mailing list archive)
Headers show
Series Implement support for external IPT monitoring | expand

Message

Michał Leszczyński June 16, 2020, 3:16 p.m. UTC
Intel Processor Trace is an architectural extension available in modern Intel family CPUs. It allows recording the detailed trace of activity while the processor executes the code. One might use the recorded trace to reconstruct the code flow. It means, to find out the executed code paths, determine branches taken, and so forth.

The abovementioned feature is described in Intel(R) 64 and IA-32 Architectures Software Developer's Manual Volume 3C: System Programming Guide, Part 3, Chapter 36: "Intel Processor Trace."

This patch series implements an interface that Dom0 could use in order to enable IPT for particular vCPUs in DomU, allowing for external monitoring. Such a feature has numerous applications like malware monitoring, fuzzing, or performance testing.

Michal Leszczynski (7):
  x86/vmx: add Intel PT MSR definitions
  x86/vmx: add IPT cpu feature
  x86/vmx: add ipt_state as part of vCPU state
  x86/vmx: add do_vmtrace_op
  tools/libxc: add xc_ptbuf_* functions
  tools/proctrace: add proctrace tool
  x86/vmx: switch IPT MSRs on vmentry/vmexit

 tools/libxc/include/xenctrl.h               |  59 ++++
 tools/libxc/xc_tbuf.c                       | 108 +++++++
 tools/proctrace/COPYING                     | 339 ++++++++++++++++++++
 tools/proctrace/Makefile                    |  49 +++
 tools/proctrace/proctrace.c                 | 139 ++++++++
 xen/arch/x86/hvm/hvm.c                      | 170 ++++++++++
 xen/arch/x86/hvm/vmx/vmx.c                  |  52 +++
 xen/include/asm-x86/cpufeature.h            |   1 +
 xen/include/asm-x86/hvm/hvm.h               |   9 +
 xen/include/asm-x86/hvm/vmx/vmcs.h          |  11 +
 xen/include/asm-x86/msr-index.h             |  37 +++
 xen/include/public/arch-x86/cpufeatureset.h |   1 +
 xen/include/public/hvm/hvm_op.h             |  27 ++
 13 files changed, 1002 insertions(+)
 create mode 100644 tools/proctrace/COPYING
 create mode 100644 tools/proctrace/Makefile
 create mode 100644 tools/proctrace/proctrace.c

--
2.20.1

Comments

Andrew Cooper June 16, 2020, 6:17 p.m. UTC | #1
On 16/06/2020 16:16, Michał Leszczyński wrote:
> Intel Processor Trace is an architectural extension available in modern Intel family CPUs. It allows recording the detailed trace of activity while the processor executes the code. One might use the recorded trace to reconstruct the code flow. It means, to find out the executed code paths, determine branches taken, and so forth.
>
> The abovementioned feature is described in Intel(R) 64 and IA-32 Architectures Software Developer's Manual Volume 3C: System Programming Guide, Part 3, Chapter 36: "Intel Processor Trace."
>
> This patch series implements an interface that Dom0 could use in order to enable IPT for particular vCPUs in DomU, allowing for external monitoring. Such a feature has numerous applications like malware monitoring, fuzzing, or performance testing.

Hello,

I'm very excited to see support like this appearing.  However, be aware
that we're currently in code freeze for the 4.14 release, so in-depth
reviews will probably be delayed somewhat due to our bug queue and
release activities.

That said, I've had a very quick look through the series, and have a few
general questions first.

AFAICT, this is strictly for external monitoring of the VM, not for the
VM to use itself?  If so, it shouldn't have the H tag here:

XEN_CPUFEATURE(IPT,           5*32+25) /*H  Intel Processor Trace */

because that exposes the feature to the guest, with the implication that
all other parts of the feature work as advertised.


Are there any restrictions on EPT being enabled in the first place?  I'm
not aware of any, and in principle we could use this functionality for
PV guests as well (using the CPL filter).  Therefore, I think it would
be helpful to not tie the functionality to HVM guests, even if that is
the only option enabled to start with.

The buffer mapping and creation logic is fairly problematic.  Instead of
fighting with another opencoded example, take a look at the IOREQ
server's use of "acquire resource" which is a mapping interface which
supports allocating memory on behalf of the guest, outside of the guest
memory, for use by control tools.

I think what this wants is a bit somewhere in domain_create to indicate
that external tracing is used for this domain (and allocate whatever
structures/buffers are necessary), acquire resource to map the buffers
themselves, and a domctl for any necessary runtime controls.


What semantics do you want for the buffer becoming full?  Given that
debugging/tracing is the goal, I presume "pause vcpu on full" is the
preferred behaviour, rather than drop packets on full?


When this subject was broached on xen-devel before, one issue was the
fact that all actions which are intercepted don't end up writing any
appropriate packets.  This is perhaps less of an issue for this example,
where the external agent can see VMExits in the trace, but it still
results in missing information.  (It is a major problem for PT within
the guest, and needs Xen's intercept/emulation framework being updated
to be PT-aware so it can fill in the same packets which hardware would
have done for equivalent actions.)


Thanks,

~Andrew
Michał Leszczyński June 16, 2020, 6:47 p.m. UTC | #2
----- 16 cze 2020 o 20:17, Andrew Cooper andrew.cooper3@citrix.com napisał(a):

> On 16/06/2020 16:16, Michał Leszczyński wrote:
>> Intel Processor Trace is an architectural extension available in modern Intel
>> family CPUs. It allows recording the detailed trace of activity while the
>> processor executes the code. One might use the recorded trace to reconstruct
>> the code flow. It means, to find out the executed code paths, determine
>> branches taken, and so forth.
>>
>> The abovementioned feature is described in Intel(R) 64 and IA-32 Architectures
>> Software Developer's Manual Volume 3C: System Programming Guide, Part 3,
>> Chapter 36: "Intel Processor Trace."
>>
>> This patch series implements an interface that Dom0 could use in order to enable
>> IPT for particular vCPUs in DomU, allowing for external monitoring. Such a
>> feature has numerous applications like malware monitoring, fuzzing, or
>> performance testing.
> 
> Hello,
> 
> I'm very excited to see support like this appearing.  However, be aware
> that we're currently in code freeze for the 4.14 release, so in-depth
> reviews will probably be delayed somewhat due to our bug queue and
> release activities.

Sure, take your time :)


> 
> That said, I've had a very quick look through the series, and have a few
> general questions first.
> 
> AFAICT, this is strictly for external monitoring of the VM, not for the
> VM to use itself?  If so, it shouldn't have the H tag here:
> 
> XEN_CPUFEATURE(IPT,           5*32+25) /*H  Intel Processor Trace */
> 
> because that exposes the feature to the guest, with the implication that
> all other parts of the feature work as advertised.

Ok, I will remove the H tag.


> 
> 
> Are there any restrictions on EPT being enabled in the first place?  I'm
> not aware of any, and in principle we could use this functionality for
> PV guests as well (using the CPL filter).  Therefore, I think it would
> be helpful to not tie the functionality to HVM guests, even if that is
> the only option enabled to start with.

I think at the moment it's not required to have EPT. This patch series doesn't use any translation feature flags, so the output address is always a machine physical address, regardless of context. I will check if it could be easily used with PV.


> 
> The buffer mapping and creation logic is fairly problematic.  Instead of
> fighting with another opencoded example, take a look at the IOREQ
> server's use of "acquire resource" which is a mapping interface which
> supports allocating memory on behalf of the guest, outside of the guest
> memory, for use by control tools.
> 
> I think what this wants is a bit somewhere in domain_create to indicate
> that external tracing is used for this domain (and allocate whatever
> structures/buffers are necessary), acquire resource to map the buffers
> themselves, and a domctl for any necessary runtime controls.
> 

I will check this out, this sounds like a good option as it would remove lots of complexity from the existing ipt_enable domctl.

> 
> What semantics do you want for the buffer becoming full?  Given that
> debugging/tracing is the goal, I presume "pause vcpu on full" is the
> preferred behaviour, rather than drop packets on full?
> 

Right now this is a ring-style buffer and when it would become full it would simply wrap and override the old data.

> 
> When this subject was broached on xen-devel before, one issue was the
> fact that all actions which are intercepted don't end up writing any
> appropriate packets.  This is perhaps less of an issue for this example,
> where the external agent can see VMExits in the trace, but it still
> results in missing information.  (It is a major problem for PT within
> the guest, and needs Xen's intercept/emulation framework being updated
> to be PT-aware so it can fill in the same packets which hardware would
> have done for equivalent actions.)

Ok, this sounds like a hard issue. Could you point out what could be the particular problematic cases? For instance, if something would alter EIP/RIP or CR3 then I belive it would still be recorded in PT trace (i.e. these values will be logged on VM entry).

> 
> 
> Thanks,
> 
> ~Andrew


Best regards,
Michał Leszczyński
CERT Polska
Andrew Cooper June 16, 2020, 8:16 p.m. UTC | #3
On 16/06/2020 19:47, Michał Leszczyński wrote:
> ----- 16 cze 2020 o 20:17, Andrew Cooper andrew.cooper3@citrix.com napisał(a):
>
>> Are there any restrictions on EPT being enabled in the first place?  I'm
>> not aware of any, and in principle we could use this functionality for
>> PV guests as well (using the CPL filter).  Therefore, I think it would
>> be helpful to not tie the functionality to HVM guests, even if that is
>> the only option enabled to start with.
> I think at the moment it's not required to have EPT. This patch series doesn't use any translation feature flags, so the output address is always a machine physical address, regardless of context. I will check if it could be easily used with PV.

If its trivial to add PV support then please do.  If its not, then don't
feel obliged, but please do at least consider how PV support might look
in the eventual feature.

(Generally speaking, considering "how would I make this work in other
modes where it is possible" leads to a better design.)

>> The buffer mapping and creation logic is fairly problematic.  Instead of
>> fighting with another opencoded example, take a look at the IOREQ
>> server's use of "acquire resource" which is a mapping interface which
>> supports allocating memory on behalf of the guest, outside of the guest
>> memory, for use by control tools.
>>
>> I think what this wants is a bit somewhere in domain_create to indicate
>> that external tracing is used for this domain (and allocate whatever
>> structures/buffers are necessary), acquire resource to map the buffers
>> themselves, and a domctl for any necessary runtime controls.
>>
> I will check this out, this sounds like a good option as it would remove lots of complexity from the existing ipt_enable domctl.

Xen has traditionally opted for a "and turn this extra thing on
dynamically" model, but this has caused no end of security issues and
broken corner cases.

You can see this still existing in the difference between
XEN_DOMCTL_createdomain and XEN_DOMCTL_max_vcpus, (the latter being
required to chose the number of vcpus for the domain) and we're making
good progress undoing this particular wart (before 4.13, it was
concerning easy to get Xen to fall over a NULL d->vcpu[] pointer by
issuing other hypercalls between these two).

There is a lot of settings which should be immutable for the lifetime of
the domain, and external monitoring looks like another one of these. 
Specifying it at createdomain time allows for far better runtime
behaviour (you are no longer in a situation where the first time you try
to turn tracing on, you end up with -ENOMEM because another VM booted in
the meantime and used the remaining memory), and it makes for rather
more simple code in Xen itself (at runtime, you can rely on it having
been set up properly, because a failure setting up will have killed the
domain already).

>> What semantics do you want for the buffer becoming full?  Given that
>> debugging/tracing is the goal, I presume "pause vcpu on full" is the
>> preferred behaviour, rather than drop packets on full?
>>
> Right now this is a ring-style buffer and when it would become full it would simply wrap and override the old data.

How does the consumer spot that the data has wrapped?  What happens if
data starts getting logged, but noone is listening?  What happens if the
consumer exits/crashes/etc and stops listening as a consequence?

It's fine to simply state what will happen, and possibly even "don't do
that then", but the corner cases do at least need thinking about.

>> When this subject was broached on xen-devel before, one issue was the
>> fact that all actions which are intercepted don't end up writing any
>> appropriate packets.  This is perhaps less of an issue for this example,
>> where the external agent can see VMExits in the trace, but it still
>> results in missing information.  (It is a major problem for PT within
>> the guest, and needs Xen's intercept/emulation framework being updated
>> to be PT-aware so it can fill in the same packets which hardware would
>> have done for equivalent actions.)
> Ok, this sounds like a hard issue. Could you point out what could be the particular problematic cases? For instance, if something would alter EIP/RIP or CR3 then I belive it would still be recorded in PT trace (i.e. these values will be logged on VM entry).

One easy case is what happens on a Pstate transition while in the
hypervisor.  That won't be recorded.  (Perhaps this bit of data isn't
terribly interesting.)

More complicated cases exist when you start combining Xen features. 
E.g. with Introspection, a function pointer call which happens to set a
pagetable access bit bit which is write-protected will trap for
emulation, and be completed by the emulator (this is far faster than
pausing the domain, changing EPT permissions, singlestepping the vcpu,
then reinstating reduced EPT permissions).

In this case, no TIP would be generated unless the x86 emulator were
updated to know how to do this.

~Andrew
Tian, Kevin June 17, 2020, 1:35 a.m. UTC | #4
+Luwei, who developed PT for KVM and is the best one who can help
review VMX changes from Intel side. Please include him in future
post or discussion.

> -----Original Message-----
> From: Michał Leszczyński <michal.leszczynski@cert.pl>
> Sent: Wednesday, June 17, 2020 2:48 AM
> To: Andrew Cooper <andrew.cooper3@citrix.com>
> Cc: Xen-devel <xen-devel@lists.xenproject.org>; Jan Beulich
> <jbeulich@suse.com>; Wei Liu <wl@xen.org>; Roger Pau Monné
> <roger.pau@citrix.com>; Nakajima, Jun <jun.nakajima@intel.com>; Tian,
> Kevin <kevin.tian@intel.com>; George Dunlap <george.dunlap@citrix.com>;
> Ian Jackson <ian.jackson@eu.citrix.com>; Julien Grall <julien@xen.org>;
> Stefano Stabellini <sstabellini@kernel.org>
> Subject: Re: [PATCH v1 0/7] Implement support for external IPT monitoring
> 
> ----- 16 cze 2020 o 20:17, Andrew Cooper andrew.cooper3@citrix.com
> napisał(a):
> 
> > On 16/06/2020 16:16, Michał Leszczyński wrote:
> >> Intel Processor Trace is an architectural extension available in modern
> Intel
> >> family CPUs. It allows recording the detailed trace of activity while the
> >> processor executes the code. One might use the recorded trace to
> reconstruct
> >> the code flow. It means, to find out the executed code paths, determine
> >> branches taken, and so forth.
> >>
> >> The abovementioned feature is described in Intel(R) 64 and IA-32
> Architectures
> >> Software Developer's Manual Volume 3C: System Programming Guide,
> Part 3,
> >> Chapter 36: "Intel Processor Trace."
> >>
> >> This patch series implements an interface that Dom0 could use in order to
> enable
> >> IPT for particular vCPUs in DomU, allowing for external monitoring. Such a
> >> feature has numerous applications like malware monitoring, fuzzing, or
> >> performance testing.
> >
> > Hello,
> >
> > I'm very excited to see support like this appearing.  However, be aware
> > that we're currently in code freeze for the 4.14 release, so in-depth
> > reviews will probably be delayed somewhat due to our bug queue and
> > release activities.
> 
> Sure, take your time :)
> 
> 
> >
> > That said, I've had a very quick look through the series, and have a few
> > general questions first.
> >
> > AFAICT, this is strictly for external monitoring of the VM, not for the
> > VM to use itself?  If so, it shouldn't have the H tag here:
> >
> > XEN_CPUFEATURE(IPT,           5*32+25) /*H  Intel Processor Trace */
> >
> > because that exposes the feature to the guest, with the implication that
> > all other parts of the feature work as advertised.
> 
> Ok, I will remove the H tag.
> 
> 
> >
> >
> > Are there any restrictions on EPT being enabled in the first place?  I'm
> > not aware of any, and in principle we could use this functionality for
> > PV guests as well (using the CPL filter).  Therefore, I think it would
> > be helpful to not tie the functionality to HVM guests, even if that is
> > the only option enabled to start with.
> 
> I think at the moment it's not required to have EPT. This patch series doesn't
> use any translation feature flags, so the output address is always a machine
> physical address, regardless of context. I will check if it could be easily used
> with PV.
> 
> 
> >
> > The buffer mapping and creation logic is fairly problematic.  Instead of
> > fighting with another opencoded example, take a look at the IOREQ
> > server's use of "acquire resource" which is a mapping interface which
> > supports allocating memory on behalf of the guest, outside of the guest
> > memory, for use by control tools.
> >
> > I think what this wants is a bit somewhere in domain_create to indicate
> > that external tracing is used for this domain (and allocate whatever
> > structures/buffers are necessary), acquire resource to map the buffers
> > themselves, and a domctl for any necessary runtime controls.
> >
> 
> I will check this out, this sounds like a good option as it would remove lots of
> complexity from the existing ipt_enable domctl.
> 
> >
> > What semantics do you want for the buffer becoming full?  Given that
> > debugging/tracing is the goal, I presume "pause vcpu on full" is the
> > preferred behaviour, rather than drop packets on full?
> >
> 
> Right now this is a ring-style buffer and when it would become full it would
> simply wrap and override the old data.
> 
> >
> > When this subject was broached on xen-devel before, one issue was the
> > fact that all actions which are intercepted don't end up writing any
> > appropriate packets.  This is perhaps less of an issue for this example,
> > where the external agent can see VMExits in the trace, but it still
> > results in missing information.  (It is a major problem for PT within
> > the guest, and needs Xen's intercept/emulation framework being updated
> > to be PT-aware so it can fill in the same packets which hardware would
> > have done for equivalent actions.)
> 
> Ok, this sounds like a hard issue. Could you point out what could be the
> particular problematic cases? For instance, if something would alter EIP/RIP
> or CR3 then I belive it would still be recorded in PT trace (i.e. these values will
> be logged on VM entry).
> 
> >
> >
> > Thanks,
> >
> > ~Andrew
> 
> 
> Best regards,
> Michał Leszczyński
> CERT Polska
Tamas K Lengyel June 17, 2020, 3:02 a.m. UTC | #5
On Tue, Jun 16, 2020 at 2:17 PM Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>
> On 16/06/2020 19:47, Michał Leszczyński wrote:
> > ----- 16 cze 2020 o 20:17, Andrew Cooper andrew.cooper3@citrix.com napisał(a):
> >
> >> Are there any restrictions on EPT being enabled in the first place?  I'm
> >> not aware of any, and in principle we could use this functionality for
> >> PV guests as well (using the CPL filter).  Therefore, I think it would
> >> be helpful to not tie the functionality to HVM guests, even if that is
> >> the only option enabled to start with.
> > I think at the moment it's not required to have EPT. This patch series doesn't use any translation feature flags, so the output address is always a machine physical address, regardless of context. I will check if it could be easily used with PV.
>
> If its trivial to add PV support then please do.  If its not, then don't
> feel obliged, but please do at least consider how PV support might look
> in the eventual feature.
>
> (Generally speaking, considering "how would I make this work in other
> modes where it is possible" leads to a better design.)
>
> >> The buffer mapping and creation logic is fairly problematic.  Instead of
> >> fighting with another opencoded example, take a look at the IOREQ
> >> server's use of "acquire resource" which is a mapping interface which
> >> supports allocating memory on behalf of the guest, outside of the guest
> >> memory, for use by control tools.
> >>
> >> I think what this wants is a bit somewhere in domain_create to indicate
> >> that external tracing is used for this domain (and allocate whatever
> >> structures/buffers are necessary), acquire resource to map the buffers
> >> themselves, and a domctl for any necessary runtime controls.
> >>
> > I will check this out, this sounds like a good option as it would remove lots of complexity from the existing ipt_enable domctl.
>
> Xen has traditionally opted for a "and turn this extra thing on
> dynamically" model, but this has caused no end of security issues and
> broken corner cases.
>
> You can see this still existing in the difference between
> XEN_DOMCTL_createdomain and XEN_DOMCTL_max_vcpus, (the latter being
> required to chose the number of vcpus for the domain) and we're making
> good progress undoing this particular wart (before 4.13, it was
> concerning easy to get Xen to fall over a NULL d->vcpu[] pointer by
> issuing other hypercalls between these two).
>
> There is a lot of settings which should be immutable for the lifetime of
> the domain, and external monitoring looks like another one of these.
> Specifying it at createdomain time allows for far better runtime
> behaviour (you are no longer in a situation where the first time you try
> to turn tracing on, you end up with -ENOMEM because another VM booted in
> the meantime and used the remaining memory), and it makes for rather
> more simple code in Xen itself (at runtime, you can rely on it having
> been set up properly, because a failure setting up will have killed the
> domain already).

I'm not in favor of this being a flag that gets set during domain
creation time. It could certainly be the case that some users would
want this being on from the start till the end but in other cases you
may want to enable it intermittently only for some time in-between
particular events. If it's an on/off flag during domain creation you
pretty much force that choice on the users and while the overhead of
PT is better than say MTF it's certainly not nothing. In case there is
an OOM situation enabling IPT dynamically the user can always just
pause the VM and wait till memory becomes available.

>
> >> What semantics do you want for the buffer becoming full?  Given that
> >> debugging/tracing is the goal, I presume "pause vcpu on full" is the
> >> preferred behaviour, rather than drop packets on full?
> >>
> > Right now this is a ring-style buffer and when it would become full it would simply wrap and override the old data.
>
> How does the consumer spot that the data has wrapped?  What happens if
> data starts getting logged, but noone is listening?  What happens if the
> consumer exits/crashes/etc and stops listening as a consequence?
>
> It's fine to simply state what will happen, and possibly even "don't do
> that then", but the corner cases do at least need thinking about.

AFAIU the current use-case is predominantly to be used in conjunction
with VMI events where you want to be able to see the trace leading up
to a particular vmexit. So in the case when the buffer is wrapped
in-between events and data is lost that's not really of concern.

Tamas
Luwei Kang June 17, 2020, 6:45 a.m. UTC | #6
> -----Original Message-----
> From: Tian, Kevin <kevin.tian@intel.com>
> Sent: Wednesday, June 17, 2020 9:35 AM
> To: Michał Leszczyński <michal.leszczynski@cert.pl>; Andrew Cooper
> <andrew.cooper3@citrix.com>
> Cc: Xen-devel <xen-devel@lists.xenproject.org>; Jan Beulich
> <jbeulich@suse.com>; Wei Liu <wl@xen.org>; Roger Pau Monné
> <roger.pau@citrix.com>; Nakajima, Jun <jun.nakajima@intel.com>; George
> Dunlap <george.dunlap@citrix.com>; Ian Jackson <ian.jackson@eu.citrix.com>;
> Julien Grall <julien@xen.org>; Stefano Stabellini <sstabellini@kernel.org>;
> Kang, Luwei <luwei.kang@intel.com>
> Subject: RE: [PATCH v1 0/7] Implement support for external IPT monitoring
> 
> +Luwei, who developed PT for KVM and is the best one who can help
> review VMX changes from Intel side. Please include him in future post or
> discussion.
> 
> > -----Original Message-----
> > From: Michał Leszczyński <michal.leszczynski@cert.pl>
> > Sent: Wednesday, June 17, 2020 2:48 AM
> > To: Andrew Cooper <andrew.cooper3@citrix.com>
> > Cc: Xen-devel <xen-devel@lists.xenproject.org>; Jan Beulich
> > <jbeulich@suse.com>; Wei Liu <wl@xen.org>; Roger Pau Monné
> > <roger.pau@citrix.com>; Nakajima, Jun <jun.nakajima@intel.com>; Tian,
> > Kevin <kevin.tian@intel.com>; George Dunlap
> > <george.dunlap@citrix.com>; Ian Jackson <ian.jackson@eu.citrix.com>;
> > Julien Grall <julien@xen.org>; Stefano Stabellini
> > <sstabellini@kernel.org>
> > Subject: Re: [PATCH v1 0/7] Implement support for external IPT
> > monitoring
> >
> > ----- 16 cze 2020 o 20:17, Andrew Cooper andrew.cooper3@citrix.com
> > napisał(a):
> >
> > > On 16/06/2020 16:16, Michał Leszczyński wrote:
> > >> Intel Processor Trace is an architectural extension available in
> > >> modern
> > Intel
> > >> family CPUs. It allows recording the detailed trace of activity
> > >> while the processor executes the code. One might use the recorded
> > >> trace to
> > reconstruct
> > >> the code flow. It means, to find out the executed code paths,
> > >> determine branches taken, and so forth.
> > >>
> > >> The abovementioned feature is described in Intel(R) 64 and IA-32
> > Architectures
> > >> Software Developer's Manual Volume 3C: System Programming Guide,
> > Part 3,
> > >> Chapter 36: "Intel Processor Trace."
> > >>
> > >> This patch series implements an interface that Dom0 could use in
> > >> order to
> > enable
> > >> IPT for particular vCPUs in DomU, allowing for external monitoring.
> > >> Such a feature has numerous applications like malware monitoring,
> > >> fuzzing, or performance testing.
> > >
> > > Hello,
> > >
> > > I'm very excited to see support like this appearing.  However, be
> > > aware that we're currently in code freeze for the 4.14 release, so
> > > in-depth reviews will probably be delayed somewhat due to our bug
> > > queue and release activities.
> >
> > Sure, take your time :)
> >
> >
> > >
> > > That said, I've had a very quick look through the series, and have a
> > > few general questions first.
> > >
> > > AFAICT, this is strictly for external monitoring of the VM, not for
> > > the VM to use itself?  If so, it shouldn't have the H tag here:
> > >
> > > XEN_CPUFEATURE(IPT,           5*32+25) /*H  Intel Processor Trace */
> > >
> > > because that exposes the feature to the guest, with the implication
> > > that all other parts of the feature work as advertised.
> >
> > Ok, I will remove the H tag.
> >
> >
> > >
> > >
> > > Are there any restrictions on EPT being enabled in the first place?
> > > I'm not aware of any, and in principle we could use this
> > > functionality for PV guests as well (using the CPL filter).
> > > Therefore, I think it would be helpful to not tie the functionality
> > > to HVM guests, even if that is the only option enabled to start with.
> >
> > I think at the moment it's not required to have EPT. This patch series
> > doesn't use any translation feature flags, so the output address is
> > always a machine physical address, regardless of context. I will check
> > if it could be easily used with PV.
> >
> >
> > >
> > > The buffer mapping and creation logic is fairly problematic.
> > > Instead of fighting with another opencoded example, take a look at
> > > the IOREQ server's use of "acquire resource" which is a mapping
> > > interface which supports allocating memory on behalf of the guest,
> > > outside of the guest memory, for use by control tools.
> > >
> > > I think what this wants is a bit somewhere in domain_create to
> > > indicate that external tracing is used for this domain (and allocate
> > > whatever structures/buffers are necessary), acquire resource to map
> > > the buffers themselves, and a domctl for any necessary runtime controls.
> > >
> >
> > I will check this out, this sounds like a good option as it would
> > remove lots of complexity from the existing ipt_enable domctl.
> >
> > >
> > > What semantics do you want for the buffer becoming full?  Given that
> > > debugging/tracing is the goal, I presume "pause vcpu on full" is the
> > > preferred behaviour, rather than drop packets on full?
> > >
> >
> > Right now this is a ring-style buffer and when it would become full it
> > would simply wrap and override the old data.
> >
> > >
> > > When this subject was broached on xen-devel before, one issue was
> > > the fact that all actions which are intercepted don't end up writing
> > > any appropriate packets.  This is perhaps less of an issue for this
> > > example, where the external agent can see VMExits in the trace, but
> > > it still results in missing information.  (It is a major problem for
> > > PT within the guest, and needs Xen's intercept/emulation framework
> > > being updated to be PT-aware so it can fill in the same packets
> > > which hardware would have done for equivalent actions.)
> >
> > Ok, this sounds like a hard issue. Could you point out what could be
> > the particular problematic cases? For instance, if something would
> > alter EIP/RIP or CR3 then I belive it would still be recorded in PT
> > trace (i.e. these values will be logged on VM entry).

e.g. If a VM exit is taken on a guest write to CR3 (including “MOV CR3” as well as task switches), the PIP packet
normally generated on the CR3 write will be missing. The PIP packet needs to be written to the PT buffer by software. Another example is VM-exit taken on RDTSC. 

For VM introspection, all the Intel PT packets may need to emulated by software. Some description in SDM as below:
If a VMM emulates an element of processor state by taking a VM exit on reads and/or writes to that piece of state, and the state element impacts Intel PT packet generation or values, it may be incumbent upon the VMM to insert or modify the output trace data.

Thanks,
Luwei Kang

> >
> > >
> > >
> > > Thanks,
> > >
> > > ~Andrew
> >
> >
> > Best regards,
> > Michał Leszczyński
> > CERT Polska
Roger Pau Monne June 17, 2020, 9:21 a.m. UTC | #7
On Wed, Jun 17, 2020 at 06:45:22AM +0000, Kang, Luwei wrote:
> > -----Original Message-----
> > From: Tian, Kevin <kevin.tian@intel.com>
> > Sent: Wednesday, June 17, 2020 9:35 AM
> > To: Michał Leszczyński <michal.leszczynski@cert.pl>; Andrew Cooper
> > <andrew.cooper3@citrix.com>
> > Cc: Xen-devel <xen-devel@lists.xenproject.org>; Jan Beulich
> > <jbeulich@suse.com>; Wei Liu <wl@xen.org>; Roger Pau Monné
> > <roger.pau@citrix.com>; Nakajima, Jun <jun.nakajima@intel.com>; George
> > Dunlap <george.dunlap@citrix.com>; Ian Jackson <ian.jackson@eu.citrix.com>;
> > Julien Grall <julien@xen.org>; Stefano Stabellini <sstabellini@kernel.org>;
> > Kang, Luwei <luwei.kang@intel.com>
> > Subject: RE: [PATCH v1 0/7] Implement support for external IPT monitoring
> > 
> > +Luwei, who developed PT for KVM and is the best one who can help
> > review VMX changes from Intel side. Please include him in future post or
> > discussion.
> > 
> > > -----Original Message-----
> > > From: Michał Leszczyński <michal.leszczynski@cert.pl>
> > > Sent: Wednesday, June 17, 2020 2:48 AM
> > > To: Andrew Cooper <andrew.cooper3@citrix.com>
> > > Cc: Xen-devel <xen-devel@lists.xenproject.org>; Jan Beulich
> > > <jbeulich@suse.com>; Wei Liu <wl@xen.org>; Roger Pau Monné
> > > <roger.pau@citrix.com>; Nakajima, Jun <jun.nakajima@intel.com>; Tian,
> > > Kevin <kevin.tian@intel.com>; George Dunlap
> > > <george.dunlap@citrix.com>; Ian Jackson <ian.jackson@eu.citrix.com>;
> > > Julien Grall <julien@xen.org>; Stefano Stabellini
> > > <sstabellini@kernel.org>
> > > Subject: Re: [PATCH v1 0/7] Implement support for external IPT
> > > monitoring
> > >
> > > ----- 16 cze 2020 o 20:17, Andrew Cooper andrew.cooper3@citrix.com
> > > napisał(a):
> > >
> > > > On 16/06/2020 16:16, Michał Leszczyński wrote:
> > > > When this subject was broached on xen-devel before, one issue was
> > > > the fact that all actions which are intercepted don't end up writing
> > > > any appropriate packets.  This is perhaps less of an issue for this
> > > > example, where the external agent can see VMExits in the trace, but
> > > > it still results in missing information.  (It is a major problem for
> > > > PT within the guest, and needs Xen's intercept/emulation framework
> > > > being updated to be PT-aware so it can fill in the same packets
> > > > which hardware would have done for equivalent actions.)
> > >
> > > Ok, this sounds like a hard issue. Could you point out what could be
> > > the particular problematic cases? For instance, if something would
> > > alter EIP/RIP or CR3 then I belive it would still be recorded in PT
> > > trace (i.e. these values will be logged on VM entry).
> 
> e.g. If a VM exit is taken on a guest write to CR3 (including “MOV CR3” as well as task switches), the PIP packet
> normally generated on the CR3 write will be missing. The PIP packet needs to be written to the PT buffer by software. Another example is VM-exit taken on RDTSC. 
> 
> For VM introspection, all the Intel PT packets may need to emulated by software. Some description in SDM as below:
> If a VMM emulates an element of processor state by taking a VM exit on reads and/or writes to that piece of state, and the state element impacts Intel PT packet generation or values, it may be incumbent upon the VMM to insert or modify the output trace data.

I got the impression that IPT was mostly useful together with
introspection, as you can then get events from trapped instructions
(and likely emulated) from the introspection interface, while being
able to get the processor trace for non-trapped events.

I'm not sure whether there would be corner cases with trapped
instructions not being handled by the introspection framework.

How does KVM deal with this, do they insert/modify trace packets on
trapped and emulated instructions by the VMM?

Roger.
Luwei Kang June 17, 2020, 12:37 p.m. UTC | #8
> > > -----Original Message-----
> > > From: Tian, Kevin <kevin.tian@intel.com>
> > > Sent: Wednesday, June 17, 2020 9:35 AM
> > > To: Michał Leszczyński <michal.leszczynski@cert.pl>; Andrew Cooper
> > > <andrew.cooper3@citrix.com>
> > > Cc: Xen-devel <xen-devel@lists.xenproject.org>; Jan Beulich
> > > <jbeulich@suse.com>; Wei Liu <wl@xen.org>; Roger Pau Monné
> > > <roger.pau@citrix.com>; Nakajima, Jun <jun.nakajima@intel.com>;
> > > George Dunlap <george.dunlap@citrix.com>; Ian Jackson
> > > <ian.jackson@eu.citrix.com>; Julien Grall <julien@xen.org>; Stefano
> > > Stabellini <sstabellini@kernel.org>; Kang, Luwei
> > > <luwei.kang@intel.com>
> > > Subject: RE: [PATCH v1 0/7] Implement support for external IPT
> > > monitoring
> > >
> > > +Luwei, who developed PT for KVM and is the best one who can help
> > > review VMX changes from Intel side. Please include him in future
> > > post or discussion.
> > >
> > > > -----Original Message-----
> > > > From: Michał Leszczyński <michal.leszczynski@cert.pl>
> > > > Sent: Wednesday, June 17, 2020 2:48 AM
> > > > To: Andrew Cooper <andrew.cooper3@citrix.com>
> > > > Cc: Xen-devel <xen-devel@lists.xenproject.org>; Jan Beulich
> > > > <jbeulich@suse.com>; Wei Liu <wl@xen.org>; Roger Pau Monné
> > > > <roger.pau@citrix.com>; Nakajima, Jun <jun.nakajima@intel.com>;
> > > > Tian, Kevin <kevin.tian@intel.com>; George Dunlap
> > > > <george.dunlap@citrix.com>; Ian Jackson
> > > > <ian.jackson@eu.citrix.com>; Julien Grall <julien@xen.org>;
> > > > Stefano Stabellini <sstabellini@kernel.org>
> > > > Subject: Re: [PATCH v1 0/7] Implement support for external IPT
> > > > monitoring
> > > >
> > > > ----- 16 cze 2020 o 20:17, Andrew Cooper andrew.cooper3@citrix.com
> > > > napisał(a):
> > > >
> > > > > On 16/06/2020 16:16, Michał Leszczyński wrote:
> > > > > When this subject was broached on xen-devel before, one issue
> > > > > was the fact that all actions which are intercepted don't end up
> > > > > writing any appropriate packets.  This is perhaps less of an
> > > > > issue for this example, where the external agent can see VMExits
> > > > > in the trace, but it still results in missing information.  (It
> > > > > is a major problem for PT within the guest, and needs Xen's
> > > > > intercept/emulation framework being updated to be PT-aware so it
> > > > > can fill in the same packets which hardware would have done for
> > > > > equivalent actions.)
> > > >
> > > > Ok, this sounds like a hard issue. Could you point out what could
> > > > be the particular problematic cases? For instance, if something
> > > > would alter EIP/RIP or CR3 then I belive it would still be
> > > > recorded in PT trace (i.e. these values will be logged on VM entry).
> >
> > e.g. If a VM exit is taken on a guest write to CR3 (including “MOV
> > CR3” as well as task switches), the PIP packet normally generated on the CR3
> write will be missing. The PIP packet needs to be written to the PT buffer by
> software. Another example is VM-exit taken on RDTSC.
> >
> > For VM introspection, all the Intel PT packets may need to emulated by
> software. Some description in SDM as below:
> > If a VMM emulates an element of processor state by taking a VM exit on
> reads and/or writes to that piece of state, and the state element impacts Intel
> PT packet generation or values, it may be incumbent upon the VMM to insert
> or modify the output trace data.
> 
> I got the impression that IPT was mostly useful together with introspection, as
> you can then get events from trapped instructions (and likely emulated) from
> the introspection interface, while being able to get the processor trace for non-
> trapped events.
> 
> I'm not sure whether there would be corner cases with trapped instructions
> not being handled by the introspection framework.
> 
> How does KVM deal with this, do they insert/modify trace packets on trapped
> and emulated instructions by the VMM?

The KVM includes instruction decoder and emulator(arch/x86/kvm/emulate.c), and the guest's memory can be set to write-protect as well. But it doesn't support Intel PT packets software emulator. For KVM, the Intel PT feature will be exposed to KVM guest and KVM guest can use Intel PT feature like native.

Thanks,
Luwei Kang
Roger Pau Monne June 17, 2020, 12:53 p.m. UTC | #9
On Wed, Jun 17, 2020 at 12:37:13PM +0000, Kang, Luwei wrote:
> > How does KVM deal with this, do they insert/modify trace packets on trapped
> > and emulated instructions by the VMM?
> 
> The KVM includes instruction decoder and emulator(arch/x86/kvm/emulate.c), and the guest's memory can be set to write-protect as well. But it doesn't support Intel PT packets software emulator. For KVM, the Intel PT feature will be exposed to KVM guest and KVM guest can use Intel PT feature like native.

But if such feature is exposed to the guest for it's own usage, won't
it be missing packets for instructions emulated by the VMM?

Thanks, Roger.
Andrew Cooper June 17, 2020, 4:19 p.m. UTC | #10
On 17/06/2020 04:02, Tamas K Lengyel wrote:
> On Tue, Jun 16, 2020 at 2:17 PM Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>> On 16/06/2020 19:47, Michał Leszczyński wrote:
>>> ----- 16 cze 2020 o 20:17, Andrew Cooper andrew.cooper3@citrix.com napisał(a):
>>>
>>>> Are there any restrictions on EPT being enabled in the first place?  I'm
>>>> not aware of any, and in principle we could use this functionality for
>>>> PV guests as well (using the CPL filter).  Therefore, I think it would
>>>> be helpful to not tie the functionality to HVM guests, even if that is
>>>> the only option enabled to start with.
>>> I think at the moment it's not required to have EPT. This patch series doesn't use any translation feature flags, so the output address is always a machine physical address, regardless of context. I will check if it could be easily used with PV.
>> If its trivial to add PV support then please do.  If its not, then don't
>> feel obliged, but please do at least consider how PV support might look
>> in the eventual feature.
>>
>> (Generally speaking, considering "how would I make this work in other
>> modes where it is possible" leads to a better design.)
>>
>>>> The buffer mapping and creation logic is fairly problematic.  Instead of
>>>> fighting with another opencoded example, take a look at the IOREQ
>>>> server's use of "acquire resource" which is a mapping interface which
>>>> supports allocating memory on behalf of the guest, outside of the guest
>>>> memory, for use by control tools.
>>>>
>>>> I think what this wants is a bit somewhere in domain_create to indicate
>>>> that external tracing is used for this domain (and allocate whatever
>>>> structures/buffers are necessary), acquire resource to map the buffers
>>>> themselves, and a domctl for any necessary runtime controls.
>>>>
>>> I will check this out, this sounds like a good option as it would remove lots of complexity from the existing ipt_enable domctl.
>> Xen has traditionally opted for a "and turn this extra thing on
>> dynamically" model, but this has caused no end of security issues and
>> broken corner cases.
>>
>> You can see this still existing in the difference between
>> XEN_DOMCTL_createdomain and XEN_DOMCTL_max_vcpus, (the latter being
>> required to chose the number of vcpus for the domain) and we're making
>> good progress undoing this particular wart (before 4.13, it was
>> concerning easy to get Xen to fall over a NULL d->vcpu[] pointer by
>> issuing other hypercalls between these two).
>>
>> There is a lot of settings which should be immutable for the lifetime of
>> the domain, and external monitoring looks like another one of these.
>> Specifying it at createdomain time allows for far better runtime
>> behaviour (you are no longer in a situation where the first time you try
>> to turn tracing on, you end up with -ENOMEM because another VM booted in
>> the meantime and used the remaining memory), and it makes for rather
>> more simple code in Xen itself (at runtime, you can rely on it having
>> been set up properly, because a failure setting up will have killed the
>> domain already).
> I'm not in favor of this being a flag that gets set during domain
> creation time. It could certainly be the case that some users would
> want this being on from the start till the end but in other cases you
> may want to enable it intermittently only for some time in-between
> particular events. If it's an on/off flag during domain creation you
> pretty much force that choice on the users and while the overhead of
> PT is better than say MTF it's certainly not nothing. In case there is
> an OOM situation enabling IPT dynamically the user can always just
> pause the VM and wait till memory becomes available.

There is nothing wrong with having "turn tracing on/off at runtime"
hypercalls.  It is specifically what I suggested two posts up in this
thread, but it should be limited to the TraceEn bit in RTIT_CTL.

What isn't ok is trying to allocate the buffers, write the TOPA, etc on
first-enable or first-map, because the runtime complexity of logic like
this large, and far too easy to get wrong in security relevant ways.

The domain create flag would mean "I wish to use tracing with this
domain", and not "I want tracing enabled from the getgo".

>>>> What semantics do you want for the buffer becoming full?  Given that
>>>> debugging/tracing is the goal, I presume "pause vcpu on full" is the
>>>> preferred behaviour, rather than drop packets on full?
>>>>
>>> Right now this is a ring-style buffer and when it would become full it would simply wrap and override the old data.
>> How does the consumer spot that the data has wrapped?  What happens if
>> data starts getting logged, but noone is listening?  What happens if the
>> consumer exits/crashes/etc and stops listening as a consequence?
>>
>> It's fine to simply state what will happen, and possibly even "don't do
>> that then", but the corner cases do at least need thinking about.
> AFAIU the current use-case is predominantly to be used in conjunction
> with VMI events where you want to be able to see the trace leading up
> to a particular vmexit. So in the case when the buffer is wrapped
> in-between events and data is lost that's not really of concern.

That's all fine.  I imagine the output here is voluminous, and needs
help being cut down as much as possible.

On a tangent, I presume you'd like to include VM-fork eventually, which
ought to include copying the trace buffer on fork?

~Andrew
Tamas K Lengyel June 17, 2020, 4:27 p.m. UTC | #11
On Wed, Jun 17, 2020 at 10:19 AM Andrew Cooper
<andrew.cooper3@citrix.com> wrote:
>
> On 17/06/2020 04:02, Tamas K Lengyel wrote:
> > On Tue, Jun 16, 2020 at 2:17 PM Andrew Cooper <andrew.cooper3@citrix.com> wrote:
> >> On 16/06/2020 19:47, Michał Leszczyński wrote:
> >>> ----- 16 cze 2020 o 20:17, Andrew Cooper andrew.cooper3@citrix.com napisał(a):
> >>>
> >>>> Are there any restrictions on EPT being enabled in the first place?  I'm
> >>>> not aware of any, and in principle we could use this functionality for
> >>>> PV guests as well (using the CPL filter).  Therefore, I think it would
> >>>> be helpful to not tie the functionality to HVM guests, even if that is
> >>>> the only option enabled to start with.
> >>> I think at the moment it's not required to have EPT. This patch series doesn't use any translation feature flags, so the output address is always a machine physical address, regardless of context. I will check if it could be easily used with PV.
> >> If its trivial to add PV support then please do.  If its not, then don't
> >> feel obliged, but please do at least consider how PV support might look
> >> in the eventual feature.
> >>
> >> (Generally speaking, considering "how would I make this work in other
> >> modes where it is possible" leads to a better design.)
> >>
> >>>> The buffer mapping and creation logic is fairly problematic.  Instead of
> >>>> fighting with another opencoded example, take a look at the IOREQ
> >>>> server's use of "acquire resource" which is a mapping interface which
> >>>> supports allocating memory on behalf of the guest, outside of the guest
> >>>> memory, for use by control tools.
> >>>>
> >>>> I think what this wants is a bit somewhere in domain_create to indicate
> >>>> that external tracing is used for this domain (and allocate whatever
> >>>> structures/buffers are necessary), acquire resource to map the buffers
> >>>> themselves, and a domctl for any necessary runtime controls.
> >>>>
> >>> I will check this out, this sounds like a good option as it would remove lots of complexity from the existing ipt_enable domctl.
> >> Xen has traditionally opted for a "and turn this extra thing on
> >> dynamically" model, but this has caused no end of security issues and
> >> broken corner cases.
> >>
> >> You can see this still existing in the difference between
> >> XEN_DOMCTL_createdomain and XEN_DOMCTL_max_vcpus, (the latter being
> >> required to chose the number of vcpus for the domain) and we're making
> >> good progress undoing this particular wart (before 4.13, it was
> >> concerning easy to get Xen to fall over a NULL d->vcpu[] pointer by
> >> issuing other hypercalls between these two).
> >>
> >> There is a lot of settings which should be immutable for the lifetime of
> >> the domain, and external monitoring looks like another one of these.
> >> Specifying it at createdomain time allows for far better runtime
> >> behaviour (you are no longer in a situation where the first time you try
> >> to turn tracing on, you end up with -ENOMEM because another VM booted in
> >> the meantime and used the remaining memory), and it makes for rather
> >> more simple code in Xen itself (at runtime, you can rely on it having
> >> been set up properly, because a failure setting up will have killed the
> >> domain already).
> > I'm not in favor of this being a flag that gets set during domain
> > creation time. It could certainly be the case that some users would
> > want this being on from the start till the end but in other cases you
> > may want to enable it intermittently only for some time in-between
> > particular events. If it's an on/off flag during domain creation you
> > pretty much force that choice on the users and while the overhead of
> > PT is better than say MTF it's certainly not nothing. In case there is
> > an OOM situation enabling IPT dynamically the user can always just
> > pause the VM and wait till memory becomes available.
>
> There is nothing wrong with having "turn tracing on/off at runtime"
> hypercalls.  It is specifically what I suggested two posts up in this
> thread, but it should be limited to the TraceEn bit in RTIT_CTL.
>
> What isn't ok is trying to allocate the buffers, write the TOPA, etc on
> first-enable or first-map, because the runtime complexity of logic like
> this large, and far too easy to get wrong in security relevant ways.
>
> The domain create flag would mean "I wish to use tracing with this
> domain", and not "I want tracing enabled from the getgo".

Gotcha, that's reasonable.

>
> >>>> What semantics do you want for the buffer becoming full?  Given that
> >>>> debugging/tracing is the goal, I presume "pause vcpu on full" is the
> >>>> preferred behaviour, rather than drop packets on full?
> >>>>
> >>> Right now this is a ring-style buffer and when it would become full it would simply wrap and override the old data.
> >> How does the consumer spot that the data has wrapped?  What happens if
> >> data starts getting logged, but noone is listening?  What happens if the
> >> consumer exits/crashes/etc and stops listening as a consequence?
> >>
> >> It's fine to simply state what will happen, and possibly even "don't do
> >> that then", but the corner cases do at least need thinking about.
> > AFAIU the current use-case is predominantly to be used in conjunction
> > with VMI events where you want to be able to see the trace leading up
> > to a particular vmexit. So in the case when the buffer is wrapped
> > in-between events and data is lost that's not really of concern.
>
> That's all fine.  I imagine the output here is voluminous, and needs
> help being cut down as much as possible.
>
> On a tangent, I presume you'd like to include VM-fork eventually, which
> ought to include copying the trace buffer on fork?

I would eventually like to use it to reconstruct the branch history so
we can update AFL's coverage map with that instead of having to do the
current breakpoint-singlestep dance. But for that I would only care
about the trace starting after the fork, so copying the parent's PT
buffer is not needed. We'll also probably only use PT if the branch
history is larger than what LBR can hold. I asked Michal to name the
hypercall interface "vmtrace" for this reason so we can add other
stuff like LBR later using the same interface (which I already
implemented in https://github.com/tklengyel/xen/commits/lbr).

Tamas
Andrew Cooper June 17, 2020, 5:23 p.m. UTC | #12
On 17/06/2020 17:27, Tamas K Lengyel wrote:
>>>>>> What semantics do you want for the buffer becoming full?  Given that
>>>>>> debugging/tracing is the goal, I presume "pause vcpu on full" is the
>>>>>> preferred behaviour, rather than drop packets on full?
>>>>>>
>>>>> Right now this is a ring-style buffer and when it would become full it would simply wrap and override the old data.
>>>> How does the consumer spot that the data has wrapped?  What happens if
>>>> data starts getting logged, but noone is listening?  What happens if the
>>>> consumer exits/crashes/etc and stops listening as a consequence?
>>>>
>>>> It's fine to simply state what will happen, and possibly even "don't do
>>>> that then", but the corner cases do at least need thinking about.
>>> AFAIU the current use-case is predominantly to be used in conjunction
>>> with VMI events where you want to be able to see the trace leading up
>>> to a particular vmexit. So in the case when the buffer is wrapped
>>> in-between events and data is lost that's not really of concern.
>> That's all fine.  I imagine the output here is voluminous, and needs
>> help being cut down as much as possible.
>>
>> On a tangent, I presume you'd like to include VM-fork eventually, which
>> ought to include copying the trace buffer on fork?
> I would eventually like to use it to reconstruct the branch history so
> we can update AFL's coverage map with that instead of having to do the
> current breakpoint-singlestep dance. But for that I would only care
> about the trace starting after the fork, so copying the parent's PT
> buffer is not needed. We'll also probably only use PT if the branch
> history is larger than what LBR can hold. I asked Michal to name the
> hypercall interface "vmtrace" for this reason so we can add other
> stuff like LBR later using the same interface (which I already
> implemented in https://github.com/tklengyel/xen/commits/lbr).

I was wondering when someone was going to want LBR data like this. 
Can't you borrow the LBR-stitching tricks from Linux's perf to recover
the call trace even when its deeper than the LBR stack?

What about PEBS?  ISTR there is a fairly complicated matrix of which
features work in combination.


As for naming, we should definitely have something fairly generic. 
AFAICT, it would be applicable to ARM's CoreSight facilities as well.

~Andrew
Michał Leszczyński June 17, 2020, 7:30 p.m. UTC | #13
----- 17 cze 2020 o 18:27, Tamas K Lengyel tamas.k.lengyel@gmail.com napisał(a):

> On Wed, Jun 17, 2020 at 10:19 AM Andrew Cooper
> <andrew.cooper3@citrix.com> wrote:
>>
>> On 17/06/2020 04:02, Tamas K Lengyel wrote:
>> > On Tue, Jun 16, 2020 at 2:17 PM Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>> >> On 16/06/2020 19:47, Michał Leszczyński wrote:
>> >>> ----- 16 cze 2020 o 20:17, Andrew Cooper andrew.cooper3@citrix.com napisał(a):
>> >>>
>> >>>> Are there any restrictions on EPT being enabled in the first place?  I'm
>> >>>> not aware of any, and in principle we could use this functionality for
>> >>>> PV guests as well (using the CPL filter).  Therefore, I think it would
>> >>>> be helpful to not tie the functionality to HVM guests, even if that is
>> >>>> the only option enabled to start with.
>> >>> I think at the moment it's not required to have EPT. This patch series doesn't
>> >>> use any translation feature flags, so the output address is always a machine
>> >>> physical address, regardless of context. I will check if it could be easily
>> >>> used with PV.
>> >> If its trivial to add PV support then please do.  If its not, then don't
>> >> feel obliged, but please do at least consider how PV support might look
>> >> in the eventual feature.
>> >>
>> >> (Generally speaking, considering "how would I make this work in other
>> >> modes where it is possible" leads to a better design.)
>> >>
>> >>>> The buffer mapping and creation logic is fairly problematic.  Instead of
>> >>>> fighting with another opencoded example, take a look at the IOREQ
>> >>>> server's use of "acquire resource" which is a mapping interface which
>> >>>> supports allocating memory on behalf of the guest, outside of the guest
>> >>>> memory, for use by control tools.
>> >>>>
>> >>>> I think what this wants is a bit somewhere in domain_create to indicate
>> >>>> that external tracing is used for this domain (and allocate whatever
>> >>>> structures/buffers are necessary), acquire resource to map the buffers
>> >>>> themselves, and a domctl for any necessary runtime controls.
>> >>>>
>> >>> I will check this out, this sounds like a good option as it would remove lots of
>> >>> complexity from the existing ipt_enable domctl.
>> >> Xen has traditionally opted for a "and turn this extra thing on
>> >> dynamically" model, but this has caused no end of security issues and
>> >> broken corner cases.
>> >>
>> >> You can see this still existing in the difference between
>> >> XEN_DOMCTL_createdomain and XEN_DOMCTL_max_vcpus, (the latter being
>> >> required to chose the number of vcpus for the domain) and we're making
>> >> good progress undoing this particular wart (before 4.13, it was
>> >> concerning easy to get Xen to fall over a NULL d->vcpu[] pointer by
>> >> issuing other hypercalls between these two).
>> >>
>> >> There is a lot of settings which should be immutable for the lifetime of
>> >> the domain, and external monitoring looks like another one of these.
>> >> Specifying it at createdomain time allows for far better runtime
>> >> behaviour (you are no longer in a situation where the first time you try
>> >> to turn tracing on, you end up with -ENOMEM because another VM booted in
>> >> the meantime and used the remaining memory), and it makes for rather
>> >> more simple code in Xen itself (at runtime, you can rely on it having
>> >> been set up properly, because a failure setting up will have killed the
>> >> domain already).
>> > I'm not in favor of this being a flag that gets set during domain
>> > creation time. It could certainly be the case that some users would
>> > want this being on from the start till the end but in other cases you
>> > may want to enable it intermittently only for some time in-between
>> > particular events. If it's an on/off flag during domain creation you
>> > pretty much force that choice on the users and while the overhead of
>> > PT is better than say MTF it's certainly not nothing. In case there is
>> > an OOM situation enabling IPT dynamically the user can always just
>> > pause the VM and wait till memory becomes available.
>>
>> There is nothing wrong with having "turn tracing on/off at runtime"
>> hypercalls.  It is specifically what I suggested two posts up in this
>> thread, but it should be limited to the TraceEn bit in RTIT_CTL.
>>
>> What isn't ok is trying to allocate the buffers, write the TOPA, etc on
>> first-enable or first-map, because the runtime complexity of logic like
>> this large, and far too easy to get wrong in security relevant ways.
>>
>> The domain create flag would mean "I wish to use tracing with this
>> domain", and not "I want tracing enabled from the getgo".
> 
> Gotcha, that's reasonable.
> 


I think I also agree with this, i.e. to alloc buffers on domain creation and just enable/disable the feature in runtime. This would remove some complexity from runtime. I think it's usually (always?) known in advance whether we would like to use external monitoring on a domain or not.

I will try to adapt this approach in patch v2.


>>
>> >>>> What semantics do you want for the buffer becoming full?  Given that
>> >>>> debugging/tracing is the goal, I presume "pause vcpu on full" is the
>> >>>> preferred behaviour, rather than drop packets on full?
>> >>>>
>> >>> Right now this is a ring-style buffer and when it would become full it would
>> >>> simply wrap and override the old data.
>> >> How does the consumer spot that the data has wrapped?  What happens if
>> >> data starts getting logged, but noone is listening?  What happens if the
>> >> consumer exits/crashes/etc and stops listening as a consequence?
>> >>
>> >> It's fine to simply state what will happen, and possibly even "don't do
>> >> that then", but the corner cases do at least need thinking about.
>> > AFAIU the current use-case is predominantly to be used in conjunction
>> > with VMI events where you want to be able to see the trace leading up
>> > to a particular vmexit. So in the case when the buffer is wrapped
>> > in-between events and data is lost that's not really of concern.
>>
>> That's all fine.  I imagine the output here is voluminous, and needs
>> help being cut down as much as possible.
>>
>> On a tangent, I presume you'd like to include VM-fork eventually, which
>> ought to include copying the trace buffer on fork?
> 
> I would eventually like to use it to reconstruct the branch history so
> we can update AFL's coverage map with that instead of having to do the
> current breakpoint-singlestep dance. But for that I would only care
> about the trace starting after the fork, so copying the parent's PT
> buffer is not needed. We'll also probably only use PT if the branch
> history is larger than what LBR can hold. I asked Michal to name the
> hypercall interface "vmtrace" for this reason so we can add other
> stuff like LBR later using the same interface (which I already
> implemented in https://github.com/tklengyel/xen/commits/lbr).
> 
> Tamas
Tamas K Lengyel June 17, 2020, 7:31 p.m. UTC | #14
On Wed, Jun 17, 2020 at 11:23 AM Andrew Cooper
<andrew.cooper3@citrix.com> wrote:
>
> On 17/06/2020 17:27, Tamas K Lengyel wrote:
> >>>>>> What semantics do you want for the buffer becoming full?  Given that
> >>>>>> debugging/tracing is the goal, I presume "pause vcpu on full" is the
> >>>>>> preferred behaviour, rather than drop packets on full?
> >>>>>>
> >>>>> Right now this is a ring-style buffer and when it would become full it would simply wrap and override the old data.
> >>>> How does the consumer spot that the data has wrapped?  What happens if
> >>>> data starts getting logged, but noone is listening?  What happens if the
> >>>> consumer exits/crashes/etc and stops listening as a consequence?
> >>>>
> >>>> It's fine to simply state what will happen, and possibly even "don't do
> >>>> that then", but the corner cases do at least need thinking about.
> >>> AFAIU the current use-case is predominantly to be used in conjunction
> >>> with VMI events where you want to be able to see the trace leading up
> >>> to a particular vmexit. So in the case when the buffer is wrapped
> >>> in-between events and data is lost that's not really of concern.
> >> That's all fine.  I imagine the output here is voluminous, and needs
> >> help being cut down as much as possible.
> >>
> >> On a tangent, I presume you'd like to include VM-fork eventually, which
> >> ought to include copying the trace buffer on fork?
> > I would eventually like to use it to reconstruct the branch history so
> > we can update AFL's coverage map with that instead of having to do the
> > current breakpoint-singlestep dance. But for that I would only care
> > about the trace starting after the fork, so copying the parent's PT
> > buffer is not needed. We'll also probably only use PT if the branch
> > history is larger than what LBR can hold. I asked Michal to name the
> > hypercall interface "vmtrace" for this reason so we can add other
> > stuff like LBR later using the same interface (which I already
> > implemented in https://github.com/tklengyel/xen/commits/lbr).
>
> I was wondering when someone was going to want LBR data like this.
> Can't you borrow the LBR-stitching tricks from Linux's perf to recover
> the call trace even when its deeper than the LBR stack?

TBH I only spent like an hour putting it together so I haven't
investigated the topic too much. But thanks for the tip, first I heard
about this LBR-stitching trick ;)

>
> What about PEBS?  ISTR there is a fairly complicated matrix of which
> features work in combination.

There is also BTS.. I would assume it would take some experimentation
to figure out what works and when and in what combination. Right now I
have no plans for doing that experimentation or adding support for
additional tracers.

>
>
> As for naming, we should definitely have something fairly generic.
> AFAICT, it would be applicable to ARM's CoreSight facilities as well.

IMHO XEN_DOMCTL_vmtrace would be a good name for controlling these features.

Tamas
Michał Leszczyński June 17, 2020, 8:20 p.m. UTC | #15
----- 17 cze 2020 o 18:19, Andrew Cooper andrew.cooper3@citrix.com napisał(a):

> On 17/06/2020 04:02, Tamas K Lengyel wrote:
>> On Tue, Jun 16, 2020 at 2:17 PM Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>>> On 16/06/2020 19:47, Michał Leszczyński wrote:
>>>> ----- 16 cze 2020 o 20:17, Andrew Cooper andrew.cooper3@citrix.com napisał(a):
>>>>
>>>>> Are there any restrictions on EPT being enabled in the first place?  I'm
>>>>> not aware of any, and in principle we could use this functionality for
>>>>> PV guests as well (using the CPL filter).  Therefore, I think it would
>>>>> be helpful to not tie the functionality to HVM guests, even if that is
>>>>> the only option enabled to start with.
>>>> I think at the moment it's not required to have EPT. This patch series doesn't
>>>> use any translation feature flags, so the output address is always a machine
>>>> physical address, regardless of context. I will check if it could be easily
>>>> used with PV.
>>> If its trivial to add PV support then please do.  If its not, then don't
>>> feel obliged, but please do at least consider how PV support might look
>>> in the eventual feature.
>>>
>>> (Generally speaking, considering "how would I make this work in other
>>> modes where it is possible" leads to a better design.)
>>>
>>>>> The buffer mapping and creation logic is fairly problematic.  Instead of
>>>>> fighting with another opencoded example, take a look at the IOREQ
>>>>> server's use of "acquire resource" which is a mapping interface which
>>>>> supports allocating memory on behalf of the guest, outside of the guest
>>>>> memory, for use by control tools.
>>>>>


One thing that remains unclear to me is the "acquire resource" part. Could you give some more details on that?

Assuming that buffers are allocated right from the domain creation, what mechanism (instead of xc_map_foreign_range) should I use to map the IPT buffers into Dom0?


>>>>> I think what this wants is a bit somewhere in domain_create to indicate
>>>>> that external tracing is used for this domain (and allocate whatever
>>>>> structures/buffers are necessary), acquire resource to map the buffers
>>>>> themselves, and a domctl for any necessary runtime controls.
>>>>>
>>>> I will check this out, this sounds like a good option as it would remove lots of
>>>> complexity from the existing ipt_enable domctl.
>>> Xen has traditionally opted for a "and turn this extra thing on
>>> dynamically" model, but this has caused no end of security issues and
>>> broken corner cases.
>>>
>>> You can see this still existing in the difference between
>>> XEN_DOMCTL_createdomain and XEN_DOMCTL_max_vcpus, (the latter being
>>> required to chose the number of vcpus for the domain) and we're making
>>> good progress undoing this particular wart (before 4.13, it was
>>> concerning easy to get Xen to fall over a NULL d->vcpu[] pointer by
>>> issuing other hypercalls between these two).
>>>
>>> There is a lot of settings which should be immutable for the lifetime of
>>> the domain, and external monitoring looks like another one of these.
>>> Specifying it at createdomain time allows for far better runtime
>>> behaviour (you are no longer in a situation where the first time you try
>>> to turn tracing on, you end up with -ENOMEM because another VM booted in
>>> the meantime and used the remaining memory), and it makes for rather
>>> more simple code in Xen itself (at runtime, you can rely on it having
>>> been set up properly, because a failure setting up will have killed the
>>> domain already).
>> I'm not in favor of this being a flag that gets set during domain
>> creation time. It could certainly be the case that some users would
>> want this being on from the start till the end but in other cases you
>> may want to enable it intermittently only for some time in-between
>> particular events. If it's an on/off flag during domain creation you
>> pretty much force that choice on the users and while the overhead of
>> PT is better than say MTF it's certainly not nothing. In case there is
>> an OOM situation enabling IPT dynamically the user can always just
>> pause the VM and wait till memory becomes available.
> 
> There is nothing wrong with having "turn tracing on/off at runtime"
> hypercalls.  It is specifically what I suggested two posts up in this
> thread, but it should be limited to the TraceEn bit in RTIT_CTL.
> 
> What isn't ok is trying to allocate the buffers, write the TOPA, etc on
> first-enable or first-map, because the runtime complexity of logic like
> this large, and far too easy to get wrong in security relevant ways.
> 
> The domain create flag would mean "I wish to use tracing with this
> domain", and not "I want tracing enabled from the getgo".
> 
>>>>> What semantics do you want for the buffer becoming full?  Given that
>>>>> debugging/tracing is the goal, I presume "pause vcpu on full" is the
>>>>> preferred behaviour, rather than drop packets on full?
>>>>>
>>>> Right now this is a ring-style buffer and when it would become full it would
>>>> simply wrap and override the old data.
>>> How does the consumer spot that the data has wrapped?  What happens if
>>> data starts getting logged, but noone is listening?  What happens if the
>>> consumer exits/crashes/etc and stops listening as a consequence?
>>>
>>> It's fine to simply state what will happen, and possibly even "don't do
>>> that then", but the corner cases do at least need thinking about.
>> AFAIU the current use-case is predominantly to be used in conjunction
>> with VMI events where you want to be able to see the trace leading up
>> to a particular vmexit. So in the case when the buffer is wrapped
>> in-between events and data is lost that's not really of concern.
> 
> That's all fine.  I imagine the output here is voluminous, and needs
> help being cut down as much as possible.
> 
> On a tangent, I presume you'd like to include VM-fork eventually, which
> ought to include copying the trace buffer on fork?
> 
> ~Andrew
Luwei Kang June 17, 2020, 11:29 p.m. UTC | #16
> > > How does KVM deal with this, do they insert/modify trace packets on
> > > trapped and emulated instructions by the VMM?
> >
> > The KVM includes instruction decoder and
> emulator(arch/x86/kvm/emulate.c), and the guest's memory can be set to
> write-protect as well. But it doesn't support Intel PT packets software emulator.
> For KVM, the Intel PT feature will be exposed to KVM guest and KVM guest can
> use Intel PT feature like native.
> 
> But if such feature is exposed to the guest for it's own usage, won't it be
> missing packets for instructions emulated by the VMM?

If setting the guest's memory write-protect, I think yes. 

Thanks,
Luwei Kang

> 
> Thanks, Roger.
Michał Leszczyński June 18, 2020, 12:56 a.m. UTC | #17
----- 18 cze 2020 o 1:29, Kang, Luwei luwei.kang@intel.com napisał(a):

>> > > How does KVM deal with this, do they insert/modify trace packets on
>> > > trapped and emulated instructions by the VMM?
>> >
>> > The KVM includes instruction decoder and
>> emulator(arch/x86/kvm/emulate.c), and the guest's memory can be set to
>> write-protect as well. But it doesn't support Intel PT packets software
>> emulator.
>> For KVM, the Intel PT feature will be exposed to KVM guest and KVM guest can
>> use Intel PT feature like native.
>> 
>> But if such feature is exposed to the guest for it's own usage, won't it be
>> missing packets for instructions emulated by the VMM?
> 
> If setting the guest's memory write-protect, I think yes.


Thus, I propose to leave it as it is right now. If somebody is purposely altering the VM state then he/she should consult not only the IPT but also understand what was done "in the meantime" by additional features, e.g. when something was altered by vm_event callback. As Tamas said previously, we usually just want to see certain path leading to vmexit.

Please also note that there is a PTWRITE instruction that could be used in the future in order to add custom payloads/hints to the PT trace, when needed.


> 
> Thanks,
> Luwei Kang
> 
>> 
> > Thanks, Roger.
Roger Pau Monne June 18, 2020, 7 a.m. UTC | #18
On Thu, Jun 18, 2020 at 02:56:17AM +0200, Michał Leszczyński wrote:
> ----- 18 cze 2020 o 1:29, Kang, Luwei luwei.kang@intel.com napisał(a):
> 
> >> > > How does KVM deal with this, do they insert/modify trace packets on
> >> > > trapped and emulated instructions by the VMM?
> >> >
> >> > The KVM includes instruction decoder and
> >> emulator(arch/x86/kvm/emulate.c), and the guest's memory can be set to
> >> write-protect as well. But it doesn't support Intel PT packets software
> >> emulator.
> >> For KVM, the Intel PT feature will be exposed to KVM guest and KVM guest can
> >> use Intel PT feature like native.
> >> 
> >> But if such feature is exposed to the guest for it's own usage, won't it be
> >> missing packets for instructions emulated by the VMM?
> > 
> > If setting the guest's memory write-protect, I think yes.
> 
> 
> Thus, I propose to leave it as it is right now. If somebody is purposely altering the VM state then he/she should consult not only the IPT but also understand what was done "in the meantime" by additional features, e.g. when something was altered by vm_event callback. As Tamas said previously, we usually just want to see certain path leading to vmexit.
> 
> Please also note that there is a PTWRITE instruction that could be used in the future in order to add custom payloads/hints to the PT trace, when needed.

Yes, I think the usage of IPT by a third party against a guest is
fine, as such third party can also use introspection and get the
information about the emulated instructions.

OTOH exposing the feature to the guest itself for it's own usage seems
wrong without adding the packets related to the instructions emulated.

I understand the current series only cares about the first option, so
that's perfectly fine.

Roger.
Roger Pau Monne June 18, 2020, 8:25 a.m. UTC | #19
On Wed, Jun 17, 2020 at 10:20:20PM +0200, Michał Leszczyński wrote:
> ----- 17 cze 2020 o 18:19, Andrew Cooper andrew.cooper3@citrix.com napisał(a):
> 
> > On 17/06/2020 04:02, Tamas K Lengyel wrote:
> >> On Tue, Jun 16, 2020 at 2:17 PM Andrew Cooper <andrew.cooper3@citrix.com> wrote:
> >>> On 16/06/2020 19:47, Michał Leszczyński wrote:
> >>>> ----- 16 cze 2020 o 20:17, Andrew Cooper andrew.cooper3@citrix.com napisał(a):
> >>>>
> >>>>> Are there any restrictions on EPT being enabled in the first place?  I'm
> >>>>> not aware of any, and in principle we could use this functionality for
> >>>>> PV guests as well (using the CPL filter).  Therefore, I think it would
> >>>>> be helpful to not tie the functionality to HVM guests, even if that is
> >>>>> the only option enabled to start with.
> >>>> I think at the moment it's not required to have EPT. This patch series doesn't
> >>>> use any translation feature flags, so the output address is always a machine
> >>>> physical address, regardless of context. I will check if it could be easily
> >>>> used with PV.
> >>> If its trivial to add PV support then please do.  If its not, then don't
> >>> feel obliged, but please do at least consider how PV support might look
> >>> in the eventual feature.
> >>>
> >>> (Generally speaking, considering "how would I make this work in other
> >>> modes where it is possible" leads to a better design.)
> >>>
> >>>>> The buffer mapping and creation logic is fairly problematic.  Instead of
> >>>>> fighting with another opencoded example, take a look at the IOREQ
> >>>>> server's use of "acquire resource" which is a mapping interface which
> >>>>> supports allocating memory on behalf of the guest, outside of the guest
> >>>>> memory, for use by control tools.
> >>>>>
> 
> 
> One thing that remains unclear to me is the "acquire resource" part. Could you give some more details on that?
> 
> Assuming that buffers are allocated right from the domain creation, what mechanism (instead of xc_map_foreign_range) should I use to map the IPT buffers into Dom0?

Take a look at demu's demu_initialize function [0] (and it's usage of
xenforeignmemory_map_resource), you likely need something similar for
the trace buffers, introducing a new XENMEM_resource_trace_data kind
of resource (naming subject to change), and use the id field in
xen_mem_acquire_resource to signal which vCPU buffer you want to
map.

That's usable by both PV and HVM guests.

Roger.

[0] http://xenbits.xen.org/gitweb/?p=people/pauldu/demu.git;a=blob;f=demu.c;h=f785b394d0cf141dffa05bdddecf338214358aea;hb=refs/heads/master#l453
Michał Leszczyński June 18, 2020, 2:59 p.m. UTC | #20
----- 17 cze 2020 o 18:19, Andrew Cooper andrew.cooper3@citrix.com napisał(a):

> On 17/06/2020 04:02, Tamas K Lengyel wrote:
>> On Tue, Jun 16, 2020 at 2:17 PM Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>>> On 16/06/2020 19:47, Michał Leszczyński wrote:
>>>> ----- 16 cze 2020 o 20:17, Andrew Cooper andrew.cooper3@citrix.com napisał(a):
>>>>
>>>>> Are there any restrictions on EPT being enabled in the first place?  I'm
>>>>> not aware of any, and in principle we could use this functionality for
>>>>> PV guests as well (using the CPL filter).  Therefore, I think it would
>>>>> be helpful to not tie the functionality to HVM guests, even if that is
>>>>> the only option enabled to start with.
>>>> I think at the moment it's not required to have EPT. This patch series doesn't
>>>> use any translation feature flags, so the output address is always a machine
>>>> physical address, regardless of context. I will check if it could be easily
>>>> used with PV.
>>> If its trivial to add PV support then please do.  If its not, then don't
>>> feel obliged, but please do at least consider how PV support might look
>>> in the eventual feature.
>>>
>>> (Generally speaking, considering "how would I make this work in other
>>> modes where it is possible" leads to a better design.)
>>>
>>>>> The buffer mapping and creation logic is fairly problematic.  Instead of
>>>>> fighting with another opencoded example, take a look at the IOREQ
>>>>> server's use of "acquire resource" which is a mapping interface which
>>>>> supports allocating memory on behalf of the guest, outside of the guest
>>>>> memory, for use by control tools.
>>>>>
>>>>> I think what this wants is a bit somewhere in domain_create to indicate
>>>>> that external tracing is used for this domain (and allocate whatever
>>>>> structures/buffers are necessary), acquire resource to map the buffers
>>>>> themselves, and a domctl for any necessary runtime controls.
>>>>>
>>>> I will check this out, this sounds like a good option as it would remove lots of
>>>> complexity from the existing ipt_enable domctl.
>>> Xen has traditionally opted for a "and turn this extra thing on
>>> dynamically" model, but this has caused no end of security issues and
>>> broken corner cases.
>>>
>>> You can see this still existing in the difference between
>>> XEN_DOMCTL_createdomain and XEN_DOMCTL_max_vcpus, (the latter being
>>> required to chose the number of vcpus for the domain) and we're making
>>> good progress undoing this particular wart (before 4.13, it was
>>> concerning easy to get Xen to fall over a NULL d->vcpu[] pointer by
>>> issuing other hypercalls between these two).
>>>
>>> There is a lot of settings which should be immutable for the lifetime of
>>> the domain, and external monitoring looks like another one of these.
>>> Specifying it at createdomain time allows for far better runtime
>>> behaviour (you are no longer in a situation where the first time you try
>>> to turn tracing on, you end up with -ENOMEM because another VM booted in
>>> the meantime and used the remaining memory), and it makes for rather
>>> more simple code in Xen itself (at runtime, you can rely on it having
>>> been set up properly, because a failure setting up will have killed the
>>> domain already).
>> I'm not in favor of this being a flag that gets set during domain
>> creation time. It could certainly be the case that some users would
>> want this being on from the start till the end but in other cases you
>> may want to enable it intermittently only for some time in-between
>> particular events. If it's an on/off flag during domain creation you
>> pretty much force that choice on the users and while the overhead of
>> PT is better than say MTF it's certainly not nothing. In case there is
>> an OOM situation enabling IPT dynamically the user can always just
>> pause the VM and wait till memory becomes available.
> 
> There is nothing wrong with having "turn tracing on/off at runtime"
> hypercalls.  It is specifically what I suggested two posts up in this
> thread, but it should be limited to the TraceEn bit in RTIT_CTL.
> 
> What isn't ok is trying to allocate the buffers, write the TOPA, etc on
> first-enable or first-map, because the runtime complexity of logic like
> this large, and far too easy to get wrong in security relevant ways.
> 
> The domain create flag would mean "I wish to use tracing with this
> domain", and not "I want tracing enabled from the getgo".
> 


I'm trying to implement this suggestion right now. I've already switched to acquire_resource and now I want to make buffers statically allocated on domain creation.

I think it would be reasonable to add an option like "vmtrace_ipt_size" in xl.cfg. By default it would be 0 (meaning "disabled"), and when it's set to non-zero value by the user, the IPT buffers of given size will be allocated for each vCPU upon domain creation.

Could you give some hints about how to correctly add a new option in xl.cfg in a way that it's accessible on the hypervisor part? This part related to configuration parsing/processing is what I don't understand yet.


>>>>> What semantics do you want for the buffer becoming full?  Given that
>>>>> debugging/tracing is the goal, I presume "pause vcpu on full" is the
>>>>> preferred behaviour, rather than drop packets on full?
>>>>>
>>>> Right now this is a ring-style buffer and when it would become full it would
>>>> simply wrap and override the old data.
>>> How does the consumer spot that the data has wrapped?  What happens if
>>> data starts getting logged, but noone is listening?  What happens if the
>>> consumer exits/crashes/etc and stops listening as a consequence?
>>>
>>> It's fine to simply state what will happen, and possibly even "don't do
>>> that then", but the corner cases do at least need thinking about.
>> AFAIU the current use-case is predominantly to be used in conjunction
>> with VMI events where you want to be able to see the trace leading up
>> to a particular vmexit. So in the case when the buffer is wrapped
>> in-between events and data is lost that's not really of concern.
> 
> That's all fine.  I imagine the output here is voluminous, and needs
> help being cut down as much as possible.
> 
> On a tangent, I presume you'd like to include VM-fork eventually, which
> ought to include copying the trace buffer on fork?
> 
> ~Andrew