Message ID | 1005194077.9820950.1592523663199.JavaMail.zimbra@cert.pl (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | Implement support for external IPT monitoring | expand |
----- 19 cze 2020 o 1:41, Michał Leszczyński michal.leszczynski@cert.pl napisał(a): > Provide an interface for privileged domains to manage > external IPT monitoring. Guest IPT state will be preserved > across vmentry/vmexit using ipt_state structure. > > Signed-off-by: Michal Leszczynski <michal.leszczynski@cert.pl> > --- ... > +struct xen_hvm_vmtrace_op { > + /* IN variable */ > + uint32_t version; /* HVMOP_VMTRACE_INTERFACE_VERSION */ > + uint32_t cmd; > +/* Enable/disable external vmtrace for given domain */ > +#define HVMOP_vmtrace_ipt_enable 1 > +#define HVMOP_vmtrace_ipt_disable 2 > +#define HVMOP_vmtrace_ipt_get_offset 3 > + domid_t domain; > + uint32_t vcpu; > + uint64_t size; > + > + /* OUT variable */ > + uint64_t offset; > +}; > +typedef struct xen_hvm_vmtrace_op xen_hvm_vmtrace_op_t; > +DEFINE_XEN_GUEST_HANDLE(xen_hvm_vmtrace_op_t); > + I've forgotten about the padding thing here. This will be fixed in the next patch version, sorry. ml
On Fri, Jun 19, 2020 at 01:41:03AM +0200, Michał Leszczyński wrote: > Provide an interface for privileged domains to manage > external IPT monitoring. Guest IPT state will be preserved > across vmentry/vmexit using ipt_state structure. Thanks! I have some comments below, some of them are cosmetic coding style issues. Sorry the Xen coding style is quite different from other projects, and it takes a bit to get used to. > Signed-off-by: Michal Leszczynski <michal.leszczynski@cert.pl> > --- > xen/arch/x86/hvm/hvm.c | 167 +++++++++++++++++++++++++++++ > xen/arch/x86/hvm/vmx/vmx.c | 24 +++++ > xen/arch/x86/mm.c | 37 +++++++ > xen/common/domain.c | 1 + > xen/include/asm-x86/hvm/vmx/vmcs.h | 16 +++ > xen/include/public/hvm/hvm_op.h | 23 ++++ > xen/include/public/hvm/params.h | 5 +- > xen/include/public/memory.h | 1 + > xen/include/xen/sched.h | 3 + > 9 files changed, 276 insertions(+), 1 deletion(-) > > diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c > index 5bb47583b3..145ad053d2 100644 > --- a/xen/arch/x86/hvm/hvm.c > +++ b/xen/arch/x86/hvm/hvm.c > @@ -1612,6 +1612,24 @@ int hvm_vcpu_initialise(struct vcpu *v) > return rc; > } > > +void hvm_vmtrace_destroy(struct vcpu *v) > +{ > + unsigned int i; > + struct page_info *pg; > + struct ipt_state *ipt = v->arch.hvm.vmx.ipt_state; > + mfn_t buf_mfn = ipt->output_base >> PAGE_SHIFT; Does this build? I think you are missing a _mfn(...) here? > + size_t buf_size = ipt->output_mask.size + 1; > + > + xfree(ipt); > + v->arch.hvm.vmx.ipt_state = NULL; > + > + for ( i = 0; i < (buf_size >> PAGE_SHIFT); i++ ) > + { > + pg = mfn_to_page(_mfn(mfn_add(buf_mfn, i))); > + free_domheap_page(pg); There's no need for the loop, just use free_domheap_pages and free the whole allocated area at once. Also you can use maddr_to_page to get the page from output_base. 
> + } > +} > + > void hvm_vcpu_destroy(struct vcpu *v) > { > viridian_vcpu_deinit(v); > @@ -1631,6 +1649,8 @@ void hvm_vcpu_destroy(struct vcpu *v) > vlapic_destroy(v); > > hvm_vcpu_cacheattr_destroy(v); > + > + hvm_vmtrace_destroy(v); > } > > void hvm_vcpu_down(struct vcpu *v) > @@ -4066,6 +4086,51 @@ static int hvmop_set_evtchn_upcall_vector( > return 0; > } > > +static int hvm_set_vmtrace_pt_size(struct domain *d, uint64_t value) No need for the hvm_ prefix since it's a local function, and then I would name it: vmtrace_alloc_buffers(struct domain *d, uint64_t size) I would name the variable size, so its purpose is clearer. > +{ > + void *buf; > + unsigned int buf_order; > + struct page_info *pg; > + struct ipt_state *ipt; > + struct vcpu *v; > + > + if ( value < PAGE_SIZE || > + value > GB(4) || > + ( value & (value - 1) ) ) { Extra spaces, and no need to place each condition on a new line as long as you don't exceed the 80 character limit. Also the opening brace should be on a newline: if ( value < PAGE_SIZE || value > GB(4) || (value & (value - 1)) ) { > + /* we don't accept trace buffer size smaller than single page > + * and the upper bound is defined as 4GB in the specification */ Comment style is wrong, according to CODING_STYLE this should be: /* * We don't accept trace buffer size smaller than single page * and the upper bound is defined as 4GB in the specification. */ Since you mention the limitations of the buffer size, I would also mention that size must be a power of 2. > + return -EINVAL; > + } > + > + for_each_vcpu ( d, v ) > + { > + buf_order = get_order_from_bytes(value); There's no need for the buf_order variable, just place the call inside alloc_domheap_pages. > + pg = alloc_domheap_pages(d, buf_order, MEMF_no_refcount); You can define the variable here: struct page_info *pg = alloc_domheap_pages(d, get_order_from_bytes(value), MEMF_no_refcount); ipt variable can also be defined here to reduce its scope. 
> + > + if ( !pg ) > + return -EFAULT; -ENOMEM > + > + buf = page_to_virt(pg); Why do you need buf? You seem to get the virtual address of the page(s) just to use it in virt_to_mfn, but you can directly use page_to_maddr and remove the buf variable (and the shift done below). > + > + if ( vmx_add_host_load_msr(v, MSR_RTIT_CTL, 0) ) > + return -EFAULT; You leak the pg allocation here... > + > + ipt = xmalloc(struct ipt_state); If you use xzalloc you can avoid setting the fields to 0 below. > + > + if ( !ipt ) > + return -EFAULT; ... and here in the failure paths. This should be -ENOMEM also. > + > + ipt->output_base = virt_to_mfn(buf) << PAGE_SHIFT; > + ipt->output_mask.raw = value - 1; > + ipt->status = 0; > + ipt->active = 0; > + > + v->arch.hvm.vmx.ipt_state = ipt; > + } > + > + return 0; > +} > + > static int hvm_allow_set_param(struct domain *d, > uint32_t index, > uint64_t new_value) > @@ -4127,6 +4192,7 @@ static int hvm_allow_set_param(struct domain *d, > case HVM_PARAM_NR_IOREQ_SERVER_PAGES: > case HVM_PARAM_ALTP2M: > case HVM_PARAM_MCA_CAP: > + case HVM_PARAM_VMTRACE_PT_SIZE: > if ( value != 0 && new_value != value ) > rc = -EEXIST; > break; > @@ -4328,6 +4394,9 @@ static int hvm_set_param(struct domain *d, uint32_t index, uint64_t value) > case HVM_PARAM_MCA_CAP: > rc = vmce_enable_mca_cap(d, value); > break; > + case HVM_PARAM_VMTRACE_PT_SIZE: > + rc = hvm_set_vmtrace_pt_size(d, value); > + break; > } > > if ( !rc ) > @@ -4949,6 +5018,100 @@ static int compat_altp2m_op( > return rc; > } > > +static int do_vmtrace_op(XEN_GUEST_HANDLE_PARAM(void) arg) > +{ > + struct xen_hvm_vmtrace_op a; > + struct domain *d; > + int rc; > + struct vcpu *v; > + struct ipt_state *ipt; > + > + if ( !hvm_pt_supported() ) > + return -EOPNOTSUPP; > + > + if ( copy_from_guest(&a, arg, 1) ) > + return -EFAULT; > + > + if ( a.version != HVMOP_VMTRACE_INTERFACE_VERSION ) > + return -EINVAL; > + > + d = rcu_lock_domain_by_any_id(a.domain); > + spin_lock(&d->vmtrace_lock); This must 
be moved after you have checked d is not NULL. > + > + if ( d == NULL ) > + return -ESRCH; > + > + if ( !is_hvm_domain(d) ) You don't need to hold vmtrace_lock to check this... > + { > + rc = -EOPNOTSUPP; > + goto out; > + } > + > + domain_pause(d); Do you really need to pause the whole domain, or just the vcpu whose parameters you are about to modify? Also for HVMOP_vmtrace_ipt_get_offset there's no need to pause anything? > + > + if ( a.vcpu >= d->max_vcpus ) ... neither this. I think you can reduce the locked section a bit. > + { > + rc = -EINVAL; > + goto out; > + } > + > + v = d->vcpu[a.vcpu]; > + ipt = v->arch.hvm.vmx.ipt_state; > + > + if ( !ipt ) > + { > + /* > + * PT must be first initialized upon domain creation. > + */ You have a couple of hard tabs in the lines above. Also single line comments are /* ... */. > + rc = -EINVAL; > + goto out; > + } > + > + switch ( a.cmd ) > + { > + case HVMOP_vmtrace_ipt_enable: > + if ( vmx_add_guest_msr(v, MSR_RTIT_CTL, > + RTIT_CTL_TRACEEN | RTIT_CTL_OS | > + RTIT_CTL_USR | RTIT_CTL_BRANCH_EN) ) > + { > + rc = -EFAULT; vmx_add_guest_msr already returns a valid error code, please don't replace it with -EFAULT unconditionally (here and below). > + goto out; > + } > + > + ipt->active = 1; > + break; Please add an empty newline after break;. 
> + case HVMOP_vmtrace_ipt_disable: > + if ( vmx_add_guest_msr(v, MSR_RTIT_CTL, 0) ) > + { > + rc = -EFAULT; > + goto out; > + } > + > + ipt->active = 0; > + break; > + case HVMOP_vmtrace_ipt_get_offset: > + a.offset = ipt->output_mask.offset; > + break; > + default: > + rc = -EOPNOTSUPP; > + goto out; > + } > + > + rc = -EFAULT; > + if ( __copy_to_guest(arg, &a, 1) ) > + goto out; > + rc = 0; > + > + out: > + domain_unpause(d); > + spin_unlock(&d->vmtrace_lock); > + rcu_unlock_domain(d); > + > + return rc; > +} > + > +DEFINE_XEN_GUEST_HANDLE(compat_hvm_vmtrace_op_t); > + > static int hvmop_get_mem_type( > XEN_GUEST_HANDLE_PARAM(xen_hvm_get_mem_type_t) arg) > { > @@ -5101,6 +5264,10 @@ long do_hvm_op(unsigned long op, XEN_GUEST_HANDLE_PARAM(void) arg) > rc = current->hcall_compat ? compat_altp2m_op(arg) : do_altp2m_op(arg); > break; > > + case HVMOP_vmtrace: > + rc = do_vmtrace_op(arg); > + break; > + > default: > { > gdprintk(XENLOG_DEBUG, "Bad HVM op %ld.\n", op); > diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c > index ab19d9424e..51f0046483 100644 > --- a/xen/arch/x86/hvm/vmx/vmx.c > +++ b/xen/arch/x86/hvm/vmx/vmx.c > @@ -508,11 +508,25 @@ static void vmx_restore_host_msrs(void) > > static void vmx_save_guest_msrs(struct vcpu *v) > { > + uint64_t rtit_ctl; > + > /* > * We cannot cache SHADOW_GS_BASE while the VCPU runs, as it can > * be updated at any time via SWAPGS, which we cannot trap. > */ > v->arch.hvm.vmx.shadow_gs = rdgsshadow(); > + > + if ( unlikely(v->arch.hvm.vmx.ipt_state && v->arch.hvm.vmx.ipt_state->active) ) > + { > + smp_rmb(); If this barrier is required I think it warrants a comment about why it's needed. 
> + rdmsrl(MSR_RTIT_CTL, rtit_ctl); > + > + if ( rtit_ctl & RTIT_CTL_TRACEEN ) > + BUG(); BUG_ON(rtit_ctl & RTIT_CTL_TRACEEN); > + > + rdmsrl(MSR_RTIT_STATUS, v->arch.hvm.vmx.ipt_state->status); > + rdmsrl(MSR_RTIT_OUTPUT_MASK, v->arch.hvm.vmx.ipt_state->output_mask.raw); > + } > } > > static void vmx_restore_guest_msrs(struct vcpu *v) > @@ -524,6 +538,16 @@ static void vmx_restore_guest_msrs(struct vcpu *v) > > if ( cpu_has_msr_tsc_aux ) > wrmsr_tsc_aux(v->arch.msrs->tsc_aux); > + > + if ( unlikely(v->arch.hvm.vmx.ipt_state && v->arch.hvm.vmx.ipt_state->active) ) Line is too long. > + { > + wrmsrl(MSR_RTIT_OUTPUT_BASE, > + v->arch.hvm.vmx.ipt_state->output_base); > + wrmsrl(MSR_RTIT_OUTPUT_MASK, > + v->arch.hvm.vmx.ipt_state->output_mask.raw); > + wrmsrl(MSR_RTIT_STATUS, > + v->arch.hvm.vmx.ipt_state->status); Can you please align to the start of the parentheses: wrmsrl(MSR_RTIT_STATUS, v->arch.hvm.vmx.ipt_state->status); > + } > } > > void vmx_update_cpu_exec_control(struct vcpu *v) > diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c > index e376fc7e8f..e4658dc27f 100644 > --- a/xen/arch/x86/mm.c > +++ b/xen/arch/x86/mm.c > @@ -4624,6 +4624,43 @@ int arch_acquire_resource(struct domain *d, unsigned int type, > } > break; > } > + > + case XENMEM_resource_vmtrace_buf: > + { > + mfn_t mfn; > + unsigned int i; > + struct ipt_state *ipt; > + > + if ( id >= d->max_vcpus ) > + { > + rc = -EINVAL; You can just set rc = -EINVAL at the top and then avoid setting it on every error case. > + break; > + } > + > + ipt = d->vcpu[id]->arch.hvm.vmx.ipt_state; > + > + if ( !ipt ) > + { > + rc = -EINVAL; > + break; > + } > + > + mfn = mfn_x(ipt->output_base >> PAGE_SHIFT); > + > + if (nr_frames > ( ( ipt->output_mask.size + 1 ) >> PAGE_SHIFT )) Spaces are wrongly positioned. 
ie: if ( frame + nr_frames > ((ipt->output_mask.size + 1) >> PAGE_SHIFT) ) > + { > + rc = -EINVAL; > + break; > + } > + > + rc = 0; > + for ( i = 0; i < nr_frames; i++ ) I think you should init i to the 'frame' field from xen_mem_acquire_resource, since it's possible to select the offset you want to map from a resource. You will also need to take it into account, because IMO that's where I would store the last processed frame when dealing with continuations. > + { > + mfn_list[i] = mfn_add(mfn, i); > + } > + > + break; > + } > #endif > > default: > diff --git a/xen/common/domain.c b/xen/common/domain.c > index 7cc9526139..6f6458cd5b 100644 > --- a/xen/common/domain.c > +++ b/xen/common/domain.c > @@ -414,6 +414,7 @@ struct domain *domain_create(domid_t domid, > d->shutdown_code = SHUTDOWN_CODE_INVALID; > > spin_lock_init(&d->pbuf_lock); > + spin_lock_init(&d->vmtrace_lock); > > rwlock_init(&d->vnuma_rwlock); > > diff --git a/xen/include/asm-x86/hvm/vmx/vmcs.h b/xen/include/asm-x86/hvm/vmx/vmcs.h > index 4c81093aba..9fc4653777 100644 > --- a/xen/include/asm-x86/hvm/vmx/vmcs.h > +++ b/xen/include/asm-x86/hvm/vmx/vmcs.h > @@ -104,6 +104,19 @@ struct pi_blocking_vcpu { > spinlock_t *lock; > }; > > +struct ipt_state { > + uint64_t active; Could this be a boolean? > + uint64_t status; > + uint64_t output_base; > + union { > + uint64_t raw; > + struct { > + uint32_t size; > + uint32_t offset; > + }; > + } output_mask; > +}; > + > struct vmx_vcpu { > /* Physical address of VMCS. */ > paddr_t vmcs_pa; > @@ -186,6 +199,9 @@ struct vmx_vcpu { > * pCPU and wakeup the related vCPU. 
> */ > struct pi_blocking_vcpu pi_blocking; > + > + /* State of Intel Processor Trace feature */ > + struct ipt_state *ipt_state; > }; > > int vmx_create_vmcs(struct vcpu *v); > diff --git a/xen/include/public/hvm/hvm_op.h b/xen/include/public/hvm/hvm_op.h > index 870ec52060..8cd0b6ea49 100644 > --- a/xen/include/public/hvm/hvm_op.h > +++ b/xen/include/public/hvm/hvm_op.h > @@ -382,6 +382,29 @@ struct xen_hvm_altp2m_op { > typedef struct xen_hvm_altp2m_op xen_hvm_altp2m_op_t; > DEFINE_XEN_GUEST_HANDLE(xen_hvm_altp2m_op_t); > > +/* HVMOP_vmtrace: Perform VM tracing related operation */ > +#define HVMOP_vmtrace 26 > + > +#define HVMOP_VMTRACE_INTERFACE_VERSION 0x00000001 > + > +struct xen_hvm_vmtrace_op { > + /* IN variable */ > + uint32_t version; /* HVMOP_VMTRACE_INTERFACE_VERSION */ > + uint32_t cmd; > +/* Enable/disable external vmtrace for given domain */ > +#define HVMOP_vmtrace_ipt_enable 1 > +#define HVMOP_vmtrace_ipt_disable 2 > +#define HVMOP_vmtrace_ipt_get_offset 3 > + domid_t domain; > + uint32_t vcpu; > + uint64_t size; > + > + /* OUT variable */ > + uint64_t offset; > +}; > +typedef struct xen_hvm_vmtrace_op xen_hvm_vmtrace_op_t; > +DEFINE_XEN_GUEST_HANDLE(xen_hvm_vmtrace_op_t); > + > #endif /* __XEN_PUBLIC_HVM_HVM_OP_H__ */ > > /* > diff --git a/xen/include/public/hvm/params.h b/xen/include/public/hvm/params.h > index 0a91bfa749..adbc7e5d08 100644 > --- a/xen/include/public/hvm/params.h > +++ b/xen/include/public/hvm/params.h > @@ -300,6 +300,9 @@ > #define XEN_HVM_MCA_CAP_LMCE (xen_mk_ullong(1) << 0) > #define XEN_HVM_MCA_CAP_MASK XEN_HVM_MCA_CAP_LMCE > > -#define HVM_NR_PARAMS 39 > +/* VM trace capabilities */ > +#define HVM_PARAM_VMTRACE_PT_SIZE 39 I think it was recommended that IPT should be set at domain creation, but using a HVM param still allows you to init this after the domain has been created. I think it might be best to just add a new field to xen_domctl_createdomain, as it seems like the interface could also be helpful for other arches? 
Thanks, Roger.
On 19.06.2020 17:30, Roger Pau Monné wrote: > On Fri, Jun 19, 2020 at 01:41:03AM +0200, Michał Leszczyński wrote: >> --- a/xen/arch/x86/hvm/hvm.c >> +++ b/xen/arch/x86/hvm/hvm.c >> @@ -1612,6 +1612,24 @@ int hvm_vcpu_initialise(struct vcpu *v) >> return rc; >> } >> >> +void hvm_vmtrace_destroy(struct vcpu *v) >> +{ >> + unsigned int i; >> + struct page_info *pg; >> + struct ipt_state *ipt = v->arch.hvm.vmx.ipt_state; >> + mfn_t buf_mfn = ipt->output_base >> PAGE_SHIFT; > > Does this build? I think you are missing a _mfn(...) here? This as well as ... >> + size_t buf_size = ipt->output_mask.size + 1; >> + >> + xfree(ipt); >> + v->arch.hvm.vmx.ipt_state = NULL; >> + >> + for ( i = 0; i < (buf_size >> PAGE_SHIFT); i++ ) >> + { >> + pg = mfn_to_page(_mfn(mfn_add(buf_mfn, i))); ... the extra _mfn() here suggest the code was only ever built in release mode so far. Jan
----- 19 cze 2020 o 17:50, Jan Beulich jbeulich@suse.com napisał(a): > On 19.06.2020 17:30, Roger Pau Monné wrote: >> On Fri, Jun 19, 2020 at 01:41:03AM +0200, Michał Leszczyński wrote: >>> --- a/xen/arch/x86/hvm/hvm.c >>> +++ b/xen/arch/x86/hvm/hvm.c >>> @@ -1612,6 +1612,24 @@ int hvm_vcpu_initialise(struct vcpu *v) >>> return rc; >>> } >>> >>> +void hvm_vmtrace_destroy(struct vcpu *v) >>> +{ >>> + unsigned int i; >>> + struct page_info *pg; >>> + struct ipt_state *ipt = v->arch.hvm.vmx.ipt_state; >>> + mfn_t buf_mfn = ipt->output_base >> PAGE_SHIFT; >> >> Does this build? I think you are missing a _mfn(...) here? > > This as well as ... > >>> + size_t buf_size = ipt->output_mask.size + 1; >>> + >>> + xfree(ipt); >>> + v->arch.hvm.vmx.ipt_state = NULL; >>> + >>> + for ( i = 0; i < (buf_size >> PAGE_SHIFT); i++ ) >>> + { >>> + pg = mfn_to_page(_mfn(mfn_add(buf_mfn, i))); > > ... the extra _mfn() here suggest the code was only ever built in > release mode so far. > > Jan Ah, I forgot to enable developer checks. This will be corrected in v3. ml
----- 19 cze 2020 o 1:41, Michał Leszczyński michal.leszczynski@cert.pl napisał(a): > Provide an interface for privileged domains to manage > external IPT monitoring. Guest IPT state will be preserved > across vmentry/vmexit using ipt_state structure. > > Signed-off-by: Michal Leszczynski <michal.leszczynski@cert.pl> > --- > xen/arch/x86/hvm/hvm.c | 167 +++++++++++++++++++++++++++++ > xen/arch/x86/hvm/vmx/vmx.c | 24 +++++ > xen/arch/x86/mm.c | 37 +++++++ > xen/common/domain.c | 1 + > xen/include/asm-x86/hvm/vmx/vmcs.h | 16 +++ > xen/include/public/hvm/hvm_op.h | 23 ++++ > xen/include/public/hvm/params.h | 5 +- > xen/include/public/memory.h | 1 + > xen/include/xen/sched.h | 3 + > 9 files changed, 276 insertions(+), 1 deletion(-) > > diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c > index 5bb47583b3..145ad053d2 100644 > --- a/xen/arch/x86/hvm/hvm.c > +++ b/xen/arch/x86/hvm/hvm.c > @@ -1612,6 +1612,24 @@ int hvm_vcpu_initialise(struct vcpu *v) > return rc; > } > > +void hvm_vmtrace_destroy(struct vcpu *v) > +{ > + unsigned int i; > + struct page_info *pg; > + struct ipt_state *ipt = v->arch.hvm.vmx.ipt_state; > + mfn_t buf_mfn = ipt->output_base >> PAGE_SHIFT; > + size_t buf_size = ipt->output_mask.size + 1; > + > + xfree(ipt); > + v->arch.hvm.vmx.ipt_state = NULL; > + > + for ( i = 0; i < (buf_size >> PAGE_SHIFT); i++ ) > + { > + pg = mfn_to_page(_mfn(mfn_add(buf_mfn, i))); > + free_domheap_page(pg); > + } > +} > + > void hvm_vcpu_destroy(struct vcpu *v) > { > viridian_vcpu_deinit(v); > @@ -1631,6 +1649,8 @@ void hvm_vcpu_destroy(struct vcpu *v) > vlapic_destroy(v); > > hvm_vcpu_cacheattr_destroy(v); > + > + hvm_vmtrace_destroy(v); > } > > void hvm_vcpu_down(struct vcpu *v) > @@ -4066,6 +4086,51 @@ static int hvmop_set_evtchn_upcall_vector( > return 0; > } > > +static int hvm_set_vmtrace_pt_size(struct domain *d, uint64_t value) > +{ > + void *buf; > + unsigned int buf_order; > + struct page_info *pg; > + struct ipt_state *ipt; > + struct vcpu *v; > + > 
+ if ( value < PAGE_SIZE || > + value > GB(4) || > + ( value & (value - 1) ) ) { > + /* we don't accept trace buffer size smaller than single page > + * and the upper bound is defined as 4GB in the specification */ > + return -EINVAL; > + } > + > + for_each_vcpu ( d, v ) > + { > + buf_order = get_order_from_bytes(value); > + pg = alloc_domheap_pages(d, buf_order, MEMF_no_refcount); > + > + if ( !pg ) > + return -EFAULT; > + > + buf = page_to_virt(pg); > + > + if ( vmx_add_host_load_msr(v, MSR_RTIT_CTL, 0) ) > + return -EFAULT; > + > + ipt = xmalloc(struct ipt_state); > + > + if ( !ipt ) > + return -EFAULT; > + > + ipt->output_base = virt_to_mfn(buf) << PAGE_SHIFT; > + ipt->output_mask.raw = value - 1; > + ipt->status = 0; > + ipt->active = 0; > + > + v->arch.hvm.vmx.ipt_state = ipt; > + } > + > + return 0; > +} > + > static int hvm_allow_set_param(struct domain *d, > uint32_t index, > uint64_t new_value) > @@ -4127,6 +4192,7 @@ static int hvm_allow_set_param(struct domain *d, > case HVM_PARAM_NR_IOREQ_SERVER_PAGES: > case HVM_PARAM_ALTP2M: > case HVM_PARAM_MCA_CAP: > + case HVM_PARAM_VMTRACE_PT_SIZE: > if ( value != 0 && new_value != value ) > rc = -EEXIST; > break; > @@ -4328,6 +4394,9 @@ static int hvm_set_param(struct domain *d, uint32_t index, > uint64_t value) > case HVM_PARAM_MCA_CAP: > rc = vmce_enable_mca_cap(d, value); > break; > + case HVM_PARAM_VMTRACE_PT_SIZE: > + rc = hvm_set_vmtrace_pt_size(d, value); > + break; > } > > if ( !rc ) > @@ -4949,6 +5018,100 @@ static int compat_altp2m_op( > return rc; > } > > +static int do_vmtrace_op(XEN_GUEST_HANDLE_PARAM(void) arg) > +{ > + struct xen_hvm_vmtrace_op a; > + struct domain *d; > + int rc; > + struct vcpu *v; > + struct ipt_state *ipt; > + > + if ( !hvm_pt_supported() ) > + return -EOPNOTSUPP; > + > + if ( copy_from_guest(&a, arg, 1) ) > + return -EFAULT; > + > + if ( a.version != HVMOP_VMTRACE_INTERFACE_VERSION ) > + return -EINVAL; > + > + d = rcu_lock_domain_by_any_id(a.domain); > + 
spin_lock(&d->vmtrace_lock); > + > + if ( d == NULL ) > + return -ESRCH; > + > + if ( !is_hvm_domain(d) ) > + { > + rc = -EOPNOTSUPP; > + goto out; > + } > + > + domain_pause(d); > + > + if ( a.vcpu >= d->max_vcpus ) > + { > + rc = -EINVAL; > + goto out; > + } > + > + v = d->vcpu[a.vcpu]; > + ipt = v->arch.hvm.vmx.ipt_state; > + > + if ( !ipt ) > + { > + /* > + * PT must be first initialized upon domain creation. > + */ > + rc = -EINVAL; > + goto out; > + } > + > + switch ( a.cmd ) > + { > + case HVMOP_vmtrace_ipt_enable: > + if ( vmx_add_guest_msr(v, MSR_RTIT_CTL, > + RTIT_CTL_TRACEEN | RTIT_CTL_OS | > + RTIT_CTL_USR | RTIT_CTL_BRANCH_EN) ) > + { > + rc = -EFAULT; > + goto out; > + } > + > + ipt->active = 1; > + break; > + case HVMOP_vmtrace_ipt_disable: > + if ( vmx_add_guest_msr(v, MSR_RTIT_CTL, 0) ) > + { > + rc = -EFAULT; > + goto out; > + } > + > + ipt->active = 0; > + break; > + case HVMOP_vmtrace_ipt_get_offset: > + a.offset = ipt->output_mask.offset; > + break; > + default: > + rc = -EOPNOTSUPP; > + goto out; > + } > + > + rc = -EFAULT; > + if ( __copy_to_guest(arg, &a, 1) ) > + goto out; > + rc = 0; > + > + out: > + domain_unpause(d); > + spin_unlock(&d->vmtrace_lock); > + rcu_unlock_domain(d); > + > + return rc; > +} > + > +DEFINE_XEN_GUEST_HANDLE(compat_hvm_vmtrace_op_t); > + > static int hvmop_get_mem_type( > XEN_GUEST_HANDLE_PARAM(xen_hvm_get_mem_type_t) arg) > { > @@ -5101,6 +5264,10 @@ long do_hvm_op(unsigned long op, > XEN_GUEST_HANDLE_PARAM(void) arg) > rc = current->hcall_compat ? 
compat_altp2m_op(arg) : do_altp2m_op(arg); > break; > > + case HVMOP_vmtrace: > + rc = do_vmtrace_op(arg); > + break; > + > default: > { > gdprintk(XENLOG_DEBUG, "Bad HVM op %ld.\n", op); > diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c > index ab19d9424e..51f0046483 100644 > --- a/xen/arch/x86/hvm/vmx/vmx.c > +++ b/xen/arch/x86/hvm/vmx/vmx.c > @@ -508,11 +508,25 @@ static void vmx_restore_host_msrs(void) > > static void vmx_save_guest_msrs(struct vcpu *v) > { > + uint64_t rtit_ctl; > + > /* > * We cannot cache SHADOW_GS_BASE while the VCPU runs, as it can > * be updated at any time via SWAPGS, which we cannot trap. > */ > v->arch.hvm.vmx.shadow_gs = rdgsshadow(); > + > + if ( unlikely(v->arch.hvm.vmx.ipt_state && > v->arch.hvm.vmx.ipt_state->active) ) > + { > + smp_rmb(); > + rdmsrl(MSR_RTIT_CTL, rtit_ctl); > + > + if ( rtit_ctl & RTIT_CTL_TRACEEN ) > + BUG(); > + > + rdmsrl(MSR_RTIT_STATUS, v->arch.hvm.vmx.ipt_state->status); > + rdmsrl(MSR_RTIT_OUTPUT_MASK, > v->arch.hvm.vmx.ipt_state->output_mask.raw); > + } > } It was suggested to move this piece of code from vm-entry/vm-exit handling to vmx_save_guest_msrs/vmx_restore_guest_msrs functions. Please note that we do need to periodically read MSR_RTIT_OUTPUT_MASK in order to know how much data was written into the buffer by the processor. During tests, I've spotted that in some cases vCPUs get out of scope very rarely. For instance: when IPT gets enabled, xc_vmtrace_ipt_get_offset() is returning zero value for the first few seconds, because it's relying on the value of v->arch.hvm.vmx.ipt_state->output_mask which we only read in vmx_save_guest_msrs() and in some cases this occurs very rarely. Could you guys suggest some solution to the problem? If we would leave this value dirty in hardware, AFAIK we have no possibility to directly access it, but observing this particular value is very important in the runtime. 
Best regards, Michał Leszczyński CERT Polska > > static void vmx_restore_guest_msrs(struct vcpu *v) > @@ -524,6 +538,16 @@ static void vmx_restore_guest_msrs(struct vcpu *v) > > if ( cpu_has_msr_tsc_aux ) > wrmsr_tsc_aux(v->arch.msrs->tsc_aux); > + > + if ( unlikely(v->arch.hvm.vmx.ipt_state && > v->arch.hvm.vmx.ipt_state->active) ) > + { > + wrmsrl(MSR_RTIT_OUTPUT_BASE, > + v->arch.hvm.vmx.ipt_state->output_base); > + wrmsrl(MSR_RTIT_OUTPUT_MASK, > + v->arch.hvm.vmx.ipt_state->output_mask.raw); > + wrmsrl(MSR_RTIT_STATUS, > + v->arch.hvm.vmx.ipt_state->status); > + } > } > > void vmx_update_cpu_exec_control(struct vcpu *v) > diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c > index e376fc7e8f..e4658dc27f 100644 > --- a/xen/arch/x86/mm.c > +++ b/xen/arch/x86/mm.c > @@ -4624,6 +4624,43 @@ int arch_acquire_resource(struct domain *d, unsigned int > type, > } > break; > } > + > + case XENMEM_resource_vmtrace_buf: > + { > + mfn_t mfn; > + unsigned int i; > + struct ipt_state *ipt; > + > + if ( id >= d->max_vcpus ) > + { > + rc = -EINVAL; > + break; > + } > + > + ipt = d->vcpu[id]->arch.hvm.vmx.ipt_state; > + > + if ( !ipt ) > + { > + rc = -EINVAL; > + break; > + } > + > + mfn = mfn_x(ipt->output_base >> PAGE_SHIFT); > + > + if (nr_frames > ( ( ipt->output_mask.size + 1 ) >> PAGE_SHIFT )) > + { > + rc = -EINVAL; > + break; > + } > + > + rc = 0; > + for ( i = 0; i < nr_frames; i++ ) > + { > + mfn_list[i] = mfn_add(mfn, i); > + } > + > + break; > + } > #endif > > default: > diff --git a/xen/common/domain.c b/xen/common/domain.c > index 7cc9526139..6f6458cd5b 100644 > --- a/xen/common/domain.c > +++ b/xen/common/domain.c > @@ -414,6 +414,7 @@ struct domain *domain_create(domid_t domid, > d->shutdown_code = SHUTDOWN_CODE_INVALID; > > spin_lock_init(&d->pbuf_lock); > + spin_lock_init(&d->vmtrace_lock); > > rwlock_init(&d->vnuma_rwlock); > > diff --git a/xen/include/asm-x86/hvm/vmx/vmcs.h > b/xen/include/asm-x86/hvm/vmx/vmcs.h > index 4c81093aba..9fc4653777 100644 > --- 
a/xen/include/asm-x86/hvm/vmx/vmcs.h > +++ b/xen/include/asm-x86/hvm/vmx/vmcs.h > @@ -104,6 +104,19 @@ struct pi_blocking_vcpu { > spinlock_t *lock; > }; > > +struct ipt_state { > + uint64_t active; > + uint64_t status; > + uint64_t output_base; > + union { > + uint64_t raw; > + struct { > + uint32_t size; > + uint32_t offset; > + }; > + } output_mask; > +}; > + > struct vmx_vcpu { > /* Physical address of VMCS. */ > paddr_t vmcs_pa; > @@ -186,6 +199,9 @@ struct vmx_vcpu { > * pCPU and wakeup the related vCPU. > */ > struct pi_blocking_vcpu pi_blocking; > + > + /* State of Intel Processor Trace feature */ > + struct ipt_state *ipt_state; > }; > > int vmx_create_vmcs(struct vcpu *v); > diff --git a/xen/include/public/hvm/hvm_op.h b/xen/include/public/hvm/hvm_op.h > index 870ec52060..8cd0b6ea49 100644 > --- a/xen/include/public/hvm/hvm_op.h > +++ b/xen/include/public/hvm/hvm_op.h > @@ -382,6 +382,29 @@ struct xen_hvm_altp2m_op { > typedef struct xen_hvm_altp2m_op xen_hvm_altp2m_op_t; > DEFINE_XEN_GUEST_HANDLE(xen_hvm_altp2m_op_t); > > +/* HVMOP_vmtrace: Perform VM tracing related operation */ > +#define HVMOP_vmtrace 26 > + > +#define HVMOP_VMTRACE_INTERFACE_VERSION 0x00000001 > + > +struct xen_hvm_vmtrace_op { > + /* IN variable */ > + uint32_t version; /* HVMOP_VMTRACE_INTERFACE_VERSION */ > + uint32_t cmd; > +/* Enable/disable external vmtrace for given domain */ > +#define HVMOP_vmtrace_ipt_enable 1 > +#define HVMOP_vmtrace_ipt_disable 2 > +#define HVMOP_vmtrace_ipt_get_offset 3 > + domid_t domain; > + uint32_t vcpu; > + uint64_t size; > + > + /* OUT variable */ > + uint64_t offset; > +}; > +typedef struct xen_hvm_vmtrace_op xen_hvm_vmtrace_op_t; > +DEFINE_XEN_GUEST_HANDLE(xen_hvm_vmtrace_op_t); > + > #endif /* __XEN_PUBLIC_HVM_HVM_OP_H__ */ > > /* > diff --git a/xen/include/public/hvm/params.h b/xen/include/public/hvm/params.h > index 0a91bfa749..adbc7e5d08 100644 > --- a/xen/include/public/hvm/params.h > +++ b/xen/include/public/hvm/params.h > @@ -300,6 +300,9 
@@ > #define XEN_HVM_MCA_CAP_LMCE (xen_mk_ullong(1) << 0) > #define XEN_HVM_MCA_CAP_MASK XEN_HVM_MCA_CAP_LMCE > > -#define HVM_NR_PARAMS 39 > +/* VM trace capabilities */ > +#define HVM_PARAM_VMTRACE_PT_SIZE 39 > + > +#define HVM_NR_PARAMS 40 > > #endif /* __XEN_PUBLIC_HVM_PARAMS_H__ */ > diff --git a/xen/include/public/memory.h b/xen/include/public/memory.h > index dbd35305df..f823c784c3 100644 > --- a/xen/include/public/memory.h > +++ b/xen/include/public/memory.h > @@ -620,6 +620,7 @@ struct xen_mem_acquire_resource { > > #define XENMEM_resource_ioreq_server 0 > #define XENMEM_resource_grant_table 1 > +#define XENMEM_resource_vmtrace_buf 2 > > /* > * IN - a type-specific resource identifier, which must be zero > diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h > index ac53519d7f..b3a36f3788 100644 > --- a/xen/include/xen/sched.h > +++ b/xen/include/xen/sched.h > @@ -457,6 +457,9 @@ struct domain > unsigned pbuf_idx; > spinlock_t pbuf_lock; > > + /* Used by vmtrace domctls */ > + spinlock_t vmtrace_lock; > + > /* OProfile support. */ > struct xenoprof *xenoprof; > > -- > 2.20.1
On 22.06.2020 04:56, Michał Leszczyński wrote: > ----- 19 cze 2020 o 1:41, Michał Leszczyński michal.leszczynski@cert.pl napisał(a): >> --- a/xen/arch/x86/hvm/vmx/vmx.c >> +++ b/xen/arch/x86/hvm/vmx/vmx.c >> @@ -508,11 +508,25 @@ static void vmx_restore_host_msrs(void) >> >> static void vmx_save_guest_msrs(struct vcpu *v) >> { >> + uint64_t rtit_ctl; >> + >> /* >> * We cannot cache SHADOW_GS_BASE while the VCPU runs, as it can >> * be updated at any time via SWAPGS, which we cannot trap. >> */ >> v->arch.hvm.vmx.shadow_gs = rdgsshadow(); >> + >> + if ( unlikely(v->arch.hvm.vmx.ipt_state && >> v->arch.hvm.vmx.ipt_state->active) ) >> + { >> + smp_rmb(); >> + rdmsrl(MSR_RTIT_CTL, rtit_ctl); >> + >> + if ( rtit_ctl & RTIT_CTL_TRACEEN ) >> + BUG(); >> + >> + rdmsrl(MSR_RTIT_STATUS, v->arch.hvm.vmx.ipt_state->status); >> + rdmsrl(MSR_RTIT_OUTPUT_MASK, >> v->arch.hvm.vmx.ipt_state->output_mask.raw); >> + } >> } > > > It was suggested to move this piece of code from vm-entry/vm-exit handling to > vmx_save_guest_msrs/vmx_restore_guest_msrs functions. > > Please note that we do need to periodically read MSR_RTIT_OUTPUT_MASK in order > to know how much data was written into the buffer by the processor. During tests, > I've spotted that in some cases vCPUs get out of scope very rarely. > > For instance: when IPT gets enabled, xc_vmtrace_ipt_get_offset() is returning zero > value for the first few seconds, because it's relying on the value of > v->arch.hvm.vmx.ipt_state->output_mask which we only read in vmx_save_guest_msrs() > and in some cases this occurs very rarely. > > Could you guys suggest some solution to the problem? If we would leave this value > dirty in hardware, AFAIK we have no possibility to directly access it, > but observing this particular value is very important in the runtime. Much depends on what state the vCPU in question is in when you need to "periodically read" the value. If it's paused, you may need to invoke sync_vcpu_execstate(). 
If it's actively running you can't get at the value anyway except when you're on the CPU that this vCPU is running on (and then you can RDMSR it quite easily). Any intermediate state between paused and running is prone to change immediately after you've established it anyway, so you need to get the vCPU into one of the two "stable" states in order to get at the register. Also (and I think I said so before) - please trim your replies. Jan
On 19.06.2020 01:41, Michał Leszczyński wrote: > @@ -1631,6 +1649,8 @@ void hvm_vcpu_destroy(struct vcpu *v) > vlapic_destroy(v); > > hvm_vcpu_cacheattr_destroy(v); > + > + hvm_vmtrace_destroy(v); > } Whenever possible resource cleanup should occur from hvm_domain_relinquish_resources(). Is there anything preventing this here? > @@ -4066,6 +4086,51 @@ static int hvmop_set_evtchn_upcall_vector( > return 0; > } > > +static int hvm_set_vmtrace_pt_size(struct domain *d, uint64_t value) > +{ > + void *buf; > + unsigned int buf_order; > + struct page_info *pg; > + struct ipt_state *ipt; > + struct vcpu *v; > + > + if ( value < PAGE_SIZE || > + value > GB(4) || > + ( value & (value - 1) ) ) { > + /* we don't accept trace buffer size smaller than single page > + * and the upper bound is defined as 4GB in the specification */ > + return -EINVAL; > + } > + > + for_each_vcpu ( d, v ) > + { > + buf_order = get_order_from_bytes(value); > + pg = alloc_domheap_pages(d, buf_order, MEMF_no_refcount); > + > + if ( !pg ) > + return -EFAULT; > + > + buf = page_to_virt(pg); In addition to what Roger has said here, just fyi that you can't use page_to_virt() on anything returned from alloc_domheap_pages(), unless you suitably restrict the address range such that the result is guaranteed to have a direct mapping. 
> @@ -4949,6 +5018,100 @@ static int compat_altp2m_op( > return rc; > } > > +static int do_vmtrace_op(XEN_GUEST_HANDLE_PARAM(void) arg) > +{ > + struct xen_hvm_vmtrace_op a; > + struct domain *d; > + int rc; > + struct vcpu *v; > + struct ipt_state *ipt; > + > + if ( !hvm_pt_supported() ) > + return -EOPNOTSUPP; > + > + if ( copy_from_guest(&a, arg, 1) ) > + return -EFAULT; > + > + if ( a.version != HVMOP_VMTRACE_INTERFACE_VERSION ) > + return -EINVAL; > + > + d = rcu_lock_domain_by_any_id(a.domain); > + spin_lock(&d->vmtrace_lock); > + > + if ( d == NULL ) > + return -ESRCH; > + > + if ( !is_hvm_domain(d) ) > + { > + rc = -EOPNOTSUPP; > + goto out; > + } > + > + domain_pause(d); Who's the intended caller of this interface? You making it a hvm-op suggests the guest may itself call this. But of course a guest can't pause itself. If this is supposed to be a tools-only interface, then you should frame it suitably in the public header and of course you need to enforce this here (which would e.g. mean you shouldn't use rcu_lock_domain_by_any_id()). Also please take a look at hvm/ioreq.c, which makes quite a bit of use of domain_pause(). In particular I think you want to acquire the lock only after having paused the domain. > + if ( a.vcpu >= d->max_vcpus ) > + { > + rc = -EINVAL; > + goto out; > + } > + > + v = d->vcpu[a.vcpu]; > + ipt = v->arch.hvm.vmx.ipt_state; > + > + if ( !ipt ) > + { > + /* > + * PT must be first initialized upon domain creation. > + */ > + rc = -EINVAL; > + goto out; > + } > + > + switch ( a.cmd ) > + { > + case HVMOP_vmtrace_ipt_enable: > + if ( vmx_add_guest_msr(v, MSR_RTIT_CTL, > + RTIT_CTL_TRACEEN | RTIT_CTL_OS | > + RTIT_CTL_USR | RTIT_CTL_BRANCH_EN) ) > + { > + rc = -EFAULT; > + goto out; > + } > + > + ipt->active = 1; > + break; > + case HVMOP_vmtrace_ipt_disable: > + if ( vmx_add_guest_msr(v, MSR_RTIT_CTL, 0) ) Shouldn't you rather remove the MSR from the load list here? 
Is any of what you do in this switch() actually legitimate without hvm_set_vmtrace_pt_size() having got called for the guest? From remarks elsewhere I imply you expect the param that you currently use to be set upon domain creation time, but at the very least the potentially big buffer should imo not get allocated up front, but only when tracing is to actually be enabled. Also please have blank lines between individual case blocks. > + { > + rc = -EFAULT; > + goto out; > + } > + > + ipt->active = 0; > + break; > + case HVMOP_vmtrace_ipt_get_offset: > + a.offset = ipt->output_mask.offset; > + break; > + default: > + rc = -EOPNOTSUPP; > + goto out; > + } > + > + rc = -EFAULT; > + if ( __copy_to_guest(arg, &a, 1) ) > + goto out; Only HVMOP_vmtrace_ipt_get_offset needs this - please avoid copying back on other paths. Even there you could in principle copy back just the one field; the function taking XEN_GUEST_HANDLE_PARAM(void) gets in the way of this, though. The last line above also has an indentation issue, but the use of "goto" in this case can be avoided anyway. > + rc = 0; > + > + out: > + domain_unpause(d); > + spin_unlock(&d->vmtrace_lock); > + rcu_unlock_domain(d); > + > + return rc; > +} > + > +DEFINE_XEN_GUEST_HANDLE(compat_hvm_vmtrace_op_t); I don't see any use of this handle further down - why do you force it being declared? > --- a/xen/arch/x86/mm.c > +++ b/xen/arch/x86/mm.c > @@ -4624,6 +4624,43 @@ int arch_acquire_resource(struct domain *d, unsigned int type, > } > break; > } > + > + case XENMEM_resource_vmtrace_buf: > + { > + mfn_t mfn; > + unsigned int i; > + struct ipt_state *ipt; > + > + if ( id >= d->max_vcpus ) > + { > + rc = -EINVAL; > + break; > + } > + > + ipt = d->vcpu[id]->arch.hvm.vmx.ipt_state; Please use domain_vcpu() here. > + if ( !ipt ) > + { > + rc = -EINVAL; > + break; > + } > + > + mfn = mfn_x(ipt->output_base >> PAGE_SHIFT); maddr_to_mfn()? 
> + if (nr_frames > ( ( ipt->output_mask.size + 1 ) >> PAGE_SHIFT )) > + { > + rc = -EINVAL; > + break; > + } > + > + rc = 0; > + for ( i = 0; i < nr_frames; i++ ) > + { > + mfn_list[i] = mfn_add(mfn, i); > + } Please omit the braces in cases like this one. > --- a/xen/include/asm-x86/hvm/vmx/vmcs.h > +++ b/xen/include/asm-x86/hvm/vmx/vmcs.h > @@ -104,6 +104,19 @@ struct pi_blocking_vcpu { > spinlock_t *lock; > }; > > +struct ipt_state { > + uint64_t active; > + uint64_t status; > + uint64_t output_base; > + union { > + uint64_t raw; > + struct { > + uint32_t size; > + uint32_t offset; > + }; > + } output_mask; > +}; While referenced ... > struct vmx_vcpu { > /* Physical address of VMCS. */ > paddr_t vmcs_pa; > @@ -186,6 +199,9 @@ struct vmx_vcpu { > * pCPU and wakeup the related vCPU. > */ > struct pi_blocking_vcpu pi_blocking; > + > + /* State of Intel Processor Trace feature */ > + struct ipt_state *ipt_state; ... here, the struct declaration itself doesn't really belong in this header, as not being VMCS-related. The better place likely is vmx.h. > --- a/xen/include/public/hvm/hvm_op.h > +++ b/xen/include/public/hvm/hvm_op.h > @@ -382,6 +382,29 @@ struct xen_hvm_altp2m_op { > typedef struct xen_hvm_altp2m_op xen_hvm_altp2m_op_t; > DEFINE_XEN_GUEST_HANDLE(xen_hvm_altp2m_op_t); > > +/* HVMOP_vmtrace: Perform VM tracing related operation */ > +#define HVMOP_vmtrace 26 > + > +#define HVMOP_VMTRACE_INTERFACE_VERSION 0x00000001 I'm unconvinced we want to introduce yet another versioned interface. In any event, as hinted at earlier, this suggests it wants to be a tools-only interface instead (which, at least for the time being, is not required to be a stable interface then, but that's also something we apparently want to move away from, and hence you may better not try to rely on it not needing to be stable). 
> +struct xen_hvm_vmtrace_op { > + /* IN variable */ > + uint32_t version; /* HVMOP_VMTRACE_INTERFACE_VERSION */ > + uint32_t cmd; > +/* Enable/disable external vmtrace for given domain */ > +#define HVMOP_vmtrace_ipt_enable 1 > +#define HVMOP_vmtrace_ipt_disable 2 > +#define HVMOP_vmtrace_ipt_get_offset 3 > + domid_t domain; > + uint32_t vcpu; > + uint64_t size; > + > + /* OUT variable */ > + uint64_t offset; If this is to be a tools-only interface, please use uint64_aligned_t. You also want to add an entry to xen/include/xlat.lst and use the resulting macro to prove that the struct layout is the same for native and compat callers. > --- a/xen/include/xen/sched.h > +++ b/xen/include/xen/sched.h > @@ -457,6 +457,9 @@ struct domain > unsigned pbuf_idx; > spinlock_t pbuf_lock; > > + /* Used by vmtrace domctls */ > + spinlock_t vmtrace_lock; There's no domctl involved here anymore, I think. Jan
----- 22 cze 2020 o 15:25, Jan Beulich jbeulich@suse.com napisał(a): > On 19.06.2020 01:41, Michał Leszczyński wrote: >> + >> + domain_pause(d); > > Who's the intended caller of this interface? You making it a hvm-op > suggests the guest may itself call this. But of course a guest > can't pause itself. If this is supposed to be a tools-only interface, > then you should frame it suitably in the public header and of course > you need to enforce this here (which would e.g. mean you shouldn't > use rcu_lock_domain_by_any_id()). > What should I use instead of rcu_lock_domain_by_any_id()? > Also please take a look at hvm/ioreq.c, which makes quite a bit of > use of domain_pause(). In particular I think you want to acquire > the lock only after having paused the domain. > This domain_pause() will be changed to vcpu_pause(). > Shouldn't you rather remove the MSR from the load list here? > This will be fixed. > Is any of what you do in this switch() actually legitimate without > hvm_set_vmtrace_pt_size() having got called for the guest? From > remarks elsewhere I imply you expect the param that you currently > use to be set upon domain creation time, but at the very least the > potentially big buffer should imo not get allocated up front, but > only when tracing is to actually be enabled. Wait... so you want to allocate these buffers in runtime? Previously we were talking that there is too much runtime logic and these enable/disable hypercalls should be stripped to absolute minimum. >> --- a/xen/include/public/hvm/hvm_op.h >> +++ b/xen/include/public/hvm/hvm_op.h >> @@ -382,6 +382,29 @@ struct xen_hvm_altp2m_op { >> typedef struct xen_hvm_altp2m_op xen_hvm_altp2m_op_t; >> DEFINE_XEN_GUEST_HANDLE(xen_hvm_altp2m_op_t); >> >> +/* HVMOP_vmtrace: Perform VM tracing related operation */ >> +#define HVMOP_vmtrace 26 >> + >> +#define HVMOP_VMTRACE_INTERFACE_VERSION 0x00000001 > > I'm unconvinced we want to introduce yet another versioned interface. 
> In any event, as hinted at earlier, this suggests it wants to be a > tools-only interface instead (which, at least for the time being, is > not required to be a stable interface then, but that's also something > we apparently want to move away from, and hence you may better not > try to rely on it not needing to be stable). Ok. I will remove the interface version. > >> +struct xen_hvm_vmtrace_op { >> + /* IN variable */ >> + uint32_t version; /* HVMOP_VMTRACE_INTERFACE_VERSION */ >> + uint32_t cmd; >> +/* Enable/disable external vmtrace for given domain */ >> +#define HVMOP_vmtrace_ipt_enable 1 >> +#define HVMOP_vmtrace_ipt_disable 2 >> +#define HVMOP_vmtrace_ipt_get_offset 3 >> + domid_t domain; >> + uint32_t vcpu; >> + uint64_t size; >> + >> + /* OUT variable */ >> + uint64_t offset; > > If this is to be a tools-only interface, please use uint64_aligned_t. > This type is not defined within hvm_op.h header. What should I do about it? > You also want to add an entry to xen/include/xlat.lst and use the > resulting macro to prove that the struct layout is the same for > native and compat callers. Could you tell a little bit more about this? What are "native" and "compat" callers and what is the purpose of this file? Best regards, Michał Leszczyński CERT Polska
On 22.06.2020 16:35, Michał Leszczyński wrote: > ----- 22 cze 2020 o 15:25, Jan Beulich jbeulich@suse.com napisał(a): >> On 19.06.2020 01:41, Michał Leszczyński wrote: >>> + >>> + domain_pause(d); >> >> Who's the intended caller of this interface? You making it a hvm-op >> suggests the guest may itself call this. But of course a guest >> can't pause itself. If this is supposed to be a tools-only interface, >> then you should frame it suitably in the public header and of course >> you need to enforce this here (which would e.g. mean you shouldn't >> use rcu_lock_domain_by_any_id()). >> > > What should I use instead of rcu_lock_domain_by_and_id()? Please take a look at the header where its declaration lives. It's admittedly not the usual thing in Xen, but there are even comments describing the differences between the four related by-id functions. I guess rcu_lock_live_remote_domain_by_id() is the one you want to use, despite being puzzled by there being surprisingly little uses elsewhere. >> Also please take a look at hvm/ioreq.c, which makes quite a bit of >> use of domain_pause(). In particular I think you want to acquire >> the lock only after having paused the domain. > > This domain_pause() will be changed to vcpu_pause(). And you understand that my comment then still applies? >> Shouldn't you rather remove the MSR from the load list here? > > This will be fixed. Thanks for trimming your reply, but I think you've gone too far: Context should still be such that one can see what the comments are about without having to go back to the original mail. Please try to find some middle ground. >> Is any of what you do in this switch() actually legitimate without >> hvm_set_vmtrace_pt_size() having got called for the guest? 
From >> remarks elsewhere I imply you expect the param that you currently >> use to be set upon domain creation time, but at the very least the >> potentially big buffer should imo not get allocated up front, but >> only when tracing is to actually be enabled. > > Wait... so you want to allocate these buffers in runtime? > Previously we were talking that there is too much runtime logic > and these enable/disable hypercalls should be stripped to absolute > minimum. Basic arrangements can be made at domain creation time. I don't think though that it would be a good use of memory if you allocated perhaps many gigabytes of memory just for possibly wanting to enable tracing on a guest. >>> +struct xen_hvm_vmtrace_op { >>> + /* IN variable */ >>> + uint32_t version; /* HVMOP_VMTRACE_INTERFACE_VERSION */ >>> + uint32_t cmd; >>> +/* Enable/disable external vmtrace for given domain */ >>> +#define HVMOP_vmtrace_ipt_enable 1 >>> +#define HVMOP_vmtrace_ipt_disable 2 >>> +#define HVMOP_vmtrace_ipt_get_offset 3 >>> + domid_t domain; >>> + uint32_t vcpu; >>> + uint64_t size; >>> + >>> + /* OUT variable */ >>> + uint64_t offset; >> >> If this is to be a tools-only interface, please use uint64_aligned_t. >> > > This type is not defined within hvm_op.h header. What should I do about it? It gets defined by xen.h, so should be available here. Its definitions live in a #if defined(__XEN__) || defined(__XEN_TOOLS__) section, which is what I did recommend to put your interface in as well. Unless you want this to be exposed to the guest itself, at which point further constraints would arise. >> You also want to add an entry to xen/include/xlat.lst and use the >> resulting macro to prove that the struct layout is the same for >> native and compat callers. > > Could you tell a little bit more about this? What are "native" and > "compat" callers and what is the purpose of this file? A native caller is one whose bitness matches Xen's, i.e. for x86 a guest running in 64-bit mode. 
A compat guest is one running in 32-bit mode. Interface structure layout, at least for historical reasons, can differ between the two. If it does, these structures need translation. In the case here the layouts look to match, which we still want to be verified at build time. If you add a suitable line to xlat.lst starting with a ?, a macro will be generated that you can then invoke somewhere near the code that actually handles this sub-hypercall. See e.g. the top of xen/common/hypfs.c for a relatively recent addition of such. Jan
----- 22 cze 2020 o 17:22, Jan Beulich jbeulich@suse.com napisał(a): > On 22.06.2020 16:35, Michał Leszczyński wrote: >> ----- 22 cze 2020 o 15:25, Jan Beulich jbeulich@suse.com napisał(a): >>> On 19.06.2020 01:41, Michał Leszczyński wrote: >>>> + >>>> + domain_pause(d); >>> >>> Who's the intended caller of this interface? You making it a hvm-op >>> suggests the guest may itself call this. But of course a guest >>> can't pause itself. If this is supposed to be a tools-only interface, >>> then you should frame it suitably in the public header and of course >>> you need to enforce this here (which would e.g. mean you shouldn't >>> use rcu_lock_domain_by_any_id()). >>> >> >> What should I use instead of rcu_lock_domain_by_and_id()? > > Please take a look at the header where its declaration lives. It's > admittedly not the usual thing in Xen, but there are even comments > describing the differences between the four related by-id functions. > I guess rcu_lock_live_remote_domain_by_id() is the one you want to > use, despite being puzzled by there being surprisingly little uses > elsewhere. > Ok, I will correct this. >>> Also please take a look at hvm/ioreq.c, which makes quite a bit of >>> use of domain_pause(). In particular I think you want to acquire >>> the lock only after having paused the domain. >> >> This domain_pause() will be changed to vcpu_pause(). > > And you understand that my comment then still applies? If you mean that we should first call vcpu_pause() and then acquire spinlock, then yes, this will be corrected in v3. >>> Is any of what you do in this switch() actually legitimate without >>> hvm_set_vmtrace_pt_size() having got called for the guest? From >>> remarks elsewhere I imply you expect the param that you currently >>> use to be set upon domain creation time, but at the very least the >>> potentially big buffer should imo not get allocated up front, but >>> only when tracing is to actually be enabled. >> >> Wait... 
so you want to allocate these buffers in runtime? >> Previously we were talking that there is too much runtime logic >> and these enable/disable hypercalls should be stripped to absolute >> minimum. > > Basic arrangements can be made at domain creation time. I don't > think though that it would be a good use of memory if you > allocated perhaps many gigabytes of memory just for possibly > wanting to enable tracing on a guest. > From our previous conversations I thought that you want to have as much logic moved to the domain creation as possible. Thus, a parameter "vmtrace_pt_size" was introduced. By default it's zero (= disabled), if you set it to a non-zero value, then trace buffers of given size will be allocated for the domain and you have possibility to use ipt_enable/ipt_disable at any moment. This way the runtime logic is as thin as possible. I assume user knows in advance whether he/she would want to use external monitoring with IPT or not. It's also not "many gigabytes". In most use cases a buffer of 16/32/64 MB would suffice, I think. If we want to fall back to the scenario where the trace buffer is allocated dynamically, then we basically get back to patch v1 implementation. >>>> +struct xen_hvm_vmtrace_op { >>>> + /* IN variable */ >>>> + uint32_t version; /* HVMOP_VMTRACE_INTERFACE_VERSION */ >>>> + uint32_t cmd; >>>> +/* Enable/disable external vmtrace for given domain */ >>>> +#define HVMOP_vmtrace_ipt_enable 1 >>>> +#define HVMOP_vmtrace_ipt_disable 2 >>>> +#define HVMOP_vmtrace_ipt_get_offset 3 >>>> + domid_t domain; >>>> + uint32_t vcpu; >>>> + uint64_t size; >>>> + >>>> + /* OUT variable */ >>>> + uint64_t offset; >>> >>> If this is to be a tools-only interface, please use uint64_aligned_t. >>> >> >> This type is not defined within hvm_op.h header. What should I do about it? > > It gets defined by xen.h, so should be available here. 
Its > definitions live in a > > #if defined(__XEN__) || defined(__XEN_TOOLS__) > > section, which is what I did recommend to put your interface in > as well. Unless you want this to be exposed to the guest itself, > at which point further constraints would arise. > >>> You also want to add an entry to xen/include/xlat.lst and use the >>> resulting macro to prove that the struct layout is the same for >>> native and compat callers. >> >> Could you tell a little bit more about this? What are "native" and >> "compat" callers and what is the purpose of this file? > > A native caller is one whose bitness matches Xen's, i.e. for x86 > a guest running in 64-bit mode. A compat guest is one running in > 32-bit mode. Interface structure layout, at least for historical > reasons, can differ between the two. If it does, these > structures need translation. In the case here the layouts look > to match, which we still want to be verified at build time. If > you add a suitable line to xlat.lst starting with a ?, a macro > will be generated that you can then invoke somewhere near the > code that actually handles this sub-hypercall. See e.g. the top > of xen/common/hypfs.c for a relatively recent addition of such. > Thanks for this explanation, I will try to address this. Best regards, Michał Leszczyński CERT Polska
On 22.06.2020 18:02, Michał Leszczyński wrote: > ----- 22 cze 2020 o 17:22, Jan Beulich jbeulich@suse.com napisał(a): >> On 22.06.2020 16:35, Michał Leszczyński wrote: >>> ----- 22 cze 2020 o 15:25, Jan Beulich jbeulich@suse.com napisał(a): >>>> Is any of what you do in this switch() actually legitimate without >>>> hvm_set_vmtrace_pt_size() having got called for the guest? From >>>> remarks elsewhere I imply you expect the param that you currently >>>> use to be set upon domain creation time, but at the very least the >>>> potentially big buffer should imo not get allocated up front, but >>>> only when tracing is to actually be enabled. >>> >>> Wait... so you want to allocate these buffers in runtime? >>> Previously we were talking that there is too much runtime logic >>> and these enable/disable hypercalls should be stripped to absolute >>> minimum. >> >> Basic arrangements can be made at domain creation time. I don't >> think though that it would be a good use of memory if you >> allocated perhaps many gigabytes of memory just for possibly >> wanting to enable tracing on a guest. >> > > From our previous conversations I thought that you want to have > as much logic moved to the domain creation as possible. > > Thus, a parameter "vmtrace_pt_size" was introduced. By default it's > zero (= disabled), if you set it to a non-zero value, then trace buffers > of given size will be allocated for the domain and you have possibility > to use ipt_enable/ipt_disable at any moment. > > This way the runtime logic is as thin as possible. I assume user knows > in advance whether he/she would want to use external monitoring with IPT > or not. Andrew - I think you requested movement to domain_create(). Could you clarify if indeed you mean to also allocate the big buffers this early? > It's also not "many gigabytes". In most use cases a buffer of 16/32/64 MB > would suffice, I think. But that one such buffer per vCPU, isn't it? 
Plus these buffers need to be physically contiguous, which is an additional possibly severe constraint. Jan
----- 22 cze 2020 o 18:16, Jan Beulich jbeulich@suse.com napisał(a): > On 22.06.2020 18:02, Michał Leszczyński wrote: >> ----- 22 cze 2020 o 17:22, Jan Beulich jbeulich@suse.com napisał(a): >>> On 22.06.2020 16:35, Michał Leszczyński wrote: >>>> ----- 22 cze 2020 o 15:25, Jan Beulich jbeulich@suse.com napisał(a): >>>>> Is any of what you do in this switch() actually legitimate without >>>>> hvm_set_vmtrace_pt_size() having got called for the guest? From >>>>> remarks elsewhere I imply you expect the param that you currently >>>>> use to be set upon domain creation time, but at the very least the >>>>> potentially big buffer should imo not get allocated up front, but >>>>> only when tracing is to actually be enabled. >>>> >>>> Wait... so you want to allocate these buffers in runtime? >>>> Previously we were talking that there is too much runtime logic >>>> and these enable/disable hypercalls should be stripped to absolute >>>> minimum. >>> >>> Basic arrangements can be made at domain creation time. I don't >>> think though that it would be a good use of memory if you >>> allocated perhaps many gigabytes of memory just for possibly >>> wanting to enable tracing on a guest. >>> >> >> From our previous conversations I thought that you want to have >> as much logic moved to the domain creation as possible. >> >> Thus, a parameter "vmtrace_pt_size" was introduced. By default it's >> zero (= disabled), if you set it to a non-zero value, then trace buffers >> of given size will be allocated for the domain and you have possibility >> to use ipt_enable/ipt_disable at any moment. >> >> This way the runtime logic is as thin as possible. I assume user knows >> in advance whether he/she would want to use external monitoring with IPT >> or not. > > Andrew - I think you requested movement to domain_create(). Could > you clarify if indeed you mean to also allocate the big buffers > this early? > >> It's also not "many gigabytes". 
In most use cases a buffer of 16/32/64 MB >> would suffice, I think. > > But that one such buffer per vCPU, isn't it? Plus these buffers > need to be physically contiguous, which is an additional possibly > severe constraint. Yes. For my use case (VMI stuff) I estimate 16-64 MB per vCPU and for fuzzing I think it would be even less. And also yes - these buffers need to be physically contiguous and aligned because otherwise CPU would refuse to use them. Best regards, Michał Leszczyński CERT Polska
On Mon, Jun 22, 2020 at 06:16:57PM +0200, Jan Beulich wrote: > On 22.06.2020 18:02, Michał Leszczyński wrote: > > ----- 22 cze 2020 o 17:22, Jan Beulich jbeulich@suse.com napisał(a): > >> On 22.06.2020 16:35, Michał Leszczyński wrote: > >>> ----- 22 cze 2020 o 15:25, Jan Beulich jbeulich@suse.com napisał(a): > > It's also not "many gigabytes". In most use cases a buffer of 16/32/64 MB > > would suffice, I think. > > But that one such buffer per vCPU, isn't it? Plus these buffers > need to be physically contiguous, which is an additional possibly > severe constraint. FTR, from my reading of the SDM you can use a mode called ToPA where the buffer is some kind of linked list of tables that map to output regions. That would be nice, but IMO it should be implemented in a next iteration after the basic support is in. Roger.
----- 22 cze 2020 o 18:25, Roger Pau Monné roger.pau@citrix.com napisał(a): > On Mon, Jun 22, 2020 at 06:16:57PM +0200, Jan Beulich wrote: >> On 22.06.2020 18:02, Michał Leszczyński wrote: >> > ----- 22 cze 2020 o 17:22, Jan Beulich jbeulich@suse.com napisał(a): >> >> On 22.06.2020 16:35, Michał Leszczyński wrote: >> >>> ----- 22 cze 2020 o 15:25, Jan Beulich jbeulich@suse.com napisał(a): >> > It's also not "many gigabytes". In most use cases a buffer of 16/32/64 MB >> > would suffice, I think. >> >> But that one such buffer per vCPU, isn't it? Plus these buffers >> need to be physically contiguous, which is an additional possibly >> severe constraint. > > FTR, from my reading of the SDM you can use a mode called ToPA where > the buffer is some kind of linked list of tables that map to output > regions. That would be nice, but IMO it should be implemented in a > next iteration after the basic support is in. > > Roger. Yes. I keep that in mind but right now I would like to go for the minimum viable implementation, while ToPA could be added in the next patch series. Best regards, Michał Leszczyński CERT Polska
>>>>> +struct xen_hvm_vmtrace_op { >>>>> + /* IN variable */ >>>>> + uint32_t version; /* HVMOP_VMTRACE_INTERFACE_VERSION */ >>>>> + uint32_t cmd; >>>>> +/* Enable/disable external vmtrace for given domain */ >>>>> +#define HVMOP_vmtrace_ipt_enable 1 >>>>> +#define HVMOP_vmtrace_ipt_disable 2 >>>>> +#define HVMOP_vmtrace_ipt_get_offset 3 >>>>> + domid_t domain; >>>>> + uint32_t vcpu; >>>>> + uint64_t size; >>>>> + >>>>> + /* OUT variable */ >>>>> + uint64_t offset; >>>> >>>> If this is to be a tools-only interface, please use uint64_aligned_t. >>>> >>> >>> This type is not defined within hvm_op.h header. What should I do about it? >> >> It gets defined by xen.h, so should be available here. Its >> definitions live in a >> >> #if defined(__XEN__) || defined(__XEN_TOOLS__) >> >> section, which is what I did recommend to put your interface in >> as well. Unless you want this to be exposed to the guest itself, >> at which point further constraints would arise. >> When I've put it into #if defined(__XEN__) || defined(__XEN_TOOLS__) then it complains about uint64_aligned_compat_t type missing. I also can't spot any single instance of uint64_aligned_t within this file. ml
----- 22 cze 2020 o 18:16, Jan Beulich jbeulich@suse.com napisał(a): > On 22.06.2020 18:02, Michał Leszczyński wrote: >> ----- 22 cze 2020 o 17:22, Jan Beulich jbeulich@suse.com napisał(a): >>> On 22.06.2020 16:35, Michał Leszczyński wrote: >>>> ----- 22 cze 2020 o 15:25, Jan Beulich jbeulich@suse.com napisał(a): >>>>> Is any of what you do in this switch() actually legitimate without >>>>> hvm_set_vmtrace_pt_size() having got called for the guest? From >>>>> remarks elsewhere I imply you expect the param that you currently >>>>> use to be set upon domain creation time, but at the very least the >>>>> potentially big buffer should imo not get allocated up front, but >>>>> only when tracing is to actually be enabled. >>>> >>>> Wait... so you want to allocate these buffers in runtime? >>>> Previously we were talking that there is too much runtime logic >>>> and these enable/disable hypercalls should be stripped to absolute >>>> minimum. >>> >>> Basic arrangements can be made at domain creation time. I don't >>> think though that it would be a good use of memory if you >>> allocated perhaps many gigabytes of memory just for possibly >>> wanting to enable tracing on a guest. >>> >> >> From our previous conversations I thought that you want to have >> as much logic moved to the domain creation as possible. >> >> Thus, a parameter "vmtrace_pt_size" was introduced. By default it's >> zero (= disabled), if you set it to a non-zero value, then trace buffers >> of given size will be allocated for the domain and you have possibility >> to use ipt_enable/ipt_disable at any moment. >> >> This way the runtime logic is as thin as possible. I assume user knows >> in advance whether he/she would want to use external monitoring with IPT >> or not. > > Andrew - I think you requested movement to domain_create(). Could > you clarify if indeed you mean to also allocate the big buffers > this early? 
I would like to recall what Andrew wrote few days ago: ----- 16 cze 2020 o 22:16, Andrew Cooper andrew.cooper3@citrix.com wrote: > Xen has traditionally opted for a "and turn this extra thing on > dynamically" model, but this has caused no end of security issues and > broken corner cases. > > You can see this still existing in the difference between > XEN_DOMCTL_createdomain and XEN_DOMCTL_max_vcpus, (the latter being > required to chose the number of vcpus for the domain) and we're making > good progress undoing this particular wart (before 4.13, it was > concerning easy to get Xen to fall over a NULL d->vcpu[] pointer by > issuing other hypercalls between these two). > > There is a lot of settings which should be immutable for the lifetime of > the domain, and external monitoring looks like another one of these. > Specifying it at createdomain time allows for far better runtime > behaviour (you are no longer in a situation where the first time you try > to turn tracing on, you end up with -ENOMEM because another VM booted in > the meantime and used the remaining memory), and it makes for rather > more simple code in Xen itself (at runtime, you can rely on it having > been set up properly, because a failure setting up will have killed the > domain already). > > ... > > ~Andrew according to this quote I've moved buffer allocation to the create_domain, the updated version was already sent to the list as patch v3. Best regards, Michał Leszczyński CERT Polska
On 22.06.2020 19:05, Michał Leszczyński wrote: >>>>>> +struct xen_hvm_vmtrace_op { >>>>>> + /* IN variable */ >>>>>> + uint32_t version; /* HVMOP_VMTRACE_INTERFACE_VERSION */ >>>>>> + uint32_t cmd; >>>>>> +/* Enable/disable external vmtrace for given domain */ >>>>>> +#define HVMOP_vmtrace_ipt_enable 1 >>>>>> +#define HVMOP_vmtrace_ipt_disable 2 >>>>>> +#define HVMOP_vmtrace_ipt_get_offset 3 >>>>>> + domid_t domain; >>>>>> + uint32_t vcpu; >>>>>> + uint64_t size; >>>>>> + >>>>>> + /* OUT variable */ >>>>>> + uint64_t offset; >>>>> >>>>> If this is to be a tools-only interface, please use uint64_aligned_t. >>>>> >>>> >>>> This type is not defined within hvm_op.h header. What should I do about it? >>> >>> It gets defined by xen.h, so should be available here. Its >>> definitions live in a >>> >>> #if defined(__XEN__) || defined(__XEN_TOOLS__) >>> >>> section, which is what I did recommend to put your interface in >>> as well. Unless you want this to be exposed to the guest itself, >>> at which point further constraints would arise. >>> > > When I've putted it into #if defined(__XEN__) || defined(__XEN_TOOLS__) > then it complains about uint64_aligned_compat_t type missing. Once the interface is tools-only, the requirement for compat checking isn't as strict anymore. As you can see from domctl and sysctl, there's no checking there, but to compensate we use uint64_aligned_t instead (arranging for struct layouts to match). > I also can't spot any single instance of uint64_aligned_t within > this file. That's unrelated. Jan
On 23.06.2020 03:04, Michał Leszczyński wrote: > ----- 22 cze 2020 o 18:16, Jan Beulich jbeulich@suse.com napisał(a): > >> On 22.06.2020 18:02, Michał Leszczyński wrote: >>> ----- 22 cze 2020 o 17:22, Jan Beulich jbeulich@suse.com napisał(a): >>>> On 22.06.2020 16:35, Michał Leszczyński wrote: >>>>> ----- 22 cze 2020 o 15:25, Jan Beulich jbeulich@suse.com napisał(a): >>>>>> Is any of what you do in this switch() actually legitimate without >>>>>> hvm_set_vmtrace_pt_size() having got called for the guest? From >>>>>> remarks elsewhere I imply you expect the param that you currently >>>>>> use to be set upon domain creation time, but at the very least the >>>>>> potentially big buffer should imo not get allocated up front, but >>>>>> only when tracing is to actually be enabled. >>>>> >>>>> Wait... so you want to allocate these buffers in runtime? >>>>> Previously we were talking that there is too much runtime logic >>>>> and these enable/disable hypercalls should be stripped to absolute >>>>> minimum. >>>> >>>> Basic arrangements can be made at domain creation time. I don't >>>> think though that it would be a good use of memory if you >>>> allocated perhaps many gigabytes of memory just for possibly >>>> wanting to enable tracing on a guest. >>>> >>> >>> From our previous conversations I thought that you want to have >>> as much logic moved to the domain creation as possible. >>> >>> Thus, a parameter "vmtrace_pt_size" was introduced. By default it's >>> zero (= disabled), if you set it to a non-zero value, then trace buffers >>> of given size will be allocated for the domain and you have possibility >>> to use ipt_enable/ipt_disable at any moment. >>> >>> This way the runtime logic is as thin as possible. I assume user knows >>> in advance whether he/she would want to use external monitoring with IPT >>> or not. >> >> Andrew - I think you requested movement to domain_create(). Could >> you clarify if indeed you mean to also allocate the big buffers >> this early? 
> > I would like to recall what Andrew wrote few days ago: > > ----- 16 cze 2020 o 22:16, Andrew Cooper andrew.cooper3@citrix.com wrote: >> Xen has traditionally opted for a "and turn this extra thing on >> dynamically" model, but this has caused no end of security issues and >> broken corner cases. >> >> You can see this still existing in the difference between >> XEN_DOMCTL_createdomain and XEN_DOMCTL_max_vcpus, (the latter being >> required to chose the number of vcpus for the domain) and we're making >> good progress undoing this particular wart (before 4.13, it was >> concerning easy to get Xen to fall over a NULL d->vcpu[] pointer by >> issuing other hypercalls between these two). >> >> There is a lot of settings which should be immutable for the lifetime of >> the domain, and external monitoring looks like another one of these. >> Specifying it at createdomain time allows for far better runtime >> behaviour (you are no longer in a situation where the first time you try >> to turn tracing on, you end up with -ENOMEM because another VM booted in >> the meantime and used the remaining memory), and it makes for rather >> more simple code in Xen itself (at runtime, you can rely on it having >> been set up properly, because a failure setting up will have killed the >> domain already). >> >> ... >> >> ~Andrew > > according to this quote I've moved buffer allocation to the create_domain, > the updated version was already sent to the list as patch v3. I'd still like to see an explicit confirmation by him that this use of memory is indeed what he has intended. There are much smaller amounts of memory which we allocate on demand, just to avoid allocating some without then ever using it. Jan
On 23/06/2020 09:51, Jan Beulich wrote: > On 23.06.2020 03:04, Michał Leszczyński wrote: >> ----- 22 cze 2020 o 18:16, Jan Beulich jbeulich@suse.com napisał(a): >> >>> On 22.06.2020 18:02, Michał Leszczyński wrote: >>>> ----- 22 cze 2020 o 17:22, Jan Beulich jbeulich@suse.com napisał(a): >>>>> On 22.06.2020 16:35, Michał Leszczyński wrote: >>>>>> ----- 22 cze 2020 o 15:25, Jan Beulich jbeulich@suse.com napisał(a): >>>>>>> Is any of what you do in this switch() actually legitimate without >>>>>>> hvm_set_vmtrace_pt_size() having got called for the guest? From >>>>>>> remarks elsewhere I imply you expect the param that you currently >>>>>>> use to be set upon domain creation time, but at the very least the >>>>>>> potentially big buffer should imo not get allocated up front, but >>>>>>> only when tracing is to actually be enabled. >>>>>> Wait... so you want to allocate these buffers in runtime? >>>>>> Previously we were talking that there is too much runtime logic >>>>>> and these enable/disable hypercalls should be stripped to absolute >>>>>> minimum. >>>>> Basic arrangements can be made at domain creation time. I don't >>>>> think though that it would be a good use of memory if you >>>>> allocated perhaps many gigabytes of memory just for possibly >>>>> wanting to enable tracing on a guest. >>>>> >>>> From our previous conversations I thought that you want to have >>>> as much logic moved to the domain creation as possible. >>>> >>>> Thus, a parameter "vmtrace_pt_size" was introduced. By default it's >>>> zero (= disabled), if you set it to a non-zero value, then trace buffers >>>> of given size will be allocated for the domain and you have possibility >>>> to use ipt_enable/ipt_disable at any moment. >>>> >>>> This way the runtime logic is as thin as possible. I assume user knows >>>> in advance whether he/she would want to use external monitoring with IPT >>>> or not. >>> Andrew - I think you requested movement to domain_create(). 
Could >>> you clarify if indeed you mean to also allocate the big buffers >>> this early? >> I would like to recall what Andrew wrote few days ago: >> >> ----- 16 cze 2020 o 22:16, Andrew Cooper andrew.cooper3@citrix.com wrote: >>> Xen has traditionally opted for a "and turn this extra thing on >>> dynamically" model, but this has caused no end of security issues and >>> broken corner cases. >>> >>> You can see this still existing in the difference between >>> XEN_DOMCTL_createdomain and XEN_DOMCTL_max_vcpus, (the latter being >>> required to chose the number of vcpus for the domain) and we're making >>> good progress undoing this particular wart (before 4.13, it was >>> concerning easy to get Xen to fall over a NULL d->vcpu[] pointer by >>> issuing other hypercalls between these two). >>> >>> There is a lot of settings which should be immutable for the lifetime of >>> the domain, and external monitoring looks like another one of these. >>> Specifying it at createdomain time allows for far better runtime >>> behaviour (you are no longer in a situation where the first time you try >>> to turn tracing on, you end up with -ENOMEM because another VM booted in >>> the meantime and used the remaining memory), and it makes for rather >>> more simple code in Xen itself (at runtime, you can rely on it having >>> been set up properly, because a failure setting up will have killed the >>> domain already). >>> >>> ... >>> >>> ~Andrew >> according to this quote I've moved buffer allocation to the create_domain, >> the updated version was already sent to the list as patch v3. > I'd still like to see an explicit confirmation by him that this > use of memory is indeed what he has intended. There are much smaller > amounts of memory which we allocate on demand, just to avoid > allocating some without then ever using it. PT is a debug/diagnostic tool. Its not something you'd run in production against a production VM. 
It's off by default (by virtue of having to explicitly ask to use it in the first place), and those who've asked for it don't want to be finding -ENOMEM after the domain has been running for a few seconds (or midway through the vcpus), when they inevitably want to map the rings. Those who request buffers in the first place and forget about them are not semantically different from those who ask for a silly shadow memory limit, or typo the guest memory and give it too much. It's an admin error, not a safety/correctness issue. ~Andrew
On 23.06.2020 19:24, Andrew Cooper wrote: > On 23/06/2020 09:51, Jan Beulich wrote: >> On 23.06.2020 03:04, Michał Leszczyński wrote: >>> ----- 22 cze 2020 o 18:16, Jan Beulich jbeulich@suse.com napisał(a): >>> >>>> On 22.06.2020 18:02, Michał Leszczyński wrote: >>>>> ----- 22 cze 2020 o 17:22, Jan Beulich jbeulich@suse.com napisał(a): >>>>>> On 22.06.2020 16:35, Michał Leszczyński wrote: >>>>>>> ----- 22 cze 2020 o 15:25, Jan Beulich jbeulich@suse.com napisał(a): >>>>>>>> Is any of what you do in this switch() actually legitimate without >>>>>>>> hvm_set_vmtrace_pt_size() having got called for the guest? From >>>>>>>> remarks elsewhere I imply you expect the param that you currently >>>>>>>> use to be set upon domain creation time, but at the very least the >>>>>>>> potentially big buffer should imo not get allocated up front, but >>>>>>>> only when tracing is to actually be enabled. >>>>>>> Wait... so you want to allocate these buffers in runtime? >>>>>>> Previously we were talking that there is too much runtime logic >>>>>>> and these enable/disable hypercalls should be stripped to absolute >>>>>>> minimum. >>>>>> Basic arrangements can be made at domain creation time. I don't >>>>>> think though that it would be a good use of memory if you >>>>>> allocated perhaps many gigabytes of memory just for possibly >>>>>> wanting to enable tracing on a guest. >>>>>> >>>>> From our previous conversations I thought that you want to have >>>>> as much logic moved to the domain creation as possible. >>>>> >>>>> Thus, a parameter "vmtrace_pt_size" was introduced. By default it's >>>>> zero (= disabled), if you set it to a non-zero value, then trace buffers >>>>> of given size will be allocated for the domain and you have possibility >>>>> to use ipt_enable/ipt_disable at any moment. >>>>> >>>>> This way the runtime logic is as thin as possible. I assume user knows >>>>> in advance whether he/she would want to use external monitoring with IPT >>>>> or not. 
>>>> Andrew - I think you requested movement to domain_create(). Could >>>> you clarify if indeed you mean to also allocate the big buffers >>>> this early? >>> I would like to recall what Andrew wrote few days ago: >>> >>> ----- 16 cze 2020 o 22:16, Andrew Cooper andrew.cooper3@citrix.com wrote: >>>> Xen has traditionally opted for a "and turn this extra thing on >>>> dynamically" model, but this has caused no end of security issues and >>>> broken corner cases. >>>> >>>> You can see this still existing in the difference between >>>> XEN_DOMCTL_createdomain and XEN_DOMCTL_max_vcpus, (the latter being >>>> required to chose the number of vcpus for the domain) and we're making >>>> good progress undoing this particular wart (before 4.13, it was >>>> concerning easy to get Xen to fall over a NULL d->vcpu[] pointer by >>>> issuing other hypercalls between these two). >>>> >>>> There is a lot of settings which should be immutable for the lifetime of >>>> the domain, and external monitoring looks like another one of these. >>>> Specifying it at createdomain time allows for far better runtime >>>> behaviour (you are no longer in a situation where the first time you try >>>> to turn tracing on, you end up with -ENOMEM because another VM booted in >>>> the meantime and used the remaining memory), and it makes for rather >>>> more simple code in Xen itself (at runtime, you can rely on it having >>>> been set up properly, because a failure setting up will have killed the >>>> domain already). >>>> >>>> ... >>>> >>>> ~Andrew >>> according to this quote I've moved buffer allocation to the create_domain, >>> the updated version was already sent to the list as patch v3. >> I'd still like to see an explicit confirmation by him that this >> use of memory is indeed what he has intended. There are much smaller >> amounts of memory which we allocate on demand, just to avoid >> allocating some without then ever using it. > > PT is a debug/diagnostic tool. 
It's not something you'd run in > production against a production VM. Well, the suggested use with introspection made me assume differently. If this is (now and forever) truly debug/diag only, so be it then. Jan
----- 23 cze 2020 o 19:24, Andrew Cooper andrew.cooper3@citrix.com napisał(a): > On 23/06/2020 09:51, Jan Beulich wrote: >> I'd still like to see an explicit confirmation by him that this >> use of memory is indeed what he has intended. There are much smaller >> amounts of memory which we allocate on demand, just to avoid >> allocating some without then ever using it. > > PT is a debug/diagnostic tool. Its not something you'd run in > production against a production VM. > > It's off by default (by virtue of having to explicitly ask to use it in > the first place), and those who've asked for it don't want to be finding > -ENOMEM after the domain has been running for a few seconds (or midway > through the vcpus), when they inveterately want to map the rings. > > Those who request buffers in the first place and forget about them are > not semantically different from those who ask for a silly shadow memory > limit, or typo the guest memory and give it too much. Its a admin > error, not a safety/correctness issue. > > ~Andrew Absolutely +1. Assuming that somebody wants to perform some advanced scenario and is trying to run many domains (e.g. 20), it's much better to have 19 domains working fine and 1 prematurely crashing because of -ENOMEM, rather than have all 20 domains randomly crashing in runtime because it turned out there is a shortage of memory. Best regards, Michał Leszczyński CERT Polska
On 24/06/2020 11:03, Jan Beulich wrote: > On 23.06.2020 19:24, Andrew Cooper wrote: >> On 23/06/2020 09:51, Jan Beulich wrote: >>> On 23.06.2020 03:04, Michał Leszczyński wrote: >>>> ----- 22 cze 2020 o 18:16, Jan Beulich jbeulich@suse.com napisał(a): >>>> >>>>> On 22.06.2020 18:02, Michał Leszczyński wrote: >>>>>> ----- 22 cze 2020 o 17:22, Jan Beulich jbeulich@suse.com napisał(a): >>>>>>> On 22.06.2020 16:35, Michał Leszczyński wrote: >>>>>>>> ----- 22 cze 2020 o 15:25, Jan Beulich jbeulich@suse.com napisał(a): >>>>>>>>> Is any of what you do in this switch() actually legitimate without >>>>>>>>> hvm_set_vmtrace_pt_size() having got called for the guest? From >>>>>>>>> remarks elsewhere I imply you expect the param that you currently >>>>>>>>> use to be set upon domain creation time, but at the very least the >>>>>>>>> potentially big buffer should imo not get allocated up front, but >>>>>>>>> only when tracing is to actually be enabled. >>>>>>>> Wait... so you want to allocate these buffers in runtime? >>>>>>>> Previously we were talking that there is too much runtime logic >>>>>>>> and these enable/disable hypercalls should be stripped to absolute >>>>>>>> minimum. >>>>>>> Basic arrangements can be made at domain creation time. I don't >>>>>>> think though that it would be a good use of memory if you >>>>>>> allocated perhaps many gigabytes of memory just for possibly >>>>>>> wanting to enable tracing on a guest. >>>>>>> >>>>>> From our previous conversations I thought that you want to have >>>>>> as much logic moved to the domain creation as possible. >>>>>> >>>>>> Thus, a parameter "vmtrace_pt_size" was introduced. By default it's >>>>>> zero (= disabled), if you set it to a non-zero value, then trace buffers >>>>>> of given size will be allocated for the domain and you have possibility >>>>>> to use ipt_enable/ipt_disable at any moment. >>>>>> >>>>>> This way the runtime logic is as thin as possible. 
I assume user knows >>>>>> in advance whether he/she would want to use external monitoring with IPT >>>>>> or not. >>>>> Andrew - I think you requested movement to domain_create(). Could >>>>> you clarify if indeed you mean to also allocate the big buffers >>>>> this early? >>>> I would like to recall what Andrew wrote few days ago: >>>> >>>> ----- 16 cze 2020 o 22:16, Andrew Cooper andrew.cooper3@citrix.com wrote: >>>>> Xen has traditionally opted for a "and turn this extra thing on >>>>> dynamically" model, but this has caused no end of security issues and >>>>> broken corner cases. >>>>> >>>>> You can see this still existing in the difference between >>>>> XEN_DOMCTL_createdomain and XEN_DOMCTL_max_vcpus, (the latter being >>>>> required to chose the number of vcpus for the domain) and we're making >>>>> good progress undoing this particular wart (before 4.13, it was >>>>> concerning easy to get Xen to fall over a NULL d->vcpu[] pointer by >>>>> issuing other hypercalls between these two). >>>>> >>>>> There is a lot of settings which should be immutable for the lifetime of >>>>> the domain, and external monitoring looks like another one of these. >>>>> Specifying it at createdomain time allows for far better runtime >>>>> behaviour (you are no longer in a situation where the first time you try >>>>> to turn tracing on, you end up with -ENOMEM because another VM booted in >>>>> the meantime and used the remaining memory), and it makes for rather >>>>> more simple code in Xen itself (at runtime, you can rely on it having >>>>> been set up properly, because a failure setting up will have killed the >>>>> domain already). >>>>> >>>>> ... >>>>> >>>>> ~Andrew >>>> according to this quote I've moved buffer allocation to the create_domain, >>>> the updated version was already sent to the list as patch v3. >>> I'd still like to see an explicit confirmation by him that this >>> use of memory is indeed what he has intended. 
There are much smaller >>> amounts of memory which we allocate on demand, just to avoid >>> allocating some without then ever using it. >> PT is a debug/diagnostic tool. Its not something you'd run in >> production against a production VM. > Well, the suggested use with introspection made me assume differently. > If this is (now and forever) truly debug/diag only, so be it then. At the end of the day, it is a fine grain profiling tool, meaning that it is not used in the common case. The usecase presented is for fuzzing, using the execution trace generated by the CPU, rather than the current scheme which allegedly involves shuffling breakpoints around. There is a big disclaimer with PT saying "there is a perf hit from using this". This is a consequence of the core suddenly becoming far more memory bound, and almost certainly being capable of generating trace data faster than can be written out. Even if it does find its way into some fairly custom production scenarios (fuzzing as a service?), we're still in the position where the only people who enable this will be the people intending to use it. ~Andrew
On Wed, Jun 24, 2020 at 6:40 AM Andrew Cooper <andrew.cooper3@citrix.com> wrote: > > On 24/06/2020 11:03, Jan Beulich wrote: > > On 23.06.2020 19:24, Andrew Cooper wrote: > >> On 23/06/2020 09:51, Jan Beulich wrote: > >>> On 23.06.2020 03:04, Michał Leszczyński wrote: > >>>> ----- 22 cze 2020 o 18:16, Jan Beulich jbeulich@suse.com napisał(a): > >>>> > >>>>> On 22.06.2020 18:02, Michał Leszczyński wrote: > >>>>>> ----- 22 cze 2020 o 17:22, Jan Beulich jbeulich@suse.com napisał(a): > >>>>>>> On 22.06.2020 16:35, Michał Leszczyński wrote: > >>>>>>>> ----- 22 cze 2020 o 15:25, Jan Beulich jbeulich@suse.com napisał(a): > >>>>>>>>> Is any of what you do in this switch() actually legitimate without > >>>>>>>>> hvm_set_vmtrace_pt_size() having got called for the guest? From > >>>>>>>>> remarks elsewhere I imply you expect the param that you currently > >>>>>>>>> use to be set upon domain creation time, but at the very least the > >>>>>>>>> potentially big buffer should imo not get allocated up front, but > >>>>>>>>> only when tracing is to actually be enabled. > >>>>>>>> Wait... so you want to allocate these buffers in runtime? > >>>>>>>> Previously we were talking that there is too much runtime logic > >>>>>>>> and these enable/disable hypercalls should be stripped to absolute > >>>>>>>> minimum. > >>>>>>> Basic arrangements can be made at domain creation time. I don't > >>>>>>> think though that it would be a good use of memory if you > >>>>>>> allocated perhaps many gigabytes of memory just for possibly > >>>>>>> wanting to enable tracing on a guest. > >>>>>>> > >>>>>> From our previous conversations I thought that you want to have > >>>>>> as much logic moved to the domain creation as possible. > >>>>>> > >>>>>> Thus, a parameter "vmtrace_pt_size" was introduced. 
By default it's > >>>>>> zero (= disabled), if you set it to a non-zero value, then trace buffers > >>>>>> of given size will be allocated for the domain and you have possibility > >>>>>> to use ipt_enable/ipt_disable at any moment. > >>>>>> > >>>>>> This way the runtime logic is as thin as possible. I assume user knows > >>>>>> in advance whether he/she would want to use external monitoring with IPT > >>>>>> or not. > >>>>> Andrew - I think you requested movement to domain_create(). Could > >>>>> you clarify if indeed you mean to also allocate the big buffers > >>>>> this early? > >>>> I would like to recall what Andrew wrote few days ago: > >>>> > >>>> ----- 16 cze 2020 o 22:16, Andrew Cooper andrew.cooper3@citrix.com wrote: > >>>>> Xen has traditionally opted for a "and turn this extra thing on > >>>>> dynamically" model, but this has caused no end of security issues and > >>>>> broken corner cases. > >>>>> > >>>>> You can see this still existing in the difference between > >>>>> XEN_DOMCTL_createdomain and XEN_DOMCTL_max_vcpus, (the latter being > >>>>> required to chose the number of vcpus for the domain) and we're making > >>>>> good progress undoing this particular wart (before 4.13, it was > >>>>> concerning easy to get Xen to fall over a NULL d->vcpu[] pointer by > >>>>> issuing other hypercalls between these two). > >>>>> > >>>>> There is a lot of settings which should be immutable for the lifetime of > >>>>> the domain, and external monitoring looks like another one of these. 
> >>>>> Specifying it at createdomain time allows for far better runtime > >>>>> behaviour (you are no longer in a situation where the first time you try > >>>>> to turn tracing on, you end up with -ENOMEM because another VM booted in > >>>>> the meantime and used the remaining memory), and it makes for rather > >>>>> more simple code in Xen itself (at runtime, you can rely on it having > >>>>> been set up properly, because a failure setting up will have killed the > >>>>> domain already). > >>>>> > >>>>> ... > >>>>> > >>>>> ~Andrew > >>>> according to this quote I've moved buffer allocation to the create_domain, > >>>> the updated version was already sent to the list as patch v3. > >>> I'd still like to see an explicit confirmation by him that this > >>> use of memory is indeed what he has intended. There are much smaller > >>> amounts of memory which we allocate on demand, just to avoid > >>> allocating some without then ever using it. > >> PT is a debug/diagnostic tool. Its not something you'd run in > >> production against a production VM. > > Well, the suggested use with introspection made me assume differently. > > If this is (now and forever) truly debug/diag only, so be it then. > > At the end of the day, it is a fine grain profiling tool, meaning that > it is not used in the common case. > > The usecase presented is for fuzzing, using the execution trace > generated by the CPU, rather than the current scheme which allegedly > involves shuffling breakpoints around. That's indeed the usecase we are looking to use it for. But there is also some experimental malware analysis scenarios it is targeted for which is what I believe the CERT folks are mostly interested in, like https://www.vmray.com/cyber-security-blog/back-to-the-past-using-intels-processor-trace-for-enhanced-analysis. I could also imagine this to be useful for IDS/IPS solutions like what Bitdefender is building but I'm just speculating since the overhead might be prohibitive for that use-case. 
I would consider all these scenarios experimental at this point and absolutely not something to enable by default. If someone presents a need for this to be supported on production VMs that require turning it on/off at runtime, not knowing whether the feature will be used when the domain is created, we can certainly revisit the topic. But I don't expect that to happen any time soon, if at all. Tamas
diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c index 5bb47583b3..145ad053d2 100644 --- a/xen/arch/x86/hvm/hvm.c +++ b/xen/arch/x86/hvm/hvm.c @@ -1612,6 +1612,24 @@ int hvm_vcpu_initialise(struct vcpu *v) return rc; } +void hvm_vmtrace_destroy(struct vcpu *v) +{ + unsigned int i; + struct page_info *pg; + struct ipt_state *ipt = v->arch.hvm.vmx.ipt_state; + mfn_t buf_mfn = ipt->output_base >> PAGE_SHIFT; + size_t buf_size = ipt->output_mask.size + 1; + + xfree(ipt); + v->arch.hvm.vmx.ipt_state = NULL; + + for ( i = 0; i < (buf_size >> PAGE_SHIFT); i++ ) + { + pg = mfn_to_page(_mfn(mfn_add(buf_mfn, i))); + free_domheap_page(pg); + } +} + void hvm_vcpu_destroy(struct vcpu *v) { viridian_vcpu_deinit(v); @@ -1631,6 +1649,8 @@ void hvm_vcpu_destroy(struct vcpu *v) vlapic_destroy(v); hvm_vcpu_cacheattr_destroy(v); + + hvm_vmtrace_destroy(v); } void hvm_vcpu_down(struct vcpu *v) @@ -4066,6 +4086,51 @@ static int hvmop_set_evtchn_upcall_vector( return 0; } +static int hvm_set_vmtrace_pt_size(struct domain *d, uint64_t value) +{ + void *buf; + unsigned int buf_order; + struct page_info *pg; + struct ipt_state *ipt; + struct vcpu *v; + + if ( value < PAGE_SIZE || + value > GB(4) || + ( value & (value - 1) ) ) { + /* we don't accept trace buffer size smaller than single page + * and the upper bound is defined as 4GB in the specification */ + return -EINVAL; + } + + for_each_vcpu ( d, v ) + { + buf_order = get_order_from_bytes(value); + pg = alloc_domheap_pages(d, buf_order, MEMF_no_refcount); + + if ( !pg ) + return -EFAULT; + + buf = page_to_virt(pg); + + if ( vmx_add_host_load_msr(v, MSR_RTIT_CTL, 0) ) + return -EFAULT; + + ipt = xmalloc(struct ipt_state); + + if ( !ipt ) + return -EFAULT; + + ipt->output_base = virt_to_mfn(buf) << PAGE_SHIFT; + ipt->output_mask.raw = value - 1; + ipt->status = 0; + ipt->active = 0; + + v->arch.hvm.vmx.ipt_state = ipt; + } + + return 0; +} + static int hvm_allow_set_param(struct domain *d, uint32_t index, uint64_t new_value) @@ 
-4127,6 +4192,7 @@ static int hvm_allow_set_param(struct domain *d, case HVM_PARAM_NR_IOREQ_SERVER_PAGES: case HVM_PARAM_ALTP2M: case HVM_PARAM_MCA_CAP: + case HVM_PARAM_VMTRACE_PT_SIZE: if ( value != 0 && new_value != value ) rc = -EEXIST; break; @@ -4328,6 +4394,9 @@ static int hvm_set_param(struct domain *d, uint32_t index, uint64_t value) case HVM_PARAM_MCA_CAP: rc = vmce_enable_mca_cap(d, value); break; + case HVM_PARAM_VMTRACE_PT_SIZE: + rc = hvm_set_vmtrace_pt_size(d, value); + break; } if ( !rc ) @@ -4949,6 +5018,100 @@ static int compat_altp2m_op( return rc; } +static int do_vmtrace_op(XEN_GUEST_HANDLE_PARAM(void) arg) +{ + struct xen_hvm_vmtrace_op a; + struct domain *d; + int rc; + struct vcpu *v; + struct ipt_state *ipt; + + if ( !hvm_pt_supported() ) + return -EOPNOTSUPP; + + if ( copy_from_guest(&a, arg, 1) ) + return -EFAULT; + + if ( a.version != HVMOP_VMTRACE_INTERFACE_VERSION ) + return -EINVAL; + + d = rcu_lock_domain_by_any_id(a.domain); + spin_lock(&d->vmtrace_lock); + + if ( d == NULL ) + return -ESRCH; + + if ( !is_hvm_domain(d) ) + { + rc = -EOPNOTSUPP; + goto out; + } + + domain_pause(d); + + if ( a.vcpu >= d->max_vcpus ) + { + rc = -EINVAL; + goto out; + } + + v = d->vcpu[a.vcpu]; + ipt = v->arch.hvm.vmx.ipt_state; + + if ( !ipt ) + { + /* + * PT must be first initialized upon domain creation. 
+ */ + rc = -EINVAL; + goto out; + } + + switch ( a.cmd ) + { + case HVMOP_vmtrace_ipt_enable: + if ( vmx_add_guest_msr(v, MSR_RTIT_CTL, + RTIT_CTL_TRACEEN | RTIT_CTL_OS | + RTIT_CTL_USR | RTIT_CTL_BRANCH_EN) ) + { + rc = -EFAULT; + goto out; + } + + ipt->active = 1; + break; + case HVMOP_vmtrace_ipt_disable: + if ( vmx_add_guest_msr(v, MSR_RTIT_CTL, 0) ) + { + rc = -EFAULT; + goto out; + } + + ipt->active = 0; + break; + case HVMOP_vmtrace_ipt_get_offset: + a.offset = ipt->output_mask.offset; + break; + default: + rc = -EOPNOTSUPP; + goto out; + } + + rc = -EFAULT; + if ( __copy_to_guest(arg, &a, 1) ) + goto out; + rc = 0; + + out: + domain_unpause(d); + spin_unlock(&d->vmtrace_lock); + rcu_unlock_domain(d); + + return rc; +} + +DEFINE_XEN_GUEST_HANDLE(compat_hvm_vmtrace_op_t); + static int hvmop_get_mem_type( XEN_GUEST_HANDLE_PARAM(xen_hvm_get_mem_type_t) arg) { @@ -5101,6 +5264,10 @@ long do_hvm_op(unsigned long op, XEN_GUEST_HANDLE_PARAM(void) arg) rc = current->hcall_compat ? compat_altp2m_op(arg) : do_altp2m_op(arg); break; + case HVMOP_vmtrace: + rc = do_vmtrace_op(arg); + break; + default: { gdprintk(XENLOG_DEBUG, "Bad HVM op %ld.\n", op); diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c index ab19d9424e..51f0046483 100644 --- a/xen/arch/x86/hvm/vmx/vmx.c +++ b/xen/arch/x86/hvm/vmx/vmx.c @@ -508,11 +508,25 @@ static void vmx_restore_host_msrs(void) static void vmx_save_guest_msrs(struct vcpu *v) { + uint64_t rtit_ctl; + /* * We cannot cache SHADOW_GS_BASE while the VCPU runs, as it can * be updated at any time via SWAPGS, which we cannot trap. 
*/ v->arch.hvm.vmx.shadow_gs = rdgsshadow(); + + if ( unlikely(v->arch.hvm.vmx.ipt_state && v->arch.hvm.vmx.ipt_state->active) ) + { + smp_rmb(); + rdmsrl(MSR_RTIT_CTL, rtit_ctl); + + if ( rtit_ctl & RTIT_CTL_TRACEEN ) + BUG(); + + rdmsrl(MSR_RTIT_STATUS, v->arch.hvm.vmx.ipt_state->status); + rdmsrl(MSR_RTIT_OUTPUT_MASK, v->arch.hvm.vmx.ipt_state->output_mask.raw); + } } static void vmx_restore_guest_msrs(struct vcpu *v) @@ -524,6 +538,16 @@ static void vmx_restore_guest_msrs(struct vcpu *v) if ( cpu_has_msr_tsc_aux ) wrmsr_tsc_aux(v->arch.msrs->tsc_aux); + + if ( unlikely(v->arch.hvm.vmx.ipt_state && v->arch.hvm.vmx.ipt_state->active) ) + { + wrmsrl(MSR_RTIT_OUTPUT_BASE, + v->arch.hvm.vmx.ipt_state->output_base); + wrmsrl(MSR_RTIT_OUTPUT_MASK, + v->arch.hvm.vmx.ipt_state->output_mask.raw); + wrmsrl(MSR_RTIT_STATUS, + v->arch.hvm.vmx.ipt_state->status); + } } void vmx_update_cpu_exec_control(struct vcpu *v) diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c index e376fc7e8f..e4658dc27f 100644 --- a/xen/arch/x86/mm.c +++ b/xen/arch/x86/mm.c @@ -4624,6 +4624,43 @@ int arch_acquire_resource(struct domain *d, unsigned int type, } break; } + + case XENMEM_resource_vmtrace_buf: + { + mfn_t mfn; + unsigned int i; + struct ipt_state *ipt; + + if ( id >= d->max_vcpus ) + { + rc = -EINVAL; + break; + } + + ipt = d->vcpu[id]->arch.hvm.vmx.ipt_state; + + if ( !ipt ) + { + rc = -EINVAL; + break; + } + + mfn = mfn_x(ipt->output_base >> PAGE_SHIFT); + + if (nr_frames > ( ( ipt->output_mask.size + 1 ) >> PAGE_SHIFT )) + { + rc = -EINVAL; + break; + } + + rc = 0; + for ( i = 0; i < nr_frames; i++ ) + { + mfn_list[i] = mfn_add(mfn, i); + } + + break; + } #endif default: diff --git a/xen/common/domain.c b/xen/common/domain.c index 7cc9526139..6f6458cd5b 100644 --- a/xen/common/domain.c +++ b/xen/common/domain.c @@ -414,6 +414,7 @@ struct domain *domain_create(domid_t domid, d->shutdown_code = SHUTDOWN_CODE_INVALID; spin_lock_init(&d->pbuf_lock); + spin_lock_init(&d->vmtrace_lock); 
rwlock_init(&d->vnuma_rwlock); diff --git a/xen/include/asm-x86/hvm/vmx/vmcs.h b/xen/include/asm-x86/hvm/vmx/vmcs.h index 4c81093aba..9fc4653777 100644 --- a/xen/include/asm-x86/hvm/vmx/vmcs.h +++ b/xen/include/asm-x86/hvm/vmx/vmcs.h @@ -104,6 +104,19 @@ struct pi_blocking_vcpu { spinlock_t *lock; }; +struct ipt_state { + uint64_t active; + uint64_t status; + uint64_t output_base; + union { + uint64_t raw; + struct { + uint32_t size; + uint32_t offset; + }; + } output_mask; +}; + struct vmx_vcpu { /* Physical address of VMCS. */ paddr_t vmcs_pa; @@ -186,6 +199,9 @@ struct vmx_vcpu { * pCPU and wakeup the related vCPU. */ struct pi_blocking_vcpu pi_blocking; + + /* State of Intel Processor Trace feature */ + struct ipt_state *ipt_state; }; int vmx_create_vmcs(struct vcpu *v); diff --git a/xen/include/public/hvm/hvm_op.h b/xen/include/public/hvm/hvm_op.h index 870ec52060..8cd0b6ea49 100644 --- a/xen/include/public/hvm/hvm_op.h +++ b/xen/include/public/hvm/hvm_op.h @@ -382,6 +382,29 @@ struct xen_hvm_altp2m_op { typedef struct xen_hvm_altp2m_op xen_hvm_altp2m_op_t; DEFINE_XEN_GUEST_HANDLE(xen_hvm_altp2m_op_t); +/* HVMOP_vmtrace: Perform VM tracing related operation */ +#define HVMOP_vmtrace 26 + +#define HVMOP_VMTRACE_INTERFACE_VERSION 0x00000001 + +struct xen_hvm_vmtrace_op { + /* IN variable */ + uint32_t version; /* HVMOP_VMTRACE_INTERFACE_VERSION */ + uint32_t cmd; +/* Enable/disable external vmtrace for given domain */ +#define HVMOP_vmtrace_ipt_enable 1 +#define HVMOP_vmtrace_ipt_disable 2 +#define HVMOP_vmtrace_ipt_get_offset 3 + domid_t domain; + uint32_t vcpu; + uint64_t size; + + /* OUT variable */ + uint64_t offset; +}; +typedef struct xen_hvm_vmtrace_op xen_hvm_vmtrace_op_t; +DEFINE_XEN_GUEST_HANDLE(xen_hvm_vmtrace_op_t); + #endif /* __XEN_PUBLIC_HVM_HVM_OP_H__ */ /* diff --git a/xen/include/public/hvm/params.h b/xen/include/public/hvm/params.h index 0a91bfa749..adbc7e5d08 100644 --- a/xen/include/public/hvm/params.h +++ b/xen/include/public/hvm/params.h 
@@ -300,6 +300,9 @@
 #define XEN_HVM_MCA_CAP_LMCE   (xen_mk_ullong(1) << 0)
 #define XEN_HVM_MCA_CAP_MASK   XEN_HVM_MCA_CAP_LMCE
 
-#define HVM_NR_PARAMS 39
+/* VM trace capabilities */
+#define HVM_PARAM_VMTRACE_PT_SIZE 39
+
+#define HVM_NR_PARAMS 40
 
 #endif /* __XEN_PUBLIC_HVM_PARAMS_H__ */
diff --git a/xen/include/public/memory.h b/xen/include/public/memory.h
index dbd35305df..f823c784c3 100644
--- a/xen/include/public/memory.h
+++ b/xen/include/public/memory.h
@@ -620,6 +620,7 @@ struct xen_mem_acquire_resource {
 
 #define XENMEM_resource_ioreq_server 0
 #define XENMEM_resource_grant_table 1
+#define XENMEM_resource_vmtrace_buf 2
 
 /*
  * IN - a type-specific resource identifier, which must be zero
diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
index ac53519d7f..b3a36f3788 100644
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -457,6 +457,9 @@ struct domain
     unsigned pbuf_idx;
     spinlock_t pbuf_lock;
 
+    /* Serializes vmtrace hypercalls (HVMOP_vmtrace), not domctls. */
+    spinlock_t vmtrace_lock;
+
     /* OProfile support. */
     struct xenoprof *xenoprof;
Provide an interface for privileged domains to manage external IPT monitoring. Guest IPT state will be preserved across vmentry/vmexit using ipt_state structure. Signed-off-by: Michal Leszczynski <michal.leszczynski@cert.pl> --- xen/arch/x86/hvm/hvm.c | 167 +++++++++++++++++++++++++++++ xen/arch/x86/hvm/vmx/vmx.c | 24 +++++ xen/arch/x86/mm.c | 37 +++++++ xen/common/domain.c | 1 + xen/include/asm-x86/hvm/vmx/vmcs.h | 16 +++ xen/include/public/hvm/hvm_op.h | 23 ++++ xen/include/public/hvm/params.h | 5 +- xen/include/public/memory.h | 1 + xen/include/xen/sched.h | 3 + 9 files changed, 276 insertions(+), 1 deletion(-)