diff mbox

[RFC,1/6] kvm: add device control API

Message ID 1360820960-12537-2-git-send-email-scottwood@freescale.com (mailing list archive)
State New, archived
Headers show

Commit Message

Scott Wood Feb. 14, 2013, 5:49 a.m. UTC
Currently, devices that are emulated inside KVM are configured in a
hardcoded manner based on an assumption that any given architecture
only has one way to do it.  If there's any need to access device state,
it is done through inflexible one-purpose-only IOCTLs (e.g.
KVM_GET/SET_LAPIC).  Defining new IOCTLs for every little thing is
cumbersome and depletes a limited numberspace.

This API provides a mechanism to instantiate a device of a certain
type, returning an ID that can be used to set/get attributes of the
device.  Attributes may include configuration parameters (e.g.
register base address), device state, operational commands, etc.  It
is similar to the ONE_REG API, except that it acts on devices rather
than vcpus.

Both device types and individual attributes can be tested without having
to create the device or get/set the attribute, without the need for
separately managing enumerated capabilities.

Signed-off-by: Scott Wood <scottwood@freescale.com>
---
 Documentation/virtual/kvm/api.txt        |   76 ++++++++++++++++++
 Documentation/virtual/kvm/devices/README |    1 +
 include/linux/kvm_host.h                 |   21 +++++
 include/uapi/linux/kvm.h                 |   25 ++++++
 virt/kvm/kvm_main.c                      |  127 ++++++++++++++++++++++++++++++
 5 files changed, 250 insertions(+)
 create mode 100644 Documentation/virtual/kvm/devices/README

Comments

Gleb Natapov Feb. 18, 2013, 12:21 p.m. UTC | #1
Copying Christoffer since ARM has in kernel irq chip too.

On Wed, Feb 13, 2013 at 11:49:15PM -0600, Scott Wood wrote:
> Currently, devices that are emulated inside KVM are configured in a
> hardcoded manner based on an assumption that any given architecture
> only has one way to do it.  If there's any need to access device state,
> it is done through inflexible one-purpose-only IOCTLs (e.g.
> KVM_GET/SET_LAPIC).  Defining new IOCTLs for every little thing is
> cumbersome and depletes a limited numberspace.
> 
> This API provides a mechanism to instantiate a device of a certain
> type, returning an ID that can be used to set/get attributes of the
> device.  Attributes may include configuration parameters (e.g.
> register base address), device state, operational commands, etc.  It
> is similar to the ONE_REG API, except that it acts on devices rather
> than vcpus.
You are not only provide different way to create in kernel irq chip you
also use an alternate way to trigger interrupt lines. Before going into
interface specifics lets think about whether it is really worth it? x86
obviously support old way and will have to for some, very long, time.
ARM vGIC code, that is ready to go upstream, uses old way too. So it will
be 2 archs against one. Christoffer do you think the proposed way it
better for your needs. Are you willing to make vGIC use it?

Scott, what other devices are you planning to support with this
interface?

--
			Gleb.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoffer Dall Feb. 19, 2013, 12:43 a.m. UTC | #2
On Mon, Feb 18, 2013 at 3:01 PM, Scott Wood <scottwood@freescale.com> wrote:
> On 02/18/2013 06:21:59 AM, Gleb Natapov wrote:
>>
>> Copying Christoffer since ARM has in kernel irq chip too.
>>
>> On Wed, Feb 13, 2013 at 11:49:15PM -0600, Scott Wood wrote:
>> > Currently, devices that are emulated inside KVM are configured in a
>> > hardcoded manner based on an assumption that any given architecture
>> > only has one way to do it.  If there's any need to access device state,
>> > it is done through inflexible one-purpose-only IOCTLs (e.g.
>> > KVM_GET/SET_LAPIC).  Defining new IOCTLs for every little thing is
>> > cumbersome and depletes a limited numberspace.
>> >
>> > This API provides a mechanism to instantiate a device of a certain
>> > type, returning an ID that can be used to set/get attributes of the
>> > device.  Attributes may include configuration parameters (e.g.
>> > register base address), device state, operational commands, etc.  It
>> > is similar to the ONE_REG API, except that it acts on devices rather
>> > than vcpus.
>> You are not only provide different way to create in kernel irq chip you
>> also use an alternate way to trigger interrupt lines. Before going into
>> interface specifics lets think about whether it is really worth it?
>
>
> Which "it" do you mean here?
>
> The ability to set/get attributes is needed.  Sorry, but "get or set one
> blob of data, up to 512 bytes, for the entire irqchip" is just not good
> enough -- assuming you don't want us to start sticking pointers and commands
> in *that* data. :-)
>
> If you mean the way to inject interrupts, it's simpler this way.  Why go out
> of our way to inject common glue code into a communication path between
> hw/kvm/mpic.c in QEMU and arch/powerpc/kvm/mpic.c in KVM?  Or rather, why
> make that common glue be specific to this one function when we could reuse
> the same communication glue used for other things, such as device state?
>
> And that's just for regular interrupts.  MSIs are vastly simpler on MPIC
> than what x86 does.
>
>
>> x86 obviously support old way and will have to for some, very long, time.
>
>
> Sure.
>
>
>> ARM vGIC code, that is ready to go upstream, uses old way too. So it will
>> be 2 archs against one.
>
>
> I wasn't aware that that's how it worked. :-P
>
> I was trying to be considerate by not making the entire thing gratuitously
> PPC or MPIC specific, as some others seem inclined to do (e.g. see irqfd and
> APIC).  We already had a discussion on ARM's "set address" ioctl and rather
> than extend *that* interface, they preferred to just stick something
> ARM-specific in ASAP with the understanding that it would be replaced (or
> more accurately, kept around as a thin wrapper around the new stuff) later.
>
>
>> Christoffer do you think the proposed way it
>> better for your needs. Are you willing to make vGIC use it?
>>

Well it won't improve functionality much from the current hardware
point of view, but the proposed interface is superior to what we have
now. Adding and coordinating new interfaces is indeed a pain, so a
generic interface which is flexible enough to cater for a certain
group of needs, is welcome imho. And this does seem to fit the bill.

I can imagine that if there's support for a future ARM gic version or
if we add in-kernel support for other stuff on ARM, then this
interface will be useful, and in fact using the current interface to
support two separate, but similar, interrupt controllers could get
messy from a user space point of view.

I am definitely willing to change to use this interface, the agreement
on the KVM_ARM_SET_DEVICE_ADDR ioctl was exactly because of this.

I had some nits on the RFC, which I'll send separately.

>> Scott, what other devices are you planning to support with this
>> interface?
>
>
> At the moment I do not have plans for other devices, though what does it
> hurt for the capability to be there?
>
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoffer Dall Feb. 19, 2013, 12:44 a.m. UTC | #3
On Wed, Feb 13, 2013 at 9:49 PM, Scott Wood <scottwood@freescale.com> wrote:
> Currently, devices that are emulated inside KVM are configured in a
> hardcoded manner based on an assumption that any given architecture
> only has one way to do it.  If there's any need to access device state,
> it is done through inflexible one-purpose-only IOCTLs (e.g.
> KVM_GET/SET_LAPIC).  Defining new IOCTLs for every little thing is
> cumbersome and depletes a limited numberspace.
>
> This API provides a mechanism to instantiate a device of a certain
> type, returning an ID that can be used to set/get attributes of the
> device.  Attributes may include configuration parameters (e.g.
> register base address), device state, operational commands, etc.  It
> is similar to the ONE_REG API, except that it acts on devices rather
> than vcpus.
>
> Both device types and individual attributes can be tested without having
> to create the device or get/set the attribute, without the need for
> separately managing enumerated capabilities.
>
> Signed-off-by: Scott Wood <scottwood@freescale.com>
> ---
>  Documentation/virtual/kvm/api.txt        |   76 ++++++++++++++++++
>  Documentation/virtual/kvm/devices/README |    1 +
>  include/linux/kvm_host.h                 |   21 +++++
>  include/uapi/linux/kvm.h                 |   25 ++++++
>  virt/kvm/kvm_main.c                      |  127 ++++++++++++++++++++++++++++++
>  5 files changed, 250 insertions(+)
>  create mode 100644 Documentation/virtual/kvm/devices/README
>
> diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
> index c2534c3..5bcdb42 100644
> --- a/Documentation/virtual/kvm/api.txt
> +++ b/Documentation/virtual/kvm/api.txt
> @@ -2122,6 +2122,82 @@ header; first `n_valid' valid entries with contents from the data
>  written, then `n_invalid' invalid entries, invalidating any previously
>  valid entries found.
>
> +4.79 KVM_CREATE_DEVICE
> +
> +Capability: KVM_CAP_DEVICE_CTRL
> +Type: vm ioctl
> +Parameters: struct kvm_create_device (in/out)
> +Returns: 0 on success, -1 on error
> +Errors:
> +  ENODEV: The device type is unknown or unsupported
> +  EEXIST: Device already created, and this type of device may not
> +          be instantiated multiple times
> +  ENOSPC: Too many devices have been created
> +
> +  Other error conditions may be defined by individual device types.
> +
> +Creates an emulated device in the kernel.  The returned handle
> +can be used with KVM_SET/GET_DEVICE_ATTR.
> +
> +If the KVM_CREATE_DEVICE_TEST flag is set, only test whether the
> +device type is supported (not necessarily whether it can be created
> +in the current vm).
> +
> +Individual devices should not define flags.  Attributes should be used
> +for specifying any behavior that is not implied by the device type
> +number.
> +
> +struct kvm_create_device {
> +       __u32   type;   /* in: KVM_DEV_TYPE_xxx */
> +       __u32   id;     /* out: device handle */
> +       __u32   flags;  /* in: KVM_CREATE_DEVICE_xxx */
> +};
> +
> +4.80 KVM_SET_DEVICE_ATTR/KVM_GET_DEVICE_ATTR
> +
> +Capability: KVM_CAP_DEVICE_CTRL
> +Type: vm ioctl
> +Parameters: struct kvm_device_attr
> +Returns: 0 on success, -1 on error
> +Errors:
> +  ENODEV: The device id is invalid
> +  ENXIO:  The group or attribute is unknown/unsupported for this device
> +  EPERM:  The attribute cannot (currently) be accessed this way
> +          (e.g. read-only attribute, or attribute that only makes
> +          sense when the device is in a different state)
> +
> +  Other error conditions may be defined by individual device types.
> +
> +Gets/sets a specified piece of device configuration and/or state.  The
> +semantics are device-specific except for certain global attributes.  See
> +individual device documentation in the "devices" directory.  As with
> +ONE_REG, the size of the data transferred is defined by the particular
> +attribute.
> +
> +Attributes in group KVM_DEV_ATTR_COMMON are not device-specific:
> +   KVM_DEV_ATTR_TYPE (ro, 32-bit): the device type passed to KVM_CREATE_DEVICE
> +
> +struct kvm_device_attr {
> +       __u32   dev;            /* id from KVM_CREATE_DEVICE */
> +       __u32   group;          /* KVM_DEV_ATTR_COMMON or device-defined */
> +       __u64   attr;           /* group-defined */
> +       __u64   addr;           /* userspace address of attr data */
> +};
> +
> +4.81 KVM_HAS_DEVICE_ATTR
> +
> +Capability: KVM_CAP_DEVICE_CTRL
> +Type: vm ioctl
> +Parameters: struct kvm_device_attr
> +Returns: 0 on success, -1 on error
> +Errors:
> +  ENODEV: The device id is invalid
> +  ENXIO:  The group or attribute is unknown/unsupported for this device
> +
> +Tests whether a device supports a particular attribute.  A successful
> +return indicates the attribute is implemented.  It does not necessarily
> +indicate that the attribute can be read or written in the device's
> +current state.  "addr" is ignored.
>
>  5. The kvm_run structure
>  ------------------------
> diff --git a/Documentation/virtual/kvm/devices/README b/Documentation/virtual/kvm/devices/README
> new file mode 100644
> index 0000000..34a6983
> --- /dev/null
> +++ b/Documentation/virtual/kvm/devices/README
> @@ -0,0 +1 @@
> +This directory contains specific device bindings for KVM_CAP_DEVICE_CTRL.
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 0350e0d..dbaf012 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -335,6 +335,25 @@ struct kvm_memslots {
>         short id_to_index[KVM_MEM_SLOTS_NUM];
>  };
>
> +/*
> + * The worst case number of simultaneous devices will likely be very low
> + * (usually zero or one) for the forseeable future.  If the worst case
> + * exceeds this, then it can be increased, or we can convert to idr.
> + */

This comment is on the heavy side (if at all needed). If you want to
remind people of idr, just put that in a single line. A define is a
define is a define.

> +#define KVM_MAX_DEVICES 4
> +
> +struct kvm_device {
> +       u32 type;
> +
> +       int (*set_attr)(struct kvm *kvm, struct kvm_device *dev,
> +                       struct kvm_device_attr *attr);
> +       int (*get_attr)(struct kvm *kvm, struct kvm_device *dev,
> +                       struct kvm_device_attr *attr);
> +       int (*has_attr)(struct kvm *kvm, struct kvm_device *dev,
> +                       struct kvm_device_attr *attr);
> +       void (*destroy)(struct kvm *kvm, struct kvm_device *dev);
> +};
> +
>  struct kvm {
>         spinlock_t mmu_lock;
>         struct mutex slots_lock;
> @@ -385,6 +404,8 @@ struct kvm {
>         long mmu_notifier_count;
>  #endif
>         long tlbs_dirty;
> +       struct kvm_device *devices[KVM_MAX_DEVICES];
> +       unsigned int num_devices;
>  };
>
>  #define kvm_err(fmt, ...) \
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 9a2db57..1f348e0 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -662,6 +662,7 @@ struct kvm_ppc_smmu_info {
>  #define KVM_CAP_PPC_HTAB_FD 84
>  #define KVM_CAP_S390_CSS_SUPPORT 85
>  #define KVM_CAP_PPC_EPR 86
> +#define KVM_CAP_DEVICE_CTRL 87
>
>  #ifdef KVM_CAP_IRQ_ROUTING
>
> @@ -890,6 +891,30 @@ struct kvm_s390_ucas_mapping {
>  /* Available with KVM_CAP_PPC_HTAB_FD */
>  #define KVM_PPC_GET_HTAB_FD      _IOW(KVMIO,  0xaa, struct kvm_get_htab_fd)
>
> +/* Available with KVM_CAP_DEVICE_CTRL */
> +#define KVM_CREATE_DEVICE_TEST         1
> +
> +struct kvm_create_device {
> +       __u32   type;   /* in: KVM_DEV_TYPE_xxx */
> +       __u32   id;     /* out: device handle */
> +       __u32   flags;  /* in: KVM_CREATE_DEVICE_xxx */
> +};
> +
> +struct kvm_device_attr {
> +       __u32   dev;            /* id from KVM_CREATE_DEVICE */
> +       __u32   group;          /* KVM_DEV_ATTR_COMMON or device-defined */
> +       __u64   attr;           /* group-defined */
> +       __u64   addr;           /* userspace address of attr data */
> +};
> +
> +#define KVM_DEV_ATTR_COMMON            0
> +#define   KVM_DEV_ATTR_TYPE            0 /* 32-bit */
> +
> +#define KVM_CREATE_DEVICE        _IOWR(KVMIO,  0xac, struct kvm_create_device)
> +#define KVM_SET_DEVICE_ATTR      _IOW(KVMIO,  0xad, struct kvm_device_attr)
> +#define KVM_GET_DEVICE_ATTR      _IOW(KVMIO,  0xae, struct kvm_device_attr)

_IOWR ?

> +#define KVM_HAS_DEVICE_ATTR      _IOW(KVMIO,  0xaf, struct kvm_device_attr)
> +
>  /*
>   * ioctls for vcpu fds
>   */
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 2e93630..baf8481 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -580,6 +580,18 @@ void kvm_free_physmem(struct kvm *kvm)
>         kfree(kvm->memslots);
>  }
>
> +static void kvm_destroy_devices(struct kvm *kvm)
> +{
> +       int i;
> +
> +       for (i = 0; i < kvm->num_devices; i++) {
> +               kvm->devices[i]->destroy(kvm, kvm->devices[i]);
> +               kvm->devices[i] = NULL;
> +       }
> +
> +       kvm->num_devices = 0;
> +}
> +
>  static void kvm_destroy_vm(struct kvm *kvm)
>  {
>         int i;
> @@ -590,6 +602,7 @@ static void kvm_destroy_vm(struct kvm *kvm)
>         list_del(&kvm->vm_list);
>         raw_spin_unlock(&kvm_lock);
>         kvm_free_irq_routing(kvm);
> +       kvm_destroy_devices(kvm);
>         for (i = 0; i < KVM_NR_BUSES; i++)
>                 kvm_io_bus_destroy(kvm->buses[i]);
>         kvm_coalesced_mmio_free(kvm);
> @@ -2178,6 +2191,86 @@ out:
>  }
>  #endif
>
> +static int kvm_ioctl_create_device(struct kvm *kvm,
> +                                  struct kvm_create_device *cd)
> +{
> +       struct kvm_device *dev = NULL;
> +       bool test = cd->flags & KVM_CREATE_DEVICE_TEST;
> +       int id;
> +       int r;
> +
> +       mutex_lock(&kvm->lock);
> +
> +       id = kvm->num_devices;
> +       if (id >= KVM_MAX_DEVICES && !test) {
> +               r = -ENOSPC;
> +               goto out;
> +       }
> +
> +       switch (cd->type) {
> +       default:
> +               r = -ENODEV;
> +               goto out;
> +       }

do we really believe that there will be any arch-generic recognition
of types? shouldn't this be a call to an arch-specific function
instead. Which makes me wonder whether the device type IDs should be
arch specific as well...

> +
> +       if (test) {
> +               WARN_ON_ONCE(dev);
> +               goto out;
> +       }
> +
> +       if (r == 0) {
> +               WARN_ON_ONCE(dev->type != cd->type);
> +
> +               kvm->devices[id] = dev;
> +               cd->id = id;
> +               kvm->num_devices++;
> +       }
> +
> +out:
> +       mutex_unlock(&kvm->lock);
> +       return r;
> +}
> +
> +static int kvm_ioctl_device_attr(struct kvm *kvm, int ioctl,
> +                                struct kvm_device_attr *attr)
> +{
> +       struct kvm_device *dev;
> +       int (*accessor)(struct kvm *kvm, struct kvm_device *dev,
> +                       struct kvm_device_attr *attr);
> +
> +       if (attr->dev >= KVM_MAX_DEVICES)
> +               return -ENODEV;
> +
> +       dev = kvm->devices[attr->dev];
> +       if (!dev)
> +               return -ENODEV;
> +
> +       switch (ioctl) {
> +       case KVM_SET_DEVICE_ATTR:
> +               if (attr->group == KVM_DEV_ATTR_COMMON &&
> +                   attr->attr == KVM_DEV_ATTR_TYPE)
> +                       return -EPERM;
> +
> +               accessor = dev->set_attr;
> +               break;
> +       case KVM_GET_DEVICE_ATTR:
> +               if (attr->group == KVM_DEV_ATTR_COMMON &&
> +                   attr->attr == KVM_DEV_ATTR_TYPE)
> +                       return dev->type;
> +
> +               accessor = dev->get_attr;
> +               break;
> +       case KVM_HAS_DEVICE_ATTR:
> +               accessor = dev->has_attr;
> +               break;
> +       }
> +
> +       if (!accessor)
> +               return -EPERM;
> +
> +       return accessor(kvm, dev, attr);
> +}
> +
>  static long kvm_vm_ioctl(struct file *filp,
>                            unsigned int ioctl, unsigned long arg)
>  {
> @@ -2292,6 +2385,40 @@ static long kvm_vm_ioctl(struct file *filp,
>                 break;
>         }
>  #endif
> +       case KVM_CREATE_DEVICE: {
> +               struct kvm_create_device cd;
> +
> +               r = -EFAULT;
> +               if (copy_from_user(&cd, argp, sizeof(cd)))
> +                       goto out;
> +
> +               r = kvm_ioctl_create_device(kvm, &cd);
> +               if (r)
> +                       goto out;
> +
> +               r = -EFAULT;
> +               if (copy_to_user(argp, &cd, sizeof(cd)))
> +                       goto out;
> +
> +               r = 0;
> +               break;
> +       }
> +       case KVM_SET_DEVICE_ATTR:
> +       case KVM_GET_DEVICE_ATTR:
> +       case KVM_HAS_DEVICE_ATTR: {
> +               struct kvm_device_attr attr;
> +
> +               r = -EFAULT;
> +               if (copy_from_user(&attr, argp, sizeof(attr)))
> +                       goto out;
> +
> +               r = kvm_ioctl_device_attr(kvm, ioctl, &attr);
> +               if (r)
> +                       goto out;
> +
> +               r = 0;
> +               break;
> +       }
>         default:
>                 r = kvm_arch_vm_ioctl(filp, ioctl, arg);
>                 if (r == -ENOTTY)
> --
> 1.7.9.5
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoffer Dall Feb. 19, 2013, 5:50 a.m. UTC | #4
On Mon, Feb 18, 2013 at 4:53 PM, Scott Wood <scottwood@freescale.com> wrote:
> On 02/18/2013 06:44:20 PM, Christoffer Dall wrote:
>>
>> On Wed, Feb 13, 2013 at 9:49 PM, Scott Wood <scottwood@freescale.com>
>> wrote:
>> > index 0350e0d..dbaf012 100644
>> > --- a/include/linux/kvm_host.h
>> > +++ b/include/linux/kvm_host.h
>> > @@ -335,6 +335,25 @@ struct kvm_memslots {
>> >         short id_to_index[KVM_MEM_SLOTS_NUM];
>> >  };
>> >
>> > +/*
>> > + * The worst case number of simultaneous devices will likely be very
>> > low
>> > + * (usually zero or one) for the forseeable future.  If the worst case
>> > + * exceeds this, then it can be increased, or we can convert to idr.
>> > + */
>>
>> This comment is on the heavy side (if at all needed). If you want to
>> remind people of idr, just put that in a single line. A define is a
>> define is a define.
>
>
> OK.
>
>
>> > +#define KVM_CREATE_DEVICE        _IOWR(KVMIO,  0xac, struct
>> > kvm_create_device)
>> > +#define KVM_SET_DEVICE_ATTR      _IOW(KVMIO,  0xad, struct
>> > kvm_device_attr)
>> > +#define KVM_GET_DEVICE_ATTR      _IOW(KVMIO,  0xae, struct
>> > kvm_device_attr)
>>
>> _IOWR ?
>
>
> struct kvm_device_attr itself is write-only, though the data pointed to by
> the addr field goes the other way for GET.  ONE_REG is in the same situation
> and also uses _IOW for both.
>
>

ok.

Btw., what about the size of the attr? implicitly defined through the attr id?

>> > +static int kvm_ioctl_create_device(struct kvm *kvm,
>> > +                                  struct kvm_create_device *cd)
>> > +{
>> > +       struct kvm_device *dev = NULL;
>> > +       bool test = cd->flags & KVM_CREATE_DEVICE_TEST;
>> > +       int id;
>> > +       int r;
>> > +
>> > +       mutex_lock(&kvm->lock);
>> > +
>> > +       id = kvm->num_devices;
>> > +       if (id >= KVM_MAX_DEVICES && !test) {
>> > +               r = -ENOSPC;
>> > +               goto out;
>> > +       }
>> > +
>> > +       switch (cd->type) {
>> > +       default:
>> > +               r = -ENODEV;
>> > +               goto out;
>> > +       }
>>
>> do we really believe that there will be any arch-generic recognition
>> of types? shouldn't this be a call to an arch-specific function
>> instead. Which makes me wonder whether the device type IDs should be
>> arch specific as well...
>
>
> I prefer to look at it from the other direction -- is there any reason why
> this *should* be architecture specific?  What will that make easier?
>

The fact that you don't have to create static inlines for the
architectures that don't define the functions that get called or have
to similar #ifdef tricks, and I also think it's easier to read the
arch-specific bits of the code that way, instead of some arbitrary
function that you have to trace through to figure out where it's
called from.

> By doing device recognition here we don't need a separate copy of this per
> arch (including some #ifdef or modifying every arch at once -- including ARM
> which I can't modify yet because it's not merged), and *if* we should end up
> with an in-kernel-emulated device that gets used across multiple
> architectures, it would be annoying to have to modify all relevant
> architectures (and worse to deal with per-arch numberspaces).

I would say that's exactly what you're going to need with your approach:

switch (cd->type) {
case KVM_ARM_VGIC_V2_0:
    kvm_arm_vgic_v2_0_create(...);
}


are you going to ifdef here in this function, or? I think it's cleaner
to have the single arch-specific hooks and handle the cases there.

The use case of having a single device which is so central to the
system that we emulate it inside the kernel and is shared across
multiple archs is pretty far fetched to me.

However, this is internal and can always be changed, so if everyone
agrees on the overall API, whichever way you implement it is fine with
me.

>
> -Scott
>
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Gleb Natapov Feb. 19, 2013, 12:24 p.m. UTC | #5
On Mon, Feb 18, 2013 at 05:01:40PM -0600, Scott Wood wrote:
> On 02/18/2013 06:21:59 AM, Gleb Natapov wrote:
> >Copying Christoffer since ARM has in kernel irq chip too.
> >
> >On Wed, Feb 13, 2013 at 11:49:15PM -0600, Scott Wood wrote:
> >> Currently, devices that are emulated inside KVM are configured in a
> >> hardcoded manner based on an assumption that any given architecture
> >> only has one way to do it.  If there's any need to access device
> >state,
> >> it is done through inflexible one-purpose-only IOCTLs (e.g.
> >> KVM_GET/SET_LAPIC).  Defining new IOCTLs for every little thing is
> >> cumbersome and depletes a limited numberspace.
> >>
> >> This API provides a mechanism to instantiate a device of a certain
> >> type, returning an ID that can be used to set/get attributes of the
> >> device.  Attributes may include configuration parameters (e.g.
> >> register base address), device state, operational commands, etc.  It
> >> is similar to the ONE_REG API, except that it acts on devices rather
> >> than vcpus.
> >You are not only provide different way to create in kernel irq
> >chip you
> >also use an alternate way to trigger interrupt lines. Before going
> >into
> >interface specifics lets think about whether it is really worth it?
> 
> Which "it" do you mean here?
> 
"It" is adding of a new interface if it will have only one user while
existing one can be adjusted for your needs. If ARM people are on board
I feel much better about it. The question is how on board they are :)
are they willing to make vGIC use it before upstream merge? vGIC is
merged separately from KVM code, so it will not affect merging of KVM
itself.

> The ability to set/get attributes is needed.  Sorry, but "get or set
> one blob of data, up to 512 bytes, for the entire irqchip" is just
> not good enough -- assuming you don't want us to start sticking
> pointers and commands in *that* data. :-)
> 
Proposed interface sticks pointers into ioctl data, so why doing the same
for KVM_SET_IRQCHIP/KVM_GET_IRQCHIP makes you smile. For signaling irqs (I
think this is what you mean by "commands") we have KVM_IRQ_LINE.

> If you mean the way to inject interrupts, it's simpler this way.
> Why go out of our way to inject common glue code into a
> communication path between hw/kvm/mpic.c in QEMU and
> arch/powerpc/kvm/mpic.c in KVM?  Or rather, why make that common
> glue be specific to this one function when we could reuse the same
> communication glue used for other things, such as device state?
You will need glue anyway and I do no see how amount of it is much
different one way or the other. Gluing qemu_set_irq() to
ioctl(KVM_IRQ_LINE) or ioctl(KVM_SET_DEVICE_ATTR) is not much different.

Of course, since the interface you propose is not irq chip specific we
need non irq chip specific way to talk to it. But how do you propose
to model things like KVM_IRQ_LINE_STATUS with KVM_SET_DEVICE_ATTR?
KVM_SET_DEVICE_ATTR needs to return data back and getting data back from
"set" ioctl is strange. Other devices may get other commands that need
response, so if we design generic interface we should take it into
account. I think using KVM_SET_DEVICE_ATTR to inject interrupts is a
misnomer, you do not set internal device attribute, you toggle external
input. My be another ioctl KVM_SEND_DEVICE_COMMAND is needed.

> 
> And that's just for regular interrupts.  MSIs are vastly simpler on
> MPIC than what x86 does.
> 
> >x86 obviously support old way and will have to for some, very
> >long, time.
> 
> Sure.
> 
> >ARM vGIC code, that is ready to go upstream, uses old way too. So
> >it will
> >be 2 archs against one.
> 
> I wasn't aware that that's how it worked. :-P
> 
What worked? That vGIC uses existing interface or that non generic
interface used by many arches wins generic one used by only one arch?

> I was trying to be considerate by not making the entire thing
> gratuitously PPC or MPIC specific, as some others seem inclined to
> do (e.g. see irqfd and APIC).  We already had a discussion on ARM's
> "set address" ioctl and rather than extend *that* interface, they
> preferred to just stick something ARM-specific in ASAP with the
> understanding that it would be replaced (or more accurately, kept
> around as a thin wrapper around the new stuff) later.
> 
I am not against generic interfaces in general and proposed one in
particular (I have comments about it but this is for other emails),
I am trying to make sure it will be used by more than one user before
committing to it. APIs are easy to add and impossible to remove.

> >Christoffer do you think the proposed way it
> >better for your needs. Are you willing to make vGIC use it?
> >
> >Scott, what other devices are you planning to support with this
> >interface?
> 
> At the moment I do not have plans for other devices, though what
> does it hurt for the capability to be there?
> 
> -Scott

--
			Gleb.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Gleb Natapov Feb. 19, 2013, 12:45 p.m. UTC | #6
On Mon, Feb 18, 2013 at 09:50:44PM -0800, Christoffer Dall wrote:
> >> > +static int kvm_ioctl_create_device(struct kvm *kvm,
> >> > +                                  struct kvm_create_device *cd)
> >> > +{
> >> > +       struct kvm_device *dev = NULL;
> >> > +       bool test = cd->flags & KVM_CREATE_DEVICE_TEST;
> >> > +       int id;
> >> > +       int r;
> >> > +
> >> > +       mutex_lock(&kvm->lock);
> >> > +
> >> > +       id = kvm->num_devices;
> >> > +       if (id >= KVM_MAX_DEVICES && !test) {
> >> > +               r = -ENOSPC;
> >> > +               goto out;
> >> > +       }
> >> > +
> >> > +       switch (cd->type) {
> >> > +       default:
> >> > +               r = -ENODEV;
> >> > +               goto out;
> >> > +       }
> >>
> >> do we really believe that there will be any arch-generic recognition
> >> of types? shouldn't this be a call to an arch-specific function
> >> instead. Which makes me wonder whether the device type IDs should be
> >> arch specific as well...
> >
> >
> > I prefer to look at it from the other direction -- is there any reason why
> > this *should* be architecture specific?  What will that make easier?
> >
> 
> The fact that you don't have to create static inlines for the
> architectures that don't define the functions that get called or have
> to similar #ifdef tricks, and I also think it's easier to read the
> arch-specific bits of the code that way, instead of some arbitrary
> function that you have to trace through to figure out where it's
> called from.
> 
> > By doing device recognition here we don't need a separate copy of this per
> > arch (including some #ifdef or modifying every arch at once -- including ARM
> > which I can't modify yet because it's not merged), and *if* we should end up
> > with an in-kernel-emulated device that gets used across multiple
> > architectures, it would be annoying to have to modify all relevant
> > architectures (and worse to deal with per-arch numberspaces).
> 
> I would say that's exactly what you're going to need with your approach:
> 
> switch (cd->type) {
> case KVM_ARM_VGIC_V2_0:
>     kvm_arm_vgic_v2_0_create(...);
> }
> 
> 
> are you going to ifdef here in this function, or? I think it's cleaner
> to have the single arch-specific hooks and handle the cases there.
> 
That is exactly what last patch is doing:
+#ifdef CONFIG_KVM_MPIC
+       case KVM_DEV_TYPE_FSL_MPIC_20:
+       case KVM_DEV_TYPE_FSL_MPIC_42: {
+               if (test) {
+                       r = 0;
+                       break;
+               }
+
+               r = kvm_create_mpic(kvm, cd->type, &dev);
+               break;
+       }
+#endif

> The use case of having a single device which is so central to the
> system that we emulate it inside the kernel and is shared across
> multiple archs is pretty far fetched to me.
> 
There is (or should I say was) one such device: IOAPIC. It is shared
between ia64 and x86.

Unless we have device that is shared between all/some arches I am with
Christoffer on this one. If such device will appear we ca do:

kvm_ioctl_create_device()
{
	switch (cd->type) {
         case DEVICEA_SHARED_BY_ALL_ARCHS:
                r = createa()
         break;
         case DEVICEB_SHARED_BY_ALL_ARCHS:
	        r = createb()
	 break;
         default:
           r = kvm_ioctl_arch_create_device();
       }
}

--
			Gleb.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoffer Dall Feb. 19, 2013, 3:51 p.m. UTC | #7
On Tue, Feb 19, 2013 at 4:24 AM, Gleb Natapov <gleb@redhat.com> wrote:
> On Mon, Feb 18, 2013 at 05:01:40PM -0600, Scott Wood wrote:
>> On 02/18/2013 06:21:59 AM, Gleb Natapov wrote:
>> >Copying Christoffer since ARM has in kernel irq chip too.
>> >
>> >On Wed, Feb 13, 2013 at 11:49:15PM -0600, Scott Wood wrote:
>> >> Currently, devices that are emulated inside KVM are configured in a
>> >> hardcoded manner based on an assumption that any given architecture
>> >> only has one way to do it.  If there's any need to access device
>> >state,
>> >> it is done through inflexible one-purpose-only IOCTLs (e.g.
>> >> KVM_GET/SET_LAPIC).  Defining new IOCTLs for every little thing is
>> >> cumbersome and depletes a limited numberspace.
>> >>
>> >> This API provides a mechanism to instantiate a device of a certain
>> >> type, returning an ID that can be used to set/get attributes of the
>> >> device.  Attributes may include configuration parameters (e.g.
>> >> register base address), device state, operational commands, etc.  It
>> >> is similar to the ONE_REG API, except that it acts on devices rather
>> >> than vcpus.
>> >You are not only provide different way to create in kernel irq
>> >chip you
>> >also use an alternate way to trigger interrupt lines. Before going
>> >into
>> >interface specifics lets think about whether it is really worth it?
>>
>> Which "it" do you mean here?
>>
> "It" is adding of a new interface if it will have only one user while
> existing one can be adjusted for your needs. If ARM people are on board
> I feel much better about it. The question is how on board they are :)
> are they willing to make vGIC use it before upstream merge? vGIC is
> merged separately from KVM code, so it will not affect merging of KVM
> itself.

Everything is queued for 3.9, the vgic+timers code is in the arm-soc
tree, so this is not going to change right now, but I can see it
happening for 3.10 if this proposed interface is accepted.

>
>> The ability to set/get attributes is needed.  Sorry, but "get or set
>> one blob of data, up to 512 bytes, for the entire irqchip" is just
>> not good enough -- assuming you don't want us to start sticking
>> pointers and commands in *that* data. :-)
>>
> Proposed interface sticks pointers into ioctl data, so why doing the same
> for KVM_SET_IRQCHIP/KVM_GET_IRQCHIP makes you smile. For signaling irqs (I
> think this is what you mean by "commands") we have KVM_IRQ_LINE.
>
>> If you mean the way to inject interrupts, it's simpler this way.
>> Why go out of our way to inject common glue code into a
>> communication path between hw/kvm/mpic.c in QEMU and
>> arch/powerpc/kvm/mpic.c in KVM?  Or rather, why make that common
>> glue be specific to this one function when we could reuse the same
>> communication glue used for other things, such as device state?
> You will need glue anyway and I do no see how amount of it is much
> different one way or the other. Gluing qemu_set_irq() to
> ioctl(KVM_IRQ_LINE) or ioctl(KVM_SET_DEVICE_ATTR) is not much different.
>
> Of course, since the interface you propose is not irq chip specific we
> need non irq chip specific way to talk to it. But how do you propose
> to model things like KVM_IRQ_LINE_STATUS with KVM_SET_DEVICE_ATTR?
> KVM_SET_DEVICE_ATTR needs to return data back and getting data back from
> "set" ioctl is strange. Other devices may get other commands that need
> response, so if we design generic interface we should take it into
> account. I think using KVM_SET_DEVICE_ATTR to inject interrupts is a
> misnomer, you do not set internal device attribute, you toggle external
> input. My be another ioctl KVM_SEND_DEVICE_COMMAND is needed.
>

I agree on using KVM_SET_DEVICE_ATTR for injecting interrupts is a bit
funky, for the ARM uses we would use this for setting the address used
to expose distributor and CPU interfaces to guests, not to inject
interrupt, first approximation.

>>
>> And that's just for regular interrupts.  MSIs are vastly simpler on
>> MPIC than what x86 does.
>>
>> >x86 obviously support old way and will have to for some, very
>> >long, time.
>>
>> Sure.
>>
>> >ARM vGIC code, that is ready to go upstream, uses old way too. So
>> >it will
>> >be 2 archs against one.
>>
>> I wasn't aware that that's how it worked. :-P
>>
> What worked? That vGIC uses existing interface or that non generic
> interface used by many arches wins generic one used by only one arch?
>
>> I was trying to be considerate by not making the entire thing
>> gratuitously PPC or MPIC specific, as some others seem inclined to
>> do (e.g. see irqfd and APIC).  We already had a discussion on ARM's
>> "set address" ioctl and rather than extend *that* interface, they
>> preferred to just stick something ARM-specific in ASAP with the
>> understanding that it would be replaced (or more accurately, kept
>> around as a thin wrapper around the new stuff) later.
>>
> I am not against generic interfaces in general and proposed one in
> particular (I have comments about it but this is for other emails),
> I am trying to make sure it will be used by more than one user before
> committing to it. APIs are easy to add and impossible to remove.
>
>> >Christoffer do you think the proposed way it
>> >better for your needs. Are you willing to make vGIC use it?
>> >
>> >Scott, what other devices are you planning to support with this
>> >interface?
>>
>> At the moment I do not have plans for other devices, though what
>> does it hurt for the capability to be there?
>>
>> -Scott
>
> --
>                         Gleb.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoffer Dall Feb. 20, 2013, 2:16 a.m. UTC | #8
On Tue, Feb 19, 2013 at 12:16 PM, Scott Wood <scottwood@freescale.com> wrote:
> On 02/18/2013 11:50:44 PM, Christoffer Dall wrote:
>>
>> On Mon, Feb 18, 2013 at 4:53 PM, Scott Wood <scottwood@freescale.com>
>> wrote:
>> > On 02/18/2013 06:44:20 PM, Christoffer Dall wrote:
>> >>
>> >> On Wed, Feb 13, 2013 at 9:49 PM, Scott Wood <scottwood@freescale.com>
>> >> > +#define KVM_CREATE_DEVICE        _IOWR(KVMIO,  0xac, struct
>> >> > kvm_create_device)
>> >> > +#define KVM_SET_DEVICE_ATTR      _IOW(KVMIO,  0xad, struct
>> >> > kvm_device_attr)
>> >> > +#define KVM_GET_DEVICE_ATTR      _IOW(KVMIO,  0xae, struct
>> >> > kvm_device_attr)
>> >>
>> >> _IOWR ?
>> >
>> >
>> > struct kvm_device_attr itself is write-only, though the data pointed to
>> > by
>> > the addr field goes the other way for GET.  ONE_REG is in the same
>> > situation
>> > and also uses _IOW for both.
>> >
>> >
>>
>> ok.
>>
>> Btw., what about the size of the attr? implicitly defined through the attr
>> id?
>
>
> Yes, same as in ONE_REG.
>
>
>> and I also think it's easier to read the
>> arch-specific bits of the code that way, instead of some arbitrary
>> function that you have to trace through to figure out where it's
>> called from.
>
>
> I don't follow.
>
>

I just mean that if you look in arch/XXX/kvm/XXX.c and see a function
called kvm_create_xxx_dev(...) then you're like, what context is this
being called from again and in which order, etc. Of course we can name
the function kvm_ioctl_create_dev_xxx(...), but I still like to be
able to follow the flow of things that are really arch specific, but
anyhow, this is a weak argument.

>> > By doing device recognition here we don't need a separate copy of this
>> > per
>> > arch (including some #ifdef or modifying every arch at once -- including
>> > ARM
>> > which I can't modify yet because it's not merged), and *if* we should
>> > end up
>> > with an in-kernel-emulated device that gets used across multiple
>> > architectures, it would be annoying to have to modify all relevant
>> > architectures (and worse to deal with per-arch numberspaces).
>>
>> I would say that's exactly what you're going to need with your approach:
>>
>> switch (cd->type) {
>> case KVM_ARM_VGIC_V2_0:
>>     kvm_arm_vgic_v2_0_create(...);
>> }
>>
>>
>> are you going to ifdef here in this function, or? I think it's cleaner
>> to have the single arch-specific hooks and handle the cases there.
>
>
> There's an ifdef, as you can see from the patch that adds MPIC support.  But
> it's the same ifdef that gets used to determine whether the device code gets
> built in.  Nothing special needs to be added; no per-architecture hoop to
> jump through.
>
> Note that we would still need per-device ifdefs in the arch code, because
> not all PPC KVM builds are going to have MPIC support.

yeah, that's the same on ARM.

>
> What if, instead of a switch statement and ifdefs, it operated on a
> registration basis?
>
>

Probably just makes the code more confusing and harder to grep with
the limited number of in-kernel devices we support. Between that and
your current approach, I prefer the current approach. Anyhow, the
whole thing is internal state, as I wrote earlier, and by no means a
deal breaker for me, and as long as we don't have to mess around with
include/linux/kvm_host.h for changing arch-specific stuff, which I
believe is the case even with the current design, I'm ok with it.

>> The use case of having a single device which is so central to the
>> system that we emulate it inside the kernel and is shared across
>> multiple archs is pretty far fetched to me.
>
>
> I don't think it's that far fetched.  APIC is shared between x86 and ia64 --
> even if APIC has no need for anything beyond existing API, it shows that
> it's not a crazy possibility.  Freescale already has some devices that are
> shared between PPC and ARM, and that set will expand (probably not to irq
> controllers, though the probability is non-zero) with Layerscape, about
> which Freeescale says, "The unique, core-agnostic architecture incorporates
> the optimum core for the given application—ARM cores or Power Architecture
> cores."  We already need to go back and non-ppc-ize various drivers,
> including their reliance on I/O accessors that are defined in
> architecture-specific ways...  Making things gratuitiously architecture
> specific is just a bad idea, even if the "use case" for it actually being
> used on multiple architectures is remote.
>

For the record I think this is a simplification: making this generic
always comes at the cost of some added complexity exactly due to the
loss of being specific. It's a balancing act to figure out which is
preferred given the magnitude of the cost. As I indicate above, this
case is not too bad.

> Normal kernel drivers tend to go in drivers/, not arch/, even if they're
> only used on one architecture...
>

I know, but then you don't have #ifdef CONFIG_SOME_WEIRD_DEVICE in
kernel/xxx.c which is the only thing I find mildly unpleasing.

But let's not waste any more time on this detail.

>
>> However, this is internal and can always be changed, so if everyone
>> agrees on the overall API, whichever way you implement it is fine with
>> me.
>
>
> We at least need the numberspace to not be architecture-specific if we want
> to retain the possibility of changing later -- not to mention what happens
> if architectures merge.  I see that "arm" and "arm64" are separate, despite
> the fact that other architectures that used to be split this way have since
> merged.  Maybe "arm64" is too different from "arm" for that to happen, but
> who knows...
>

Fair point, nobody knows.

> ...and if they don't merge, wouldn't that be a likely case for devices
> shared across architectures?  Does arm64 use gic/vgic?  This post suggests
> that there is at least something in common (the bit about "once the GIC code
> is shared between
> arm and arm64"):
> http://lists.infradead.org/pipermail/linux-arm-kernel/2012-December/135836.html
>

I'm not sure how much of that is public at this point, or even
determined. But KVM already shares code between arm64 and arm, so I
guess I thought of this as a single architecture from the point of
view of virt/kvm/kvm_main.c, but that may be incorrect actually.

I really need to find time to play around more with the arm64 code.

Thanks for the thoughts.

-Christoffer
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Gleb Natapov Feb. 20, 2013, 1:09 p.m. UTC | #9
On Tue, Feb 19, 2013 at 03:16:37PM -0600, Scott Wood wrote:
> On 02/19/2013 06:24:18 AM, Gleb Natapov wrote:
> >On Mon, Feb 18, 2013 at 05:01:40PM -0600, Scott Wood wrote:
> >> The ability to set/get attributes is needed.  Sorry, but "get or set
> >> one blob of data, up to 512 bytes, for the entire irqchip" is just
> >> not good enough -- assuming you don't want us to start sticking
> >> pointers and commands in *that* data. :-)
> >>
> >Proposed interface sticks pointers into ioctl data, so why doing
> >the same
> >for KVM_SET_IRQCHIP/KVM_GET_IRQCHIP makes you smile.
> 
> There's a difference between putting a pointer in an ioctl control
> structure that is specifically documented as being that way (as in
> ONE_REG), versus taking an ioctl that claims to be setting/getting a
> blob of state and embedding pointers in it.  It would be like
> sticking a pointer in the attribute payload of this API, which I
> think is something to be discouraged.
If documentation is what differentiate for you between silly and smart
then write documentation instead of new interfaces.

KVM_SET_IRQCHIP/KVM_GET_IRQCHIP is defined to operate on blob of data on
x86, nothing prevent you from adding MPIC specifics to the interface,
Add mpic state into kvm_irqchip structure and if 512 bytes is not enough
for you to transfer the state put pointers there and _document_ them.
But with 512 bytes you can transfer properties inline, so you probably
do not need pointer there anyway. I see you have three properties 2 of
them 32bit and one 64bit.

>                                        It'd also be using
> KVM_SET_IRQCHIP to read data, which is the sort of thing you object
> to later on regarding KVM_IRQ_LINE_STATUS.
> 
Do not see why.

> Then there's the silliness of transporting 512 bytes just to read a
> descriptor for transporting something else.
> 
Yes, agree. But is this enough of a reason to introduce entirely new
interface? Is it on performance critical path? Doubt it, unless you
abuse the interface to send interrupts, but then isn't it silty to
do copy_from_user() twice to inject an interrupt like proposed interface
does?

> >For signaling irqs (I think this is what you mean by "commands")
> >we have KVM_IRQ_LINE.
> 
> It's one type of command.  Another is setting the address.  Another
> is writing to registers that have side effects (this is how MSI
> injection is done on MPIC, just as in real hardware).
> 
Setting the address is setting an attribute. Sending MSI is a command.
Things you set/get during init/migration are attributes. Things you do
to cause side-effects are commands.

> What is the benefit of KVM_IRQ_LINE over what MPIC does?  What real
> (non-glue/wrapper) code can become common?
> 
No new ioctl with exactly same result (well actually even faster since
less copying is done). You need to show us the benefits of the new interface
vs existing one, not vice versa.

> And I really hope you don't want us to do MSIs the x86 way.
> 
What is wrong with KVM_SIGNAL_MSI? Except that you'll need code to glue it
to mpic.

> In the XICS thread, Paul brought up the possibliity of cascaded
> MPICs.  It's not relevant to the systems we're trying to model, but
> if one did want to use the in-kernel irqchip interface for that, it
> would be really nice to be able to operate on a specific MPIC for
> injection rather than have to come up with some sort of global
> identifier (above and beyond the minor flattening we'd need to do to
> represent a single MPIC's interrupts in a flat numberspace).
> 
ARM encodes information in irq field of KVM_IRQ_LINE like that:
  bits:  | 31 ... 24 | 23  ... 16 | 15    ...     0 |
  field: | irq_type  | vcpu_index |   irq_number    |
Why will similar approach not work?

> >> If you mean the way to inject interrupts, it's simpler this way.
> >> Why go out of our way to inject common glue code into a
> >> communication path between hw/kvm/mpic.c in QEMU and
> >> arch/powerpc/kvm/mpic.c in KVM?  Or rather, why make that common
> >> glue be specific to this one function when we could reuse the same
> >> communication glue used for other things, such as device state?
> >You will need glue anyway and I do no see how amount of it is much
> >different one way or the other.
> 
> It uses glue that we need to be present for other things anyway.  If
> it weren't for XICS we wouldn't need a KVM_IRQ_LINE implementation
> at all on PPC.  It may not be a major difference, but it doesn't
> affect anything but MPIC and it seems more straightforward this way.
> 
We are talking about something like 4 lines of userspace code including
bracket. I do not think this is strong point in favor of different
interface.

> >Gluing qemu_set_irq() to
> >ioctl(KVM_IRQ_LINE) or ioctl(KVM_SET_DEVICE_ATTR) is not much
> >different.
> 
> qemu_set_irq() is not glued to either of those.  It's glued to
> kvm_openpic_set_irq(), kvm_ioapic_set_irq(), etc.  It's already not
> generic code.
> 
OK, this does not invalidates my argument though.

> >Of course, since the interface you propose is not irq chip specific
> 
> This part of it is.
> 
> >we need non irq chip specific way to talk to it. But how do you
> >propose
> >to model things like KVM_IRQ_LINE_STATUS with KVM_SET_DEVICE_ATTR?
> 
> That one's not even in api.txt, so could you explain what exactly
> it's supposed to return, and why it's needed?
True. We need to add it. Your guess below is correct.

> 
> AFAICT, the only thing it gets used for in QEMU is coalescing
> mc146818rtc interrupts.
> 
At present yes. Still need to be supportable.

> Could an error return be used for cases where the IRQ was not
> delivered, in the very unlikely event that we want to implement
> something similar on MPIC?
We can, but I do not think it will be good API. This condition is not an
error.

>                            Note again that MPIC's decision to use
> or not use KVM_IRQ_LINE is only about what MPIC does; it is not
> inherent in the device control API.
That's the crux of the problem though. MPIC tries to be different just
for the sake to be different. Why? The only explanation you provide is
because current API is "silly", not that you cannot implement MPIC with
it or it will be unnecessary slow, just "silly".

> 
> >KVM_SET_DEVICE_ATTR needs to return data back and getting data
> >back from
> >"set" ioctl is strange.
> 
> If we really need a single atomic operation to both read and write,
> beyond returning error values, then yes, that would be a new ioctl.
> It could be added in the future if needed.
> 
> >Other devices may get other commands that need
> >response, so if we design generic interface we should take it into
> >account. I think using KVM_SET_DEVICE_ATTR to inject interrupts is a
> >misnomer, you do not set internal device attribute, you toggle
> >external
> >input. My be another ioctl KVM_SEND_DEVICE_COMMAND is needed.
> 
> I see no need for a separate ioctl in terms of the underlying
> infrastructure for distinguishing "attribute" from "write-only
> command".  I'm open to improvements on what the ioctl is called.
> It's basically like setting a register on a device, except I was
> concerned that if we actually called it a "register" that people
> would take it too literally and think it's only for the architected
> register state of the emulated device.
I agree "attribute" is better name than "register", but injecting
interrupt is not setting an attribute.

> 
> >> >ARM vGIC code, that is ready to go upstream, uses old way too. So
> >> >it will
> >> >be 2 archs against one.
> >>
> >> I wasn't aware that that's how it worked. :-P
> >>
> >What worked? That vGIC uses existing interface or that non generic
> >interface used by many arches wins generic one used by only one arch?
> 
> The latter.  Two wrongs don't make a right, and adding another
> inextensible, device-specific API is not the answer to the existing
> APIs being too inextensible and device/arch-specific.  Some portion
> will always need to be device-specific because we're controlling the
> creation and of a specific device, but the glue does not need to be.
>
This is not "adding another inextensible, device-specific API" vs "adding
cool generic extensible API" though. It is "using existing inextensible,
device-specific API" vs "adding cool generic extensible API".

> >APIs are easy to add and impossible to remove.
> 
> That's why I want to get it right this time.
> 
And what if you'll fail? What if next architecture will bring new
developer that will proclaim your new interface "silly" since it does not
allow for device destruction and do not return file descriptor for newly
created device that userspace can do select on to wait for a device's
events or mmap memory for fast userspace/device communication?

--
			Gleb.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Marcelo Tosatti Feb. 20, 2013, 9:17 p.m. UTC | #10
On Mon, Feb 18, 2013 at 05:01:40PM -0600, Scott Wood wrote:
> On 02/18/2013 06:21:59 AM, Gleb Natapov wrote:
> >Copying Christoffer since ARM has in kernel irq chip too.
> >
> >On Wed, Feb 13, 2013 at 11:49:15PM -0600, Scott Wood wrote:
> >> Currently, devices that are emulated inside KVM are configured in a
> >> hardcoded manner based on an assumption that any given architecture
> >> only has one way to do it.  If there's any need to access device
> >state,
> >> it is done through inflexible one-purpose-only IOCTLs (e.g.
> >> KVM_GET/SET_LAPIC).  Defining new IOCTLs for every little thing is
> >> cumbersome and depletes a limited numberspace.

Creation of ioctl has advantages. It makes explicit what the 
data contains and how it should be interpreted.
This means that for example, policy control can be performed at ioctl
level (can group, by ioctl number, which ioctl calls are allowed after
VM creation, for example).

It also makes it explicit that its a userspace interface which
applications depend.

With a single 'set device attribute' ioctl its more difficult
to do that.

Abstracting the details also makes it cumbersome to read 
strace output :-)

> >> This API provides a mechanism to instantiate a device of a certain
> >> type, returning an ID that can be used to set/get attributes of the
> >> device.  Attributes may include configuration parameters (e.g.
> >> register base address), device state, operational commands, etc.  It
> >> is similar to the ONE_REG API, except that it acts on devices rather
> >> than vcpus.
> >You are not only provide different way to create in kernel irq
> >chip you
> >also use an alternate way to trigger interrupt lines. Before going
> >into
> >interface specifics lets think about whether it is really worth it?
> 
> Which "it" do you mean here?
> 
> The ability to set/get attributes is needed.  Sorry, but "get or set
> one blob of data, up to 512 bytes, for the entire irqchip" is just
> not good enough -- assuming you don't want us to start sticking
> pointers and commands in *that* data. :-)

Why not? Is it necessary to constantly read/write attributes?

> If you mean the way to inject interrupts, it's simpler this way.

Are you injecting interrupts via this new SET_DEVICE_ATTRIBUTE ioctl?

> Why go out of our way to inject common glue code into a
> communication path between hw/kvm/mpic.c in QEMU and
> arch/powerpc/kvm/mpic.c in KVM?  Or rather, why make that common
> glue be specific to this one function when we could reuse the same
> communication glue used for other things, such as device state?
> 
> And that's just for regular interrupts.  MSIs are vastly simpler on
> MPIC than what x86 does.
> 
> >x86 obviously support old way and will have to for some, very
> >long, time.
> 
> Sure.
> 
> >ARM vGIC code, that is ready to go upstream, uses old way too. So
> >it will
> >be 2 archs against one.
> 
> I wasn't aware that that's how it worked. :-P
> 
> I was trying to be considerate by not making the entire thing
> gratuitously PPC or MPIC specific, as some others seem inclined to
> do (e.g. see irqfd and APIC).  We already had a discussion on ARM's
> "set address" ioctl and rather than extend *that* interface, they
> preferred to just stick something ARM-specific in ASAP with the
> understanding that it would be replaced (or more accurately, kept
> around as a thin wrapper around the new stuff) later.
> 
> >Christoffer do you think the proposed way it
> >better for your needs. Are you willing to make vGIC use it?
> >
> >Scott, what other devices are you planning to support with this
> >interface?
> 
> At the moment I do not have plans for other devices, though what
> does it hurt for the capability to be there?
> 
> -Scott
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Marcelo Tosatti Feb. 20, 2013, 9:28 p.m. UTC | #11
On Wed, Feb 20, 2013 at 03:09:49PM +0200, Gleb Natapov wrote:
> On Tue, Feb 19, 2013 at 03:16:37PM -0600, Scott Wood wrote:
> > On 02/19/2013 06:24:18 AM, Gleb Natapov wrote:
> > >On Mon, Feb 18, 2013 at 05:01:40PM -0600, Scott Wood wrote:
> > >> The ability to set/get attributes is needed.  Sorry, but "get or set
> > >> one blob of data, up to 512 bytes, for the entire irqchip" is just
> > >> not good enough -- assuming you don't want us to start sticking
> > >> pointers and commands in *that* data. :-)
> > >>
> > >Proposed interface sticks pointers into ioctl data, so why doing
> > >the same
> > >for KVM_SET_IRQCHIP/KVM_GET_IRQCHIP makes you smile.
> > 
> > There's a difference between putting a pointer in an ioctl control
> > structure that is specifically documented as being that way (as in
> > ONE_REG), versus taking an ioctl that claims to be setting/getting a
> > blob of state and embedding pointers in it.  It would be like
> > sticking a pointer in the attribute payload of this API, which I
> > think is something to be discouraged.
> If documentation is what differentiate for you between silly and smart
> then write documentation instead of new interfaces.
> 
> KVM_SET_IRQCHIP/KVM_GET_IRQCHIP is defined to operate on blob of data on
> x86, nothing prevent you from adding MPIC specifics to the interface,
> Add mpic state into kvm_irqchip structure and if 512 bytes is not enough
> for you to transfer the state put pointers there and _document_ them.
> But with 512 bytes you can transfer properties inline, so you probably
> do not need pointer there anyway. I see you have three properties 2 of
> them 32bit and one 64bit.
> 
> >                                        It'd also be using
> > KVM_SET_IRQCHIP to read data, which is the sort of thing you object
> > to later on regarding KVM_IRQ_LINE_STATUS.
> > 
> Do not see why.
> 
> > Then there's the silliness of transporting 512 bytes just to read a
> > descriptor for transporting something else.
> > 
> Yes, agree. But is this enough of a reason to introduce entirely new
> interface? Is it on performance critical path? Doubt it, unless you
> abuse the interface to send interrupts, but then isn't it silty to
> do copy_from_user() twice to inject an interrupt like proposed interface
> does?
> 
> > >For signaling irqs (I think this is what you mean by "commands")
> > >we have KVM_IRQ_LINE.
> > 
> > It's one type of command.  Another is setting the address.  Another
> > is writing to registers that have side effects (this is how MSI
> > injection is done on MPIC, just as in real hardware).
> > 
> Setting the address is setting an attribute. Sending MSI is a command.
> Things you set/get during init/migration are attributes. Things you do
> to cause side-effects are commands.

Yes, it would be good to restrict what can be done via the interface
(to avoid abstracting away problems). At a first glance, i would say
it should allow for "initial configuration of device state", such as
registers etc.

Why exactly you need to set attributes Scott?

> > What is the benefit of KVM_IRQ_LINE over what MPIC does?  What real
> > (non-glue/wrapper) code can become common?
> > 
> No new ioctl with exactly same result (well actually even faster since
> less copying is done). You need to show us the benefits of the new interface
> vs existing one, not vice versa.

Also related to this question is the point of avoiding the
implementation of devices to be spread across userspace and the kernel
(this is one point Avi brought up often). If the device emulation
is implemented entirely in the kernel, all is necessary are initial
values of device registers (and retrieval of those values later for
savevm/migration).

It is then not necessary to set device attributes on a live guest and
deal with the complications associated with that.

<snip>

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Marcelo Tosatti Feb. 20, 2013, 10:44 p.m. UTC | #12
On Wed, Feb 20, 2013 at 06:28:24PM -0300, Marcelo Tosatti wrote:
> On Wed, Feb 20, 2013 at 03:09:49PM +0200, Gleb Natapov wrote:
> > On Tue, Feb 19, 2013 at 03:16:37PM -0600, Scott Wood wrote:
> > > On 02/19/2013 06:24:18 AM, Gleb Natapov wrote:
> > > >On Mon, Feb 18, 2013 at 05:01:40PM -0600, Scott Wood wrote:
> > > >> The ability to set/get attributes is needed.  Sorry, but "get or set
> > > >> one blob of data, up to 512 bytes, for the entire irqchip" is just
> > > >> not good enough -- assuming you don't want us to start sticking
> > > >> pointers and commands in *that* data. :-)
> > > >>
> > > >Proposed interface sticks pointers into ioctl data, so why doing
> > > >the same
> > > >for KVM_SET_IRQCHIP/KVM_GET_IRQCHIP makes you smile.
> > > 
> > > There's a difference between putting a pointer in an ioctl control
> > > structure that is specifically documented as being that way (as in
> > > ONE_REG), versus taking an ioctl that claims to be setting/getting a
> > > blob of state and embedding pointers in it.  It would be like
> > > sticking a pointer in the attribute payload of this API, which I
> > > think is something to be discouraged.
> > If documentation is what differentiate for you between silly and smart
> > then write documentation instead of new interfaces.
> > 
> > KVM_SET_IRQCHIP/KVM_GET_IRQCHIP is defined to operate on blob of data on
> > x86, nothing prevent you from adding MPIC specifics to the interface,
> > Add mpic state into kvm_irqchip structure and if 512 bytes is not enough
> > for you to transfer the state put pointers there and _document_ them.
> > But with 512 bytes you can transfer properties inline, so you probably
> > do not need pointer there anyway. I see you have three properties 2 of
> > them 32bit and one 64bit.
> > 
> > >                                        It'd also be using
> > > KVM_SET_IRQCHIP to read data, which is the sort of thing you object
> > > to later on regarding KVM_IRQ_LINE_STATUS.
> > > 
> > Do not see why.
> > 
> > > Then there's the silliness of transporting 512 bytes just to read a
> > > descriptor for transporting something else.
> > > 
> > Yes, agree. But is this enough of a reason to introduce entirely new
> > interface? Is it on performance critical path? Doubt it, unless you
> > abuse the interface to send interrupts, but then isn't it silty to
> > do copy_from_user() twice to inject an interrupt like proposed interface
> > does?
> > 
> > > >For signaling irqs (I think this is what you mean by "commands")
> > > >we have KVM_IRQ_LINE.
> > > 
> > > It's one type of command.  Another is setting the address.  Another
> > > is writing to registers that have side effects (this is how MSI
> > > injection is done on MPIC, just as in real hardware).
> > > 
> > Setting the address is setting an attribute. Sending MSI is a command.
> > Things you set/get during init/migration are attributes. Things you do
> > to cause side-effects are commands.
> 
> Yes, it would be good to restrict what can be done via the interface
> (to avoid abstracting away problems). At a first glance, i would say
> it should allow for "initial configuration of device state", such as
> registers etc.
> 
> Why exactly you need to set attributes Scott?
> 
> > > What is the benefit of KVM_IRQ_LINE over what MPIC does?  What real
> > > (non-glue/wrapper) code can become common?
> > > 
> > No new ioctl with exactly same result (well actually even faster since
> > less copying is done). You need to show us the benefits of the new interface
> > vs existing one, not vice versa.
> 
> Also related to this question is the point of avoiding the
> implementation of devices to be spread across userspace and the kernel
> (this is one point Avi brought up often). If the device emulation
> is implemented entirely in the kernel, all is necessary are initial
> values of device registers (and retrieval of those values later for
> savevm/migration).
> 
> It is then not necessary to set device attributes on a live guest and
> deal with the complications associated with that.
> 
> <snip>

I suppose the atomic-test-and-set type operations introduced to ONE_REG ioctls
are one example of such complications.

Avi, can you remind us what are the drawbacks of having separate
userspace/kernel device emulation?

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Marcelo Tosatti Feb. 21, 2013, 12:01 a.m. UTC | #13
On Wed, Feb 20, 2013 at 05:20:50PM -0600, Scott Wood wrote:
>  On 02/20/2013 03:17:27 PM, Marcelo Tosatti wrote:
> >On Mon, Feb 18, 2013 at 05:01:40PM -0600, Scott Wood wrote:
> >> On 02/18/2013 06:21:59 AM, Gleb Natapov wrote:
> >> >Copying Christoffer since ARM has in kernel irq chip too.
> >> >
> >> >On Wed, Feb 13, 2013 at 11:49:15PM -0600, Scott Wood wrote:
> >> >> Currently, devices that are emulated inside KVM are
> >configured in a
> >> >> hardcoded manner based on an assumption that any given
> >architecture
> >> >> only has one way to do it.  If there's any need to access device
> >> >state,
> >> >> it is done through inflexible one-purpose-only IOCTLs (e.g.
> >> >> KVM_GET/SET_LAPIC).  Defining new IOCTLs for every little
> >thing is
> >> >> cumbersome and depletes a limited numberspace.
> >
> >Creation of ioctl has advantages. It makes explicit what the
> >data contains and how it should be interpreted.
> >This means that for example, policy control can be performed at ioctl
> >level (can group, by ioctl number, which ioctl calls are allowed after
> >VM creation, for example).
> >
> >It also makes it explicit that its a userspace interface which
> >applications depend.
> >
> >With a single 'set device attribute' ioctl its more difficult
> >to do that.
> 
> I don't see why creating ioctls rather than device attributes (or
> whatever you want to call them) differs this way.  Device attributes
> are inherently userspace interfaces, just as ioctls are.  What the
> data contains and how it is interpreted are documented, albeit in a
> more lightweight format than KVM usually uses for ioctls.
> 
> An ioctl is no guarantee of good documentation.  KVM is far better
> than most of the kernel in that regard, but even there some ioctls
> are missing (e.g. KVM_IRQ_LINE_STATUS as mentioned elsewhere in this
> thread), and some others are inadequately explained (e.g. IRQ
> routing).
> 
> By "policy control", do you just mean that some ioctls act on a
> /dev/kvm fd, others on a vm fd, and others on a vcpu fd?  This is
> pretty similar to having a device fd, except for the mechanism used.

No, i mean policy control acting on ioctl numbers. Its essentially the
same issue as with 'strace' (in that the its necessary to parse the
ioctl data further).

> The main things that I dislike about adding new things at the ioctl
> level are:
> - limited numberspace, and more prone to merge conflicts than a
> device-specific numberspace

Can reserve per-architecture ranges to deal with that.

> - having to add a new pathway to get into the driver for each ioctl,
> rather than having all operations on a particular device go to the
>   right place, and the device implementation interprets further
> (this assumes that we're talking about vm ioctls, and not returning
> a
>   new fd for the device)

Agree with that point.

> - awkward way of negotiating capabilities with userspace (have to
> declare the capability id separately, and add code somewhere outside
> the
>   device implementation to respond to capability requests)

Agree with that point.

> - api.txt formatting that imposes yet another document-internal
> numberspace to conflict on :-)

The main problem, to me, is that the interface allows flexibility but
the details of using such flexibility are not worked out.

That is, lack of definitions such as when setting attributes is allowed.
With a big blog, that information is implicit: a SET operation is a full
device reset.

With individual attributes:
- Which attributes can be set individually?
- Is there an order which must be respected?
- Locking.
- End result: more logic/code necessary.

> >Abstracting the details also makes it cumbersome to read
> >strace output :-)
> 
> You'd have to update strace one way or another if you want
> pretty-printed output.  Having a more restricted API than arbitrary
> ioctls could actually improve the situation -- this could be a good
> reason for including the attribute length as an explicit parameter
> rather than being implicit in the attribute id.  Then you'd just
> teach strace about the device control API once, and not for each new
> attribute or device that gets added.
> 
> >> >> This API provides a mechanism to instantiate a device of a
> >certain
> >> >> type, returning an ID that can be used to set/get attributes
> >of the
> >> >> device.  Attributes may include configuration parameters (e.g.
> >> >> register base address), device state, operational commands,
> >etc.  It
> >> >> is similar to the ONE_REG API, except that it acts on devices
> >rather
> >> >> than vcpus.
> >> >You are not only provide different way to create in kernel irq
> >> >chip you
> >> >also use an alternate way to trigger interrupt lines. Before going
> >> >into
> >> >interface specifics lets think about whether it is really worth it?
> >>
> >> Which "it" do you mean here?
> >>
> >> The ability to set/get attributes is needed.  Sorry, but "get or set
> >> one blob of data, up to 512 bytes, for the entire irqchip" is just
> >> not good enough -- assuming you don't want us to start sticking
> >> pointers and commands in *that* data. :-)
> >
> >Why not? Is it necessary to constantly read/write attributes?
> 
> It's not about how often we do it, but how much state we have,
> especially if we ever want to implement migration.

Migration reads/writes the device state once, which is supposedly much
smaller than VM's RAM, so can't see the logic here.

> >> If you mean the way to inject interrupts, it's simpler this way.
> >
> >Are you injecting interrupts via this new SET_DEVICE_ATTRIBUTE ioctl?
> 
> Yes, but even if that gets shot down (the best objection IMHO is the
> one nobody is raising -- how to hook into irqfd), we still need the
> rest of it.  

Why are individual attributes necessary ? Still unclear.

How about dropping it? And then assume full blob write is a device reset.

> I think we'd even want to keep attributes for IRQ line
> status so that we have a way to read it, even if just for debugging.

No problem reading full blob on that case.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Marcelo Tosatti Feb. 21, 2013, 12:14 a.m. UTC | #14
On Wed, Feb 20, 2013 at 05:53:20PM -0600, Scott Wood wrote:
> >Why exactly you need to set attributes Scott?
> 
> That's best answered by looking at patch 6/6 and discussing the
> actual attributes that are defined so far.
> 
> The need to set the base address of the registers is
> straightforward.  When ARM added their special ioctl for this, we
> discussed it being eventually replaced with a more general API (in
> fact, that was the reason they put "ARM" in the name).
> 
> Access to device registers was originally intended for debugging,
> and eventually migration.  It turned out to also be very useful for
> injecting MSIs -- nothing special needed to be done.  It Just
> Works(tm) the same way it does in hardware, with an MMIO write from
> a PCI device to a specific MPIC register.  Again, irqfd may
> complicate this, but there's no need for QEMU-generated MSIs to have
> to take a circuitous path.
> 
> Finally, there's the interrupt source attribute.  Even if we get rid
> of that, we'll need MPIC-specific documentation on how to flatten
> the IRQ numberspace, and if we ever have a cascaded PIC situation
> things could get ugly since there's no way to identify a specific
> IRQ controller in KVM_IRQ_LINE.
> 
> >Also related to this question is the point of avoiding the
> >implementation of devices to be spread across userspace and the kernel
> >(this is one point Avi brought up often). If the device emulation
> >is implemented entirely in the kernel, all is necessary are initial
> >values of device registers (and retrieval of those values later for
> >savevm/migration).
> 
> MPIC emulation is entirely in the kernel with this patchset --
> though the register that lets you reset cores will likely need to be
> kicked back to QEMU.
> 
> >It is then not necessary to set device attributes on a live guest and
> >deal with the complications associated with that.
> 
> Which complications?
> 
> -Scott

Semantics of individual attribute writes, for one.

Locking versus currently executing VCPUs, for another (see how
KVM_SET_IRQ's RCU usage, for instance, that is something should be
shared).

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Gleb Natapov Feb. 21, 2013, 6:39 a.m. UTC | #15
On Wed, Feb 20, 2013 at 07:28:52PM -0600, Scott Wood wrote:
> On 02/20/2013 06:14:37 PM, Marcelo Tosatti wrote:
> >On Wed, Feb 20, 2013 at 05:53:20PM -0600, Scott Wood wrote:
> >> >It is then not necessary to set device attributes on a live
> >guest and
> >> >deal with the complications associated with that.
> >>
> >> Which complications?
> >>
> >> -Scott
> >
> >Semantics of individual attribute writes, for one.
> 
> When the attribute is a device register, the hardware documentation
> takes care of that.  Otherwise, the semantics are documented in the
> device-specific documentation -- which can include restricting when
> the access is allowed.  Same as with any other interface
> documentation.
> 
> I suppose mpic.txt could use an additional statement that
> KVM_DEV_MPIC_GRP_REGISTER performs an access as if it were performed
> by the guest.
> 
If access to an attribute has a guest visible side effect you cannot
use this interface for migration. This is exactly the point in favor
of distinguish accesses that have side effects (COMMAND or whatever)
and accesses that set/get attribute (SET|GET_ATTR).

--
			Gleb.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Gleb Natapov Feb. 21, 2013, 8:22 a.m. UTC | #16
On Wed, Feb 20, 2013 at 08:05:12PM -0600, Scott Wood wrote:
>  On 02/20/2013 07:09:49 AM, Gleb Natapov wrote:
> >On Tue, Feb 19, 2013 at 03:16:37PM -0600, Scott Wood wrote:
> >> On 02/19/2013 06:24:18 AM, Gleb Natapov wrote:
> >> >On Mon, Feb 18, 2013 at 05:01:40PM -0600, Scott Wood wrote:
> >> >> The ability to set/get attributes is needed.  Sorry, but "get
> >or set
> >> >> one blob of data, up to 512 bytes, for the entire irqchip" is
> >just
> >> >> not good enough -- assuming you don't want us to start sticking
> >> >> pointers and commands in *that* data. :-)
> >> >>
> >> >Proposed interface sticks pointers into ioctl data, so why doing
> >> >the same
> >> >for KVM_SET_IRQCHIP/KVM_GET_IRQCHIP makes you smile.
> >>
> >> There's a difference between putting a pointer in an ioctl control
> >> structure that is specifically documented as being that way (as in
> >> ONE_REG), versus taking an ioctl that claims to be setting/getting a
> >> blob of state and embedding pointers in it.  It would be like
> >> sticking a pointer in the attribute payload of this API, which I
> >> think is something to be discouraged.
> >If documentation is what differentiate for you between silly and smart
> >then write documentation instead of new interfaces.
> 
> You mean like what we did with SREGS, that got deprecated and
> replaced with ONE_REG?
> 
SREGS is implemented by ppc. I see ONE_REG as addition to REGS
interface. You can access all register at once or you can access them
one by one. If there is a need we can add MULTIPLE_REGS that will get
list of requested REGS. The interface is not over generic i.e it does
not try to replace KVM_RUN by writing special register.

> How is writing documentation not creating new interfaces, if the
> documentation is different from what the interface is currently
> understood to do?
If this case you misunderstand what I am proposing. The interface
sets/gets irq chip state and this is what it will continue to do. What
needs to be documented is the format of mpic irqchip. Nobody expects it
to be the same for all irq chips.
 
> Note that Marcelo seems to view KVM_SET_IRQCHIP as effectively being
> a device reset, which is rather different.
> 
Marcelo views the interface exactly the same as I view it. It is used for
initializing device state (reset is one of those time when it happens)
and to transfer device state for migration purposes.

> >KVM_SET_IRQCHIP/KVM_GET_IRQCHIP is defined to operate on blob of
> >data on
> >x86, nothing prevent you from adding MPIC specifics to the interface,
> >Add mpic state into kvm_irqchip structure and if 512 bytes is not
> >enough
> >for you to transfer the state put pointers there and _document_ them.
> 
> So basically, you want me to keep this interface but share the ioctl
> number with an older interface? :-P
If that is what you want. Obviously you can drop things that makes
proposed interface generic one.

> 
> >But with 512 bytes you can transfer properties inline, so you probably
> >do not need pointer there anyway. I see you have three properties 2 of
> >them 32bit and one 64bit.
> 
> Three *groups* of properties.  One of the property groups is per
> source, and we can have hundreds of sources.  Another exposes the
> register space, which is 64 KiB (admittedly it's somewhat sparse,
> but there's more than 512 bytes of real data in there). 
I mean that you still access each property one by one but since each
individual one is not bigger than 64bit you can put it inline and do not
need pointers, or you can access groups of properties if each one fits
into the buffer.

>                                                    And we
> don't necessarily want to set *everything*.
What are those cases? You do need to on reset/migration.

> 
> >>                                        It'd also be using
> >> KVM_SET_IRQCHIP to read data, which is the sort of thing you object
> >> to later on regarding KVM_IRQ_LINE_STATUS.
> >>
> >Do not see why.
> 
> It's either that, or have the data direction of the "chip" field in
> KVM_GET_IRQCHIP not be entirely in the "get" direction.
> 
Still do not follow. Example?

> >> Then there's the silliness of transporting 512 bytes just to read a
> >> descriptor for transporting something else.
> >>
> >Yes, agree. But is this enough of a reason to introduce entirely new
> >interface? Is it on performance critical path? Doubt it, unless you
> >abuse the interface to send interrupts, but then isn't it silty to
> >do copy_from_user() twice to inject an interrupt like proposed
> >interface
> >does?
> 
> It should probably be get_user() instead, which is pretty fast in
> the absence of a fault.
> 
> >> >For signaling irqs (I think this is what you mean by "commands")
> >> >we have KVM_IRQ_LINE.
> >>
> >> It's one type of command.  Another is setting the address.  Another
> >> is writing to registers that have side effects (this is how MSI
> >> injection is done on MPIC, just as in real hardware).
> >>
> >Setting the address is setting an attribute. Sending MSI is a command.
> >Things you set/get during init/migration are attributes. Things you do
> >to cause side-effects are commands.
> 
> What if I set the address at a time that isn't init/migration (the
> hardware allows moving it, like a PCI BAR)?  Suddenly it becomes not
> an attribute due to how the caller uses it?
> 
What's the interface for guest to move it? Why it goes via userspace?
You can move APIC base too, but this does not involve userspace. But
even if you do go via userspace, it is just a guest asking to change device
configuration, so using SET_ATTR to set new configuration is fine.

> >> What is the benefit of KVM_IRQ_LINE over what MPIC does?  What real
> >> (non-glue/wrapper) code can become common?
> >>
> >No new ioctl with exactly same result (well actually even faster since
> >less copying is done).
> 
> Which ioctl would go away?
> 
Those that you propose in your new interface.

> >You need to show us the benefits of the new interface
> >vs existing one, not vice versa.
> 
> Well, as I said to Marcello, the real reason why we probably need to
> use the existing interface is irqfd.  That doesn't make the device
> control stuff go away.
> 
> >> And I really hope you don't want us to do MSIs the x86 way.
> >>
> >What is wrong with KVM_SIGNAL_MSI? Except that you'll need code to
> >glue it
> >to mpic.
> 
> We'll have to write extra code for it compared to the current way
> where it works with *zero* code beyond what is wanted for other
> purposes such as debug and (eventually) migration.  At least it's
> more direct than having to establish a GSI route...
If just writing a register cause MSI to be send how do you distinguish
between write that should send MSI and write that is done on migration
to transfer current value? We had that problem with MSRs on x86. We had
to, eventually, add a flag that tells us the reason of MSR access.

> 
> >> In the XICS thread, Paul brought up the possibliity of cascaded
> >> MPICs.  It's not relevant to the systems we're trying to model, but
> >> if one did want to use the in-kernel irqchip interface for that, it
> >> would be really nice to be able to operate on a specific MPIC for
> >> injection rather than have to come up with some sort of global
> >> identifier (above and beyond the minor flattening we'd need to do to
> >> represent a single MPIC's interrupts in a flat numberspace).
> >>
> >ARM encodes information in irq field of KVM_IRQ_LINE like that:
> >  bits:  | 31 ... 24 | 23  ... 16 | 15    ...     0 |
> >  field: | irq_type  | vcpu_index |   irq_number    |
> >Why will similar approach not work?
> 
> Well, it conflicts with the GSI routing stuff, and I don't see an
> irq chip ID field...
It does :( Can't say I am happy about it, but I skipped the discussion
about the interface back then and it is too late to complain now. Since,
as you notices, irqfd interfaces with irq routing I wonder what's ARM
plan about it. But if you choose to go ARM way the format is ARM specific,
so you can use your own encoding and put irq chip information there.

> 
> But otherwise (or assuming you mean to use such an encoding when
> setting up a GSI route), I didn't say this part couldn't be made to
> work.  It will require new kernel code for managing a GSI table in a
> non-APIC way, and a new callback into the device code, but as I've
> said elsewhere I think we need it for irqfd anyway.  If I use
> KVM_IRQ_LINE for injecting interrupts, do you still object to the
> rest of it?
The rest of what, proposed interface? There are two separate discussions
happening here interleaved. First is "do we need to introduce new generic
interface for device creation when existing one, albeit not ideal, can be
used" and I am OK with that as long as ARM moves to it for 3.10, although
I would prefer to have some example of what this interface will be used
for besides irq chips otherwise it will be just another way to create
irqchip. Second one is "how the interface should look like". And here I
think that strong distinction is needed between setting the attributes
and sending commands with side effects for reasons explained all over
this ml thread.

> 
> >> Could an error return be used for cases where the IRQ was not
> >> delivered, in the very unlikely event that we want to implement
> >> something similar on MPIC?
> >We can, but I do not think it will be good API. This condition is
> >not an
> >error.
> 
> -EBUSY seems appropriate enough...
It is. Other commands may have more elaborate return status. Generic
interface should take this into account.

> 
> >>                            Note again that MPIC's decision to use
> >> or not use KVM_IRQ_LINE is only about what MPIC does; it is not
> >> inherent in the device control API.
> >That's the crux of the problem though. MPIC tries to be different just
> >for the sake to be different. Why? The only explanation you provide is
> >because current API is "silly", not that you cannot implement MPIC
> >with
> >it or it will be unnecessary slow, just "silly".
> 
> It's not about "silliness" as that this new thing I added for other
> reasons did the job just as well (again, except when it comes to
> irqfd), and avoided the need for a GSI table, etc.  IRQ injection
> was not the main point of the new interface.
Having generic interface for device creation and then make some devices
special by allowing them to be used with KVM_IRQ_LINE makes little
sense, so IRQ injection may be not the main point of the new interface,
but communication with a device created by the new interface is
something that cannot be ignored at the interface design stage.
 
> 
> >> >Other devices may get other commands that need
> >> >response, so if we design generic interface we should take it into
> >> >account. I think using KVM_SET_DEVICE_ATTR to inject interrupts
> >is a
> >> >misnomer, you do not set internal device attribute, you toggle
> >> >external
> >> >input. My be another ioctl KVM_SEND_DEVICE_COMMAND is needed.
> >>
> >> I see no need for a separate ioctl in terms of the underlying
> >> infrastructure for distinguishing "attribute" from "write-only
> >> command".  I'm open to improvements on what the ioctl is called.
> >> It's basically like setting a register on a device, except I was
> >> concerned that if we actually called it a "register" that people
> >> would take it too literally and think it's only for the architected
> >> register state of the emulated device.
> >I agree "attribute" is better name than "register", but injecting
> >interrupt is not setting an attribute.
> 
> It's a dynamic attribute -- the state of the input line.  Better
> names are welcome.  I don't see this difference as enough to warrant
> separate ioctls.
As long as you use the same attribute for migration and interrupt injection
purpose I do. If you use separate attributes for migration and interrupt
injection then not having separate ioctl is just a hack.

> 
> >> >> >ARM vGIC code, that is ready to go upstream, uses old way
> >too. So
> >> >> >it will
> >> >> >be 2 archs against one.
> >> >>
> >> >> I wasn't aware that that's how it worked. :-P
> >> >>
> >> >What worked? That vGIC uses existing interface or that non generic
> >> >interface used by many arches wins generic one used by only one
> >arch?
> >>
> >> The latter.  Two wrongs don't make a right, and adding another
> >> inextensible, device-specific API is not the answer to the existing
> >> APIs being too inextensible and device/arch-specific.  Some portion
> >> will always need to be device-specific because we're controlling the
> >> creation and of a specific device, but the glue does not need to be.
> >>
> >This is not "adding another inextensible, device-specific API" vs
> >"adding
> >cool generic extensible API" though. It is "using existing
> >inextensible,
> >device-specific API" vs "adding cool generic extensible API".
> 
> The "existing inextensible device-specific API" doesn't have support
> for this "specific device".  Something new has to be added one way
> or another.
> 
And extending existing interface (which is already supports more then one
irqchips BTW) wins.

> >> >APIs are easy to add and impossible to remove.
> >>
> >> That's why I want to get it right this time.
> >>
> >And what if you'll fail?
> 
> That's always a possibility of course.  I don't think that's a good
> reason to avoid trying to move in the right direction.
It is not, but that is not the point I am trying to make :)

> 
> >What if next architecture will bring new
> >developer that will proclaim your new interface "silly" since it
> >does not
> >allow for device destruction and do not return file descriptor for
> >newly
> >created device that userspace can do select on to wait for a device's
> >events or mmap memory for fast userspace/device communication?
> 
> The device id that gets returned is arbitrary; you could turn it
> into an fd later with no loss of compatibility.
> 
> Device destruction would complicate things and I would not support
> requiring all devices to allow it.  If someone wanted to add it for
> certain devices, at the interface level it would just be a new
> ioctl.
> 
And here is the point that I am trying to make. You propose how your
interface can be extended and I agree, it can. But that other guy will
see it in other light: Why should I extend interface that is broken
instead of providing new, perfect one.

--
			Gleb.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Marcelo Tosatti Feb. 21, 2013, 11:03 p.m. UTC | #17
On Wed, Feb 20, 2013 at 07:28:52PM -0600, Scott Wood wrote:
> On 02/20/2013 06:14:37 PM, Marcelo Tosatti wrote:
> >On Wed, Feb 20, 2013 at 05:53:20PM -0600, Scott Wood wrote:
> >> >It is then not necessary to set device attributes on a live
> >guest and
> >> >deal with the complications associated with that.
> >>
> >> Which complications?
> >>
> >> -Scott
> >
> >Semantics of individual attribute writes, for one.
> 
> When the attribute is a device register, the hardware documentation
> takes care of that.  

You are not writing to the registers from the CPU point of view.

> Otherwise, the semantics are documented in the
> device-specific documentation -- which can include restricting when
> the access is allowed.  Same as with any other interface
> documentation.

Again, you are talking about the semantics of device access from the
software operating on the machine view. We are discussing hypervisor
userspace <-> hypervisor kernel interface.

In general you never have to set attributes on a device after it has
been initialized, because there is state associated with that device
that requires proper handling (example: if you modify a timer counter
register of a timer device, any software timers used to emulate the
timer counter must be cancelled).

Also, it is necessary to provide proper locking of device attribute
write versus vcpu device access. So far we have been focusing on having 
a lockless vcpu path.

So when device attributes can be modified has implications beyond what
may seem visible at first.

Are this reasonable arguments?

Basically abstract 'device attributes' are too abstract.

However, your proposed interface deals with sucky capability, versioning
and namespace conflicts we have now. Note these items can probably be
improved separately.

> I suppose mpic.txt could use an additional statement that
> KVM_DEV_MPIC_GRP_REGISTER performs an access as if it were performed
> by the guest.
> 
> >Locking versus currently executing VCPUs, for another (see how
> >KVM_SET_IRQ's RCU usage, for instance, that is something should be
> >shared).
> 
> If you mean kvm_set_irq() in irq_comm.c, that's only relevant when
> you have a GSI routing table, which this patchset doesn't.
> 
> Assuming we end up having a routing table to support irqfd, we still
> can't share the code as is, since it's APIC-specific.

Suppose it is worthwhile to attempt to share code as much as possible.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Marcelo Tosatti Feb. 23, 2013, 3:04 p.m. UTC | #18
On Thu, Feb 21, 2013 at 08:00:25PM -0600, Scott Wood wrote:
> On 02/21/2013 05:03:32 PM, Marcelo Tosatti wrote:
> >On Wed, Feb 20, 2013 at 07:28:52PM -0600, Scott Wood wrote:
> >> On 02/20/2013 06:14:37 PM, Marcelo Tosatti wrote:
> >> >On Wed, Feb 20, 2013 at 05:53:20PM -0600, Scott Wood wrote:
> >> >> >It is then not necessary to set device attributes on a live
> >> >guest and
> >> >> >deal with the complications associated with that.
> >> >>
> >> >> Which complications?
> >> >>
> >> >> -Scott
> >> >
> >> >Semantics of individual attribute writes, for one.
> >>
> >> When the attribute is a device register, the hardware documentation
> >> takes care of that.
> >
> >You are not writing to the registers from the CPU point of view.
> 
> That's exactly how KVM_DEV_MPIC_GRP_REGISTER is defined and
> implemented on MPIC (with the exception of registers whose behavior
> changes based on which specific vcpu you use to access them).
> If/when we have a need to set/get state in a different manner,
> that's a separate attribute group.

Can you describe usage of this register again?

> >> Otherwise, the semantics are documented in the
> >> device-specific documentation -- which can include restricting when
> >> the access is allowed.  Same as with any other interface
> >> documentation.
> >
> >Again, you are talking about the semantics of device access from the
> >software operating on the machine view. We are discussing hypervisor
> >userspace <-> hypervisor kernel interface.
> 
> And I was talking about the userspace-to-hypervisor kernel interface
> documentation.  It just happens that one specific MPIC device group
> ("when the attribute is a device register") is defined with respect
> to what guest software would see if it did a similar access.
> 
> >In general you never have to set attributes on a device after it has
> >been initialized, because there is state associated with that device
> >that requires proper handling (example: if you modify a timer counter
> >register of a timer device, any software timers used to emulate the
> >timer counter must be cancelled).
> 
> Yes, it requires proper handling and the MMIO code does that.
>
> If and when we add raw state accessors, it's totally reasonable for
> there to be command/attribute-specific documented restrictions on
> when the access can be done.

> >Also, it is necessary to provide proper locking of device attribute
> >write versus vcpu device access. So far we have been focusing on
> >having
> >a lockless vcpu path.
> 
> How is device access related to vcpus?  Existing irqchip code is not
> lockless.

VCPUS access in-kernel devices. Yes, it is lockless (see RCU usage in
virt/kvm/).

> >So when device attributes can be modified has implications beyond what
> >may seem visible at first.
> >
> >Are this reasonable arguments?
> >
> >Basically abstract 'device attributes' are too abstract.
> 
> It's up to the device-specific documentation to make them not
> abstract (I admit there are a few details missing in mpic.txt that
> I've pointed out in this thread -- it is RFC v1 after all).  This
> wouldn't be any different if we used separate ioctls for everything.
> It's like saying abstract 'ioctl' is too abstract.

Perhaps a better way to put it is that its too permissive.

> >However, your proposed interface deals with sucky capability,
> >versioning
> >and namespace conflicts we have now. Note these items can probably be
> >improved separately.
> 
> Any particular proposals?

Namespace conflicts: Reserve ranges for each arch. 

The other two items, haven't though. I am not the one bothered :-) (yes, they
suck).

> >> I suppose mpic.txt could use an additional statement that
> >> KVM_DEV_MPIC_GRP_REGISTER performs an access as if it were performed
> >> by the guest.
> >>
> >> >Locking versus currently executing VCPUs, for another (see how
> >> >KVM_SET_IRQ's RCU usage, for instance, that is something should be
> >> >shared).
> >>
> >> If you mean kvm_set_irq() in irq_comm.c, that's only relevant when
> >> you have a GSI routing table, which this patchset doesn't.
> >>
> >> Assuming we end up having a routing table to support irqfd, we still
> >> can't share the code as is, since it's APIC-specific.
> >
> >Suppose it is worthwhile to attempt to share code as much as possible.
> 
> Sure... my point is it isn't a case of "the common code is right
> over there, why aren't you using it?"  I'll try to share what I
> reasonably can, subject to my limited knowledge of how the APIC
> stuff works.  The irqfd code is substantial enough that refactoring
> for sharing should be worthwhile.  I'm not so sure about irq_comm.c.
> 
> -scott

Note just pointing out drawbacks of device attributes (if something of
that sort is integrated, x86 should also use it).

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Marc Zyngier Feb. 24, 2013, 1:12 p.m. UTC | #19
On Tue, 19 Feb 2013 18:16:53 -0800, Christoffer Dall
<cdall@cs.columbia.edu> wrote:
> On Tue, Feb 19, 2013 at 12:16 PM, Scott Wood <scottwood@freescale.com>
> wrote:


>> We at least need the numberspace to not be architecture-specific if we
>> want
>> to retain the possibility of changing later -- not to mention what
>> happens
>> if architectures merge.  I see that "arm" and "arm64" are separate,
>> despite
>> the fact that other architectures that used to be split this way have
>> since
>> merged.  Maybe "arm64" is too different from "arm" for that to happen,
>> but
>> who knows...
>>
> 
> Fair point, nobody knows.

This is unlikely to happen soon.

>> ...and if they don't merge, wouldn't that be a likely case for devices
>> shared across architectures?  Does arm64 use gic/vgic?  This post
>> suggests
>> that there is at least something in common (the bit about "once the GIC
>> code
>> is shared between
>> arm and arm64"):
>>
http://lists.infradead.org/pipermail/linux-arm-kernel/2012-December/135836.html
>>
> 
> I'm not sure how much of that is public at this point, or even
> determined. But KVM already shares code between arm64 and arm, so I
> guess I thought of this as a single architecture from the point of
> view of virt/kvm/kvm_main.c, but that may be incorrect actually.

As I am the sad bastard who wrote most of the VGIC stuff, and also the one
who maintains KVM on arm64, I feel a need to chime in:

arm64 very much uses GIC/VGIC. Actually, all the "in-kernel device" code
is shared between arm and arm64. That means VGIC, timers and PSCI. Which
probably means that code should be moved out of arch/arm/kvm, but that's a
separate story.

It is my intention to keep the interface as close as possible to arm, as
long as that doesn't cause any major issue (arm64 has some slightly
different requirements).

Cheers,

        M.
Gleb Natapov Feb. 24, 2013, 3:46 p.m. UTC | #20
On Thu, Feb 21, 2013 at 08:17:54PM -0600, Scott Wood wrote:
>  On 02/21/2013 02:22:09 AM, Gleb Natapov wrote:
> >On Wed, Feb 20, 2013 at 08:05:12PM -0600, Scott Wood wrote:
> >>  On 02/20/2013 07:09:49 AM, Gleb Natapov wrote:
> >> >On Tue, Feb 19, 2013 at 03:16:37PM -0600, Scott Wood wrote:
> >> >> On 02/19/2013 06:24:18 AM, Gleb Natapov wrote:
> >> >> >On Mon, Feb 18, 2013 at 05:01:40PM -0600, Scott Wood wrote:
> >> >> >> The ability to set/get attributes is needed.  Sorry, but "get
> >> >or set
> >> >> >> one blob of data, up to 512 bytes, for the entire irqchip" is
> >> >just
> >> >> >> not good enough -- assuming you don't want us to start
> >sticking
> >> >> >> pointers and commands in *that* data. :-)
> >> >> >>
> >> >> >Proposed interface sticks pointers into ioctl data, so why doing
> >> >> >the same
> >> >> >for KVM_SET_IRQCHIP/KVM_GET_IRQCHIP makes you smile.
> >> >>
> >> >> There's a difference between putting a pointer in an ioctl
> >control
> >> >> structure that is specifically documented as being that way
> >(as in
> >> >> ONE_REG), versus taking an ioctl that claims to be
> >setting/getting a
> >> >> blob of state and embedding pointers in it.  It would be like
> >> >> sticking a pointer in the attribute payload of this API, which I
> >> >> think is something to be discouraged.
> >> >If documentation is what differentiate for you between silly
> >and smart
> >> >then write documentation instead of new interfaces.
> >>
> >> You mean like what we did with SREGS, that got deprecated and
> >> replaced with ONE_REG?
> >>
> >SREGS is implemented by ppc. I see ONE_REG as addition to REGS
> >interface. You can access all register at once or you can access them
> >one by one. If there is a need we can add MULTIPLE_REGS that will get
> >list of requested REGS.
> 
> http://www.spinics.net/lists/kvm-ppc/msg04876.html
> http://www.spinics.net/lists/kvm-ppc/msg05842.html
> 
If Alex prefers ONE_REG interface this is his call :)

> >The interface is not over generic i.e it does
> >not try to replace KVM_RUN by writing special register.
> 
> Sigh.
> 
> >>                                                    And we
> >> don't necessarily want to set *everything*.
> >What are those cases? You do need to on reset/migration.
> 
> Why do we want to set all the registers on reset, rather than tell
> the in-kernel device to reset?  The default state came from the
> kernel in the first place on irqchip creation...
> 
I have nothing against telling in-kernel device to reset provided there
is a way to do so, which current interface lacks. Reset in userspase has
its advantage too: bugs are easier to fix, there may be different kind
of resets (hard/soft).

> >> >>                                        It'd also be using
> >> >> KVM_SET_IRQCHIP to read data, which is the sort of thing you
> >object
> >> >> to later on regarding KVM_IRQ_LINE_STATUS.
> >> >>
> >> >Do not see why.
> >>
> >> It's either that, or have the data direction of the "chip" field in
> >> KVM_GET_IRQCHIP not be entirely in the "get" direction.
> >>
> >Still do not follow. Example?
> 
> struct kvm_irqchip has "chip_id", "pad", and "chip".  "pad" is
> insufficient to communicate attribute type plus a pointer.  So if we
> want to provide a pointer for the kernel to write the attribute
> into, it has to read from memory that the ioctl definition suggests
> should only be written to.
Yes, but this is not different from the interface you propose.

> 
> >> >Setting the address is setting an attribute. Sending MSI is a
> >command.
> >> >Things you set/get during init/migration are attributes. Things
> >you do
> >> >to cause side-effects are commands.
> >>
> >> What if I set the address at a time that isn't init/migration (the
> >> hardware allows moving it, like a PCI BAR)?  Suddenly it becomes not
> >> an attribute due to how the caller uses it?
> >>
> >What's the interface for guest to move it?
> 
> Some non-MPIC registers called CCSRBARH, CCSRBARL, and CCSRBAR.
> 
> >Why it goes via userspace?
> 
> Because the mechanism in question doesn't just move MPIC.  It moves
> a big block of a bunch of different devices all at once.
> 
> >You can move APIC base too, but this does not involve userspace. But
> >even if you do go via userspace, it is just a guest asking to
> >change device
> >configuration, so using SET_ATTR to set new configuration is fine.
> >
> >> >> What is the benefit of KVM_IRQ_LINE over what MPIC does?
> >What real
> >> >> (non-glue/wrapper) code can become common?
> >> >>
> >> >No new ioctl with exactly same result (well actually even
> >faster since
> >> >less copying is done).
> >>
> >> Which ioctl would go away?
> >>
> >Those that you propose in your new interface.
> 
> No, they wouldn't.  At most one MPIC attribute group would go away
> (though as I've noted it would still be useful to be able to "get"
> those attributes for debugging).
> 
> >> >> And I really hope you don't want us to do MSIs the x86 way.
> >> >>
> >> >What is wrong with KVM_SIGNAL_MSI? Except that you'll need code to
> >> >glue it
> >> >to mpic.
> >>
> >> We'll have to write extra code for it compared to the current way
> >> where it works with *zero* code beyond what is wanted for other
> >> purposes such as debug and (eventually) migration.  At least it's
> >> more direct than having to establish a GSI route...
> >If just writing a register cause MSI to be send how do you distinguish
> >between write that should send MSI and write that is done on migration
> >to transfer current value?
> 
> It is a write-only command register.  The registers that contain the
> state are elsewhere.
> 
The register may be write-only from OS point of view, but its internal
state may still need to be transfered on migration. This brings us back
to the point that state and commands should have different accessors.

> Again, we do not currently support migration on MPIC.  It is a very
> low priority for embedded.  We do not wish to rule it out entirely,
> but it most likely would require adding more state accesors.
The interface suppose to be generic, we are not talking about MPIC
specifically here. Regarding migration "never say never"  :)

> 
> >We had that problem with MSRs on x86. We had
> >to, eventually, add a flag that tells us the reason of MSR access.
> 
> The equivalent to that flag would be using the right kind of
> accessor for what you want to do (simulated guest access versus
> backdoor state access).
Ugh.

> 
> >> >> In the XICS thread, Paul brought up the possibliity of cascaded
> >> >> MPICs.  It's not relevant to the systems we're trying to
> >model, but
> >> >> if one did want to use the in-kernel irqchip interface for
> >that, it
> >> >> would be really nice to be able to operate on a specific MPIC for
> >> >> injection rather than have to come up with some sort of global
> >> >> identifier (above and beyond the minor flattening we'd need
> >to do to
> >> >> represent a single MPIC's interrupts in a flat numberspace).
> >> >>
> >> >ARM encodes information in irq field of KVM_IRQ_LINE like that:
> >> >  bits:  | 31 ... 24 | 23  ... 16 | 15    ...     0 |
> >> >  field: | irq_type  | vcpu_index |   irq_number    |
> >> >Why will similar approach not work?
> >>
> >> Well, it conflicts with the GSI routing stuff, and I don't see an
> >> irq chip ID field...
> >It does :( Can't say I am happy about it, but I skipped the discussion
> >about the interface back then and it is too late to complain now.
> >Since,
> >as you notices, irqfd interfaces with irq routing I wonder what's ARM
> >plan about it. But if you choose to go ARM way the format is ARM
> >specific,
> >so you can use your own encoding and put irq chip information there.
> 
> Well, we do want to support irqfd, so I don't think we'll be
> following ARM here.
> 
I think they want too. May be they have a plan to enhance irqfd. S390
people do something about it now. Haven't look at proposed patches yet.

> >> But otherwise (or assuming you mean to use such an encoding when
> >> setting up a GSI route), I didn't say this part couldn't be made to
> >> work.  It will require new kernel code for managing a GSI table in a
> >> non-APIC way, and a new callback into the device code, but as I've
> >> said elsewhere I think we need it for irqfd anyway.  If I use
> >> KVM_IRQ_LINE for injecting interrupts, do you still object to the
> >> rest of it?
> >The rest of what, proposed interface? There are two separate
> >discussions
> >happening here interleaved. First is "do we need to introduce new
> >generic
> >interface for device creation when existing one, albeit not ideal,
> >can be
> >used" and I am OK with that as long as ARM moves to it for 3.10,
> >although
> >I would prefer to have some example of what this interface will be
> >used
> >for besides irq chips otherwise it will be just another way to create
> >irqchip.
> 
> We need a new way to create irqchips anyway, even if it's just what
> the XICS patchset adds (KVM_CREATE_IRQCHIP_ARGS, which is similar to
> KVM_CREATE_DEVICE except it doesn't return an identifier for
> operating on a specific device).  And of course we want to sort this
> out before either patchset gets merged, so we don't end up adding
> both methods.  I suspect the XICS patchset flew under your radar
> because it has "PPC:" in front of it and the subject line doesn't
> mention the ioctl, but I'm not the only one that felt the need for a
> few new ioctls.
> 
I noticed the thread (there was in-kernel irqchip there to compensate for
PPC), but haven't read it before sending the email otherwise I would
have added here that you guys need to agree on common interface.

Now I looked closer into proposed interface. I am not sure why Paul
decided to implement KVM_CREATE_IRQCHIP_ARGS instead of using
KVM_CREATE_IRQCHIP. He supports only one type with his patches and I am
not sure if he is planning add something else. KVM_IRQCHIP_SET_SOURCES
looks like it tires to reimplement irq routing.
 
> As for other types of devices, x86 has i8254.c emulated in-kernel --
> I know that's not going to switch to the new interface, but it could
> have if it existed back then.
Since it is not going to switch it is not a good example. On x86 probably
interrupt remapping device will have to have some kernel component that
may take advantage of the new interface, but I haven't thought about
interrupt remapping implementation enough to be sure.

>                                I can also see creating an in-kernel
> emulation device for doing MMIO filtering on some piece of embedded
> hardware that guests need to access with reasonable performance, but
> the hardware desginers screwed up the protection slightly (e.g. put
> other things in the same 4K page).  We've done such filtering before
> in our standalone hypervisor; the question is whether it happens to
> anything with enough performance requirements to be done in the
> kernel.
I am not sure why special device is needed for such filtering. If MMIO
is not handled by the kernel it is forwarded to a userspace.

> 
> >Second one is "how the interface should look like". And here I
> >think that strong distinction is needed between setting the attributes
> >and sending commands with side effects for reasons explained all over
> >this ml thread.
> 
> OK, so let's just call them "commands".  I like the split into
> "read" and "write" commands, especially when most of the commands
> naturally come in such pairs, but if you don't like that part it can
> be reduced to a single read/write command (and then we'd define
> separate set/get commands where appropriate).
I think read/write will be simpler.

> 
> Note that the XICS patchset also involves device commands.  It does
> it by passing any unknown vm ioctl to the irqchip (XICS implements
> KVM_IRQCHIP_GET_SOURCES and KVM_IRQCHIP_SET_SOURCES in addition to
> KVM_IRQ_LINE).  Obviously both ways can work; I've given my reasons
> elsewhere in the thread for preferring something that doesn't
> require a new ioctl for every device command.
> 
> >> It's not about "silliness" as that this new thing I added for other
> >> reasons did the job just as well (again, except when it comes to
> >> irqfd), and avoided the need for a GSI table, etc.  IRQ injection
> >> was not the main point of the new interface.
> >Having generic interface for device creation and then make some
> >devices
> >special by allowing them to be used with KVM_IRQ_LINE makes little
> >sense,
> 
> Well, we'd want to document which devices tie into which generic
> interfaces, and which device is selected if multiple such devices
> are created (if that is allowed for a particular device class).
> 
> For KVM_IRQ_LINE, we could perhaps use the device id as the irqchip
> id in kvm_irq_routing_irqchip (or, have an attribute/command that
> retrieves the id to be used there).  Unfortunately there is no
> irqchip field in kvm_irq_routing_msi, though since it's basically a
> command to write to an arbitrary MMIO address, maybe it could just
> be implemented that way?
> 
How MSI is delivered with MPIC? From device point of view sending an MSI
is doing write at address X of data Y. How PPC with MPIC translates this
into actual interrupt?

> >> >> I see no need for a separate ioctl in terms of the underlying
> >> >> infrastructure for distinguishing "attribute" from "write-only
> >> >> command".  I'm open to improvements on what the ioctl is called.
> >> >> It's basically like setting a register on a device, except I was
> >> >> concerned that if we actually called it a "register" that people
> >> >> would take it too literally and think it's only for the
> >architected
> >> >> register state of the emulated device.
> >> >I agree "attribute" is better name than "register", but injecting
> >> >interrupt is not setting an attribute.
> >>
> >> It's a dynamic attribute -- the state of the input line.  Better
> >> names are welcome.  I don't see this difference as enough to warrant
> >> separate ioctls.
> >As long as you use the same attribute for migration and interrupt
> >injection
> >purpose I do. If you use separate attributes for migration and
> >interrupt
> >injection then not having separate ioctl is just a hack.
> 
> Why is it a hack?  Is it also a hack to not use a separate ioctl to
> reset the device, to move its address map, etc?
>
It is a hack because purpose of the interface becomes unclear. If you
see it called in a code you have no idea what semantics is expected.
For instance getting/setting of a state should be done when vcpus are
not running, but commands may be sent while they are. This also
encourage bugs when incorrect attribute is used during migration or vice
versa. In short interface does not help you, it requires you to read
documentation of each device very carefully. Reset is one thing that
will be implemented as a command ioctl. Moving address map depends. If
an address is a part of device's sate and will be migrated as such then it
is device attribute, otherwise use command to move it.

--
			Gleb.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Paul Mackerras Feb. 25, 2013, 1:11 a.m. UTC | #21
On Mon, Feb 18, 2013 at 02:21:59PM +0200, Gleb Natapov wrote:
> Copying Christoffer since ARM has in kernel irq chip too.
> 
> On Wed, Feb 13, 2013 at 11:49:15PM -0600, Scott Wood wrote:
> > Currently, devices that are emulated inside KVM are configured in a
> > hardcoded manner based on an assumption that any given architecture
> > only has one way to do it.  If there's any need to access device state,
> > it is done through inflexible one-purpose-only IOCTLs (e.g.
> > KVM_GET/SET_LAPIC).  Defining new IOCTLs for every little thing is
> > cumbersome and depletes a limited numberspace.
> > 
> > This API provides a mechanism to instantiate a device of a certain
> > type, returning an ID that can be used to set/get attributes of the
> > device.  Attributes may include configuration parameters (e.g.
> > register base address), device state, operational commands, etc.  It
> > is similar to the ONE_REG API, except that it acts on devices rather
> > than vcpus.
> You are not only provide different way to create in kernel irq chip you
> also use an alternate way to trigger interrupt lines. Before going into
> interface specifics lets think about whether it is really worth it? x86
> obviously support old way and will have to for some, very long, time.
> ARM vGIC code, that is ready to go upstream, uses old way too. So it will
> be 2 archs against one. Christoffer do you think the proposed way it
> better for your needs. Are you willing to make vGIC use it?

In fact there have been two distinct interrupt controller emulations
for PPC posted recently, Scott's and mine, with quite different
interfaces.  In contrast to Scott's device control API, where the
operations you would do for different interrupt controllers are quite
different, mine attempted to define a much more abstract interface for
interrupt controller emulations, with the idea that an interrupt
controller subsystem connects a set of interrupt sources, each with
some state, to a set of interrupt delivery mechanisms, one per vcpu,
also with some state.

Thus my interface had:

- a KVM_CREATE_IRQCHIP_ARGS ioctl, with an argument structure that
  indicates the overall architecture of the interrupt subsystem,

- KVM_IRQCHIP_SET_SOURCES and KVM_IRQCHIP_GET_SOURCES ioctls, that
  return or modify the state of some set of interrupt sources

- a KVM_REG_PPC_ICP_STATE identifier in the ONE_REG register
  identifier space, that is used to save and restore the per-vcpu
  interrupt presentation state.

Alternatively, I could modify my code to use the existing
KVM_CREATE_IRQCHIP, KVM_GET_IRQCHIP, and KVM_SET_IRQCHIP ioctls.  If I
were to do that I would want to rename the 'pad' field in struct
kvm_irqchip to 'nr' and use it with 'chip_id' to identify a range of
interrupt sources to be affected.  I'd also want to keep the ONE_REG
identifier for the per-vcpu state.

Or I could change over to using Scott's device control API, though I
have some objections to doing that.

What is your advice about which interface to use?

Paul.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Gleb Natapov Feb. 25, 2013, 1:09 p.m. UTC | #22
On Mon, Feb 25, 2013 at 12:11:19PM +1100, Paul Mackerras wrote:
> On Mon, Feb 18, 2013 at 02:21:59PM +0200, Gleb Natapov wrote:
> > Copying Christoffer since ARM has in kernel irq chip too.
> > 
> > On Wed, Feb 13, 2013 at 11:49:15PM -0600, Scott Wood wrote:
> > > Currently, devices that are emulated inside KVM are configured in a
> > > hardcoded manner based on an assumption that any given architecture
> > > only has one way to do it.  If there's any need to access device state,
> > > it is done through inflexible one-purpose-only IOCTLs (e.g.
> > > KVM_GET/SET_LAPIC).  Defining new IOCTLs for every little thing is
> > > cumbersome and depletes a limited numberspace.
> > > 
> > > This API provides a mechanism to instantiate a device of a certain
> > > type, returning an ID that can be used to set/get attributes of the
> > > device.  Attributes may include configuration parameters (e.g.
> > > register base address), device state, operational commands, etc.  It
> > > is similar to the ONE_REG API, except that it acts on devices rather
> > > than vcpus.
> > You are not only provide different way to create in kernel irq chip you
> > also use an alternate way to trigger interrupt lines. Before going into
> > interface specifics lets think about whether it is really worth it? x86
> > obviously support old way and will have to for some, very long, time.
> > ARM vGIC code, that is ready to go upstream, uses old way too. So it will
> > be 2 archs against one. Christoffer do you think the proposed way it
> > better for your needs. Are you willing to make vGIC use it?
> 
> In fact there have been two distinct interrupt controller emulations
> for PPC posted recently, Scott's and mine, with quite different
> interfaces.  In contrast to Scott's device control API, where the
> operations you would do for different interrupt controllers are quite
> different, mine attempted to define a much more abstract interface for
> interrupt controller emulations, with the idea that an interrupt
> controller subsystem connects a set of interrupt sources, each with
> some state, to a set of interrupt delivery mechanisms, one per vcpu,
> also with some state.
> 
I agree with Scott that it is futile to try and come up with generic
irqchip configuration interface and I doubt it is needed from QEMU
or other userspace pov. I looked at proposed KVM_IRQCHIP_SET_SOURCES
interface and while it is possible to pass some information about
pic/ioapic using it there will be a lot of information that will not
fit there. For one there is global irqchips related state and proposed
interface only talk about interrupt sources. Another is that using
generic interface will require us to have a code that translate irqchip
representation into this generic one and back for no apparent gain.
Currently pic/ioapic state is very similar to what HW specification
defines and it is not going to change.

Looking at your implementation of KVM_IRQCHIP_SET_SOURCES I wounder how
well it work for migration though. The interface suppose to transfer
irqchips state as is, but I see things like that in your code:

                       mutex_lock(&ics->lock);
                       irqp->server = val & KVM_IRQ_SERVER_MASK;
                       irqp->priority = val >> KVM_IRQ_PRIORITY_SHIFT;
                       irqp->resend = 0;
                       irqp->masked_pending = 0;
                       irqp->asserted = 0;

Why it is safe to initialize those values to default values during
migration? Wouldn't it be simpler and more correct to just transfer
the whole content of irqp from src to dst?

> Thus my interface had:
> 
> - a KVM_CREATE_IRQCHIP_ARGS ioctl, with an argument structure that
>   indicates the overall architecture of the interrupt subsystem,
> 
> - KVM_IRQCHIP_SET_SOURCES and KVM_IRQCHIP_GET_SOURCES ioctls, that
>   return or modify the state of some set of interrupt sources
> 
> - a KVM_REG_PPC_ICP_STATE identifier in the ONE_REG register
>   identifier space, that is used to save and restore the per-vcpu
>   interrupt presentation state.
> 
> Alternatively, I could modify my code to use the existing
> KVM_CREATE_IRQCHIP, KVM_GET_IRQCHIP, and KVM_SET_IRQCHIP ioctls.  If I
> were to do that I would want to rename the 'pad' field in struct
> kvm_irqchip to 'nr' and use it with 'chip_id' to identify a range of
> interrupt sources to be affected.  I'd also want to keep the ONE_REG
> identifier for the per-vcpu state.
It is preferable to the interface you propose since I do not think your
interface fits other interrupt controllers well. You can put nr field
into dummy[] payload, or replace pad with "union {pad, nr}".

> 
> Or I could change over to using Scott's device control API, though I
> have some objections to doing that.
> 
> What is your advice about which interface to use?
> 
The ideal situation would be if you and Scott agree on the details about
the interface. If you don't like something about Scott's interface we
can discuss it and shape it to something you agree with or even like.
I already asked Scott to introduce command interface. Scott does not
care about migration, you do, so you can make sure that interface works
for that.

--
			Gleb.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alexander Graf Feb. 25, 2013, 3:23 p.m. UTC | #23
On 24.02.2013, at 16:46, Gleb Natapov wrote:

> On Thu, Feb 21, 2013 at 08:17:54PM -0600, Scott Wood wrote:
>> On 02/21/2013 02:22:09 AM, Gleb Natapov wrote:
>>> On Wed, Feb 20, 2013 at 08:05:12PM -0600, Scott Wood wrote:
>>>> On 02/20/2013 07:09:49 AM, Gleb Natapov wrote:
>>>>> On Tue, Feb 19, 2013 at 03:16:37PM -0600, Scott Wood wrote:
>>>>>> On 02/19/2013 06:24:18 AM, Gleb Natapov wrote:
>>>>>>> On Mon, Feb 18, 2013 at 05:01:40PM -0600, Scott Wood wrote:
>>>>>>>> The ability to set/get attributes is needed.  Sorry, but "get
>>>>> or set
>>>>>>>> one blob of data, up to 512 bytes, for the entire irqchip" is
>>>>> just
>>>>>>>> not good enough -- assuming you don't want us to start
>>> sticking
>>>>>>>> pointers and commands in *that* data. :-)
>>>>>>>> 
>>>>>>> Proposed interface sticks pointers into ioctl data, so why doing
>>>>>>> the same
>>>>>>> for KVM_SET_IRQCHIP/KVM_GET_IRQCHIP makes you smile.
>>>>>> 
>>>>>> There's a difference between putting a pointer in an ioctl
>>> control
>>>>>> structure that is specifically documented as being that way
>>> (as in
>>>>>> ONE_REG), versus taking an ioctl that claims to be
>>> setting/getting a
>>>>>> blob of state and embedding pointers in it.  It would be like
>>>>>> sticking a pointer in the attribute payload of this API, which I
>>>>>> think is something to be discouraged.
>>>>> If documentation is what differentiate for you between silly
>>> and smart
>>>>> then write documentation instead of new interfaces.
>>>> 
>>>> You mean like what we did with SREGS, that got deprecated and
>>>> replaced with ONE_REG?
>>>> 
>>> SREGS is implemented by ppc. I see ONE_REG as addition to REGS
>>> interface. You can access all register at once or you can access them
>>> one by one. If there is a need we can add MULTIPLE_REGS that will get
>>> list of requested REGS.
>> 
>> http://www.spinics.net/lists/kvm-ppc/msg04876.html
>> http://www.spinics.net/lists/kvm-ppc/msg05842.html
>> 
> If Alex prefers ONE_REG interface this is his call :)

There's a reason I prefer it :). We ran out of space in the SREGS call, we had a bit map with multiple chunks of data you could set in there and worst of all, when we ran out of space we didn't realize it until the code was already in Linus' tree!

The normal "here's a chunk of data" ioctl style works great when you know all or most of the details of what you want to transfer. In the case of kvm, we need to expose an interface we

  a) Don't control itself. Hardware guys make hardware. We just follow.
  b) Is pretty complex. There can be a lot of state in a CPU / device.

So because if that we needed something more flexible and dynamic. Something where we could add registers we forgot without jumping through lots of hoops.

That's why I prefer ONE_REG for vcpus. Since I see the same danger on device emulation, I tend to prefer it here as well.

> 
>>> The interface is not over generic i.e it does
>>> not try to replace KVM_RUN by writing special register.
>> 
>> Sigh.
>> 
>>>>                                                   And we
>>>> don't necessarily want to set *everything*.
>>> What are those cases? You do need to on reset/migration.
>> 
>> Why do we want to set all the registers on reset, rather than tell
>> the in-kernel device to reset?  The default state came from the
>> kernel in the first place on irqchip creation...
>> 
> I have nothing against telling in-kernel device to reset provided there
> is a way to do so, which current interface lacks. Reset in userspase has
> its advantage too: bugs are easier to fix, there may be different kind
> of resets (hard/soft).

The same argument holds for state access. If you suddenly realize "oh, I need to also set this one hidden state bit in my hardware" you only want to add that specific write on reset.

If you suddenly find a race condition in migration that happens very rarely, you want to add the one register that you forgot to synchronize in addition in your new code. You don't want to change the whole ioctl passed struct.

That's why we have the padding in our ioctl structs. And feature bitmaps. And once we exceed them, we're screwed.

I want to make sure we don't hit that point.

Actually, there's a really easy way to understand what this interface does. It basically implements a state get/set ioctl which allows for very fine grained control over your ioctl struct. It does that by only passing one register at a time.

If we run into concurrency issues (multiple registers need to be set in a locked manner at the same time), we can always add a new ioctl that sets multiple attrs/regs at once. For CPUs, we haven't hit that case yet.

> 
>>>>>>                                       It'd also be using
>>>>>> KVM_SET_IRQCHIP to read data, which is the sort of thing you
>>> object
>>>>>> to later on regarding KVM_IRQ_LINE_STATUS.
>>>>>> 
>>>>> Do not see why.
>>>> 
>>>> It's either that, or have the data direction of the "chip" field in
>>>> KVM_GET_IRQCHIP not be entirely in the "get" direction.
>>>> 
>>> Still do not follow. Example?
>> 
>> struct kvm_irqchip has "chip_id", "pad", and "chip".  "pad" is
>> insufficient to communicate attribute type plus a pointer.  So if we
>> want to provide a pointer for the kernel to write the attribute
>> into, it has to read from memory that the ioctl definition suggests
>> should only be written to.
> Yes, but this is not different from the interface you propose.
> 
>> 
>>>>> Setting the address is setting an attribute. Sending MSI is a
>>> command.
>>>>> Things you set/get during init/migration are attributes. Things
>>> you do
>>>>> to cause side-effects are commands.
>>>> 
>>>> What if I set the address at a time that isn't init/migration (the
>>>> hardware allows moving it, like a PCI BAR)?  Suddenly it becomes not
>>>> an attribute due to how the caller uses it?
>>>> 
>>> What's the interface for guest to move it?
>> 
>> Some non-MPIC registers called CCSRBARH, CCSRBARL, and CCSRBAR.
>> 
>>> Why it goes via userspace?
>> 
>> Because the mechanism in question doesn't just move MPIC.  It moves
>> a big block of a bunch of different devices all at once.
>> 
>>> You can move APIC base too, but this does not involve userspace. But
>>> even if you do go via userspace, it is just a guest asking to
>>> change device
>>> configuration, so using SET_ATTR to set new configuration is fine.
>>> 
>>>>>> What is the benefit of KVM_IRQ_LINE over what MPIC does?
>>> What real
>>>>>> (non-glue/wrapper) code can become common?
>>>>>> 
>>>>> No new ioctl with exactly same result (well actually even
>>> faster since
>>>>> less copying is done).
>>>> 
>>>> Which ioctl would go away?
>>>> 
>>> Those that you propose in your new interface.
>> 
>> No, they wouldn't.  At most one MPIC attribute group would go away
>> (though as I've noted it would still be useful to be able to "get"
>> those attributes for debugging).
>> 
>>>>>> And I really hope you don't want us to do MSIs the x86 way.
>>>>>> 
>>>>> What is wrong with KVM_SIGNAL_MSI? Except that you'll need code to
>>>>> glue it
>>>>> to mpic.
>>>> 
>>>> We'll have to write extra code for it compared to the current way
>>>> where it works with *zero* code beyond what is wanted for other
>>>> purposes such as debug and (eventually) migration.  At least it's
>>>> more direct than having to establish a GSI route...
>>> If just writing a register cause MSI to be send how do you distinguish
>>> between write that should send MSI and write that is done on migration
>>> to transfer current value?
>> 
>> It is a write-only command register.  The registers that contain the
>> state are elsewhere.
>> 
> The register may be write-only from OS point of view, but its internal
> state may still need to be transfered on migration. This brings us back
> to the point that state and commands should have different accessors.

Scott expressed this through his "groups". Hidden registers are a separate group, but still accessed using the same ideology of one register at a time. I think that makes sense.

> 
>> Again, we do not currently support migration on MPIC.  It is a very
>> low priority for embedded.  We do not wish to rule it out entirely,
>> but it most likely would require adding more state accesors.
> The interface suppose to be generic, we are not talking about MPIC
> specifically here. Regarding migration "never say never"  :)
> 
>> 
>>> We had that problem with MSRs on x86. We had
>>> to, eventually, add a flag that tells us the reason of MSR access.
>> 
>> The equivalent to that flag would be using the right kind of
>> accessor for what you want to do (simulated guest access versus
>> backdoor state access).
> Ugh.

If we need to modify hidden state for migration, we can always create hidden registers in this interface. We won't know until we actually implement it. And then a new MPIC revision gets released and we need to do things all over again.

The point of a generic get/set interface really is that we can change and extend the type of state we move between user and kernel space.

> 
>> 
>>>>>> In the XICS thread, Paul brought up the possibliity of cascaded
>>>>>> MPICs.  It's not relevant to the systems we're trying to
>>> model, but
>>>>>> if one did want to use the in-kernel irqchip interface for
>>> that, it
>>>>>> would be really nice to be able to operate on a specific MPIC for
>>>>>> injection rather than have to come up with some sort of global
>>>>>> identifier (above and beyond the minor flattening we'd need
>>> to do to
>>>>>> represent a single MPIC's interrupts in a flat numberspace).
>>>>>> 
>>>>> ARM encodes information in irq field of KVM_IRQ_LINE like that:
>>>>>  bits:  | 31 ... 24 | 23  ... 16 | 15    ...     0 |
>>>>> field: | irq_type  | vcpu_index |   irq_number    |
>>>>> Why will similar approach not work?
>>>> 
>>>> Well, it conflicts with the GSI routing stuff, and I don't see an
>>>> irq chip ID field...
>>> It does :( Can't say I am happy about it, but I skipped the discussion
>>> about the interface back then and it is too late to complain now.
>>> Since,
>>> as you notices, irqfd interfaces with irq routing I wonder what's ARM
>>> plan about it. But if you choose to go ARM way the format is ARM
>>> specific,
>>> so you can use your own encoding and put irq chip information there.
>> 
>> Well, we do want to support irqfd, so I don't think we'll be
>> following ARM here.
>> 
> I think they want too. May be they have a plan to enhance irqfd. S390
> people do something about it now. Haven't look at proposed patches yet.
> 
>>>> But otherwise (or assuming you mean to use such an encoding when
>>>> setting up a GSI route), I didn't say this part couldn't be made to
>>>> work.  It will require new kernel code for managing a GSI table in a
>>>> non-APIC way, and a new callback into the device code, but as I've
>>>> said elsewhere I think we need it for irqfd anyway.  If I use
>>>> KVM_IRQ_LINE for injecting interrupts, do you still object to the
>>>> rest of it?
>>> The rest of what, proposed interface? There are two separate
>>> discussions
>>> happening here interleaved. First is "do we need to introduce new
>>> generic
>>> interface for device creation when existing one, albeit not ideal,
>>> can be
>>> used" and I am OK with that as long as ARM moves to it for 3.10,
>>> although
>>> I would prefer to have some example of what this interface will be
>>> used
>>> for besides irq chips otherwise it will be just another way to create
>>> irqchip.
>> 
>> We need a new way to create irqchips anyway, even if it's just what
>> the XICS patchset adds (KVM_CREATE_IRQCHIP_ARGS, which is similar to
>> KVM_CREATE_DEVICE except it doesn't return an identifier for
>> operating on a specific device).  And of course we want to sort this
>> out before either patchset gets merged, so we don't end up adding
>> both methods.  I suspect the XICS patchset flew under your radar
>> because it has "PPC:" in front of it and the subject line doesn't
>> mention the ioctl, but I'm not the only one that felt the need for a
>> few new ioctls.
>> 
> I noticed the thread (there was in-kernel irqchip there to compensate for
> PPC), but haven't read it before sending the email otherwise I would
> have added here that you guys need to agree on common interface.
> 
> Now I looked closer into proposed interface. I am not sure why Paul
> decided to implement KVM_CREATE_IRQCHIP_ARGS instead of using
> KVM_CREATE_IRQCHIP. He supports only one type with his patches and I am
> not sure if he is planning add something else. KVM_IRQCHIP_SET_SOURCES
> looks like it tires to reimplement irq routing.

Because the same host hardware can have either an in-kernel MPIC or in-kernel XICS, depending on the type of guest you want to run.

> 
>> As for other types of devices, x86 has i8254.c emulated in-kernel --
>> I know that's not going to switch to the new interface, but it could
>> have if it existed back then.
> Since it is not going to switch it is not a good example. On x86 probably
> interrupt remapping device will have to have some kernel component that
> may take advantage of the new interface, but I haven't thought about
> interrupt remapping implementation enough to be sure.
> 
>>                               I can also see creating an in-kernel
>> emulation device for doing MMIO filtering on some piece of embedded
>> hardware that guests need to access with reasonable performance, but
>> the hardware desginers screwed up the protection slightly (e.g. put
>> other things in the same 4K page).  We've done such filtering before
>> in our standalone hypervisor; the question is whether it happens to
>> anything with enough performance requirements to be done in the
>> kernel.
> I am not sure why special device is needed for such filtering. If MMIO
> is not handled by the kernel it is forwarded to a userspace.

Have you attended my talk at KVM Forum? :)

Emulated device access (PIO / MMIO) in kvm is about 2-3x faster than going to QEMU on x86. Pure roundtrip time. And it gets worse with hypercalls. See slide 41:

  https://dl.dropbox.com/u/8976842/KVM%20Forum%202012/MMIO%20Tuning.pdf

I haven't done these benchmarks on PPC yet, but I don't expect them to look vastly different.

> 
>> 
>>> Second one is "how the interface should look like". And here I
>>> think that strong distinction is needed between setting the attributes
>>> and sending commands with side effects for reasons explained all over
>>> this ml thread.
>> 
>> OK, so let's just call them "commands".  I like the split into
>> "read" and "write" commands, especially when most of the commands
>> naturally come in such pairs, but if you don't like that part it can
>> be reduced to a single read/write command (and then we'd define
>> separate set/get commands where appropriate).
> I think read/write will be simpler.
> 
>> 
>> Note that the XICS patchset also involves device commands.  It does
>> it by passing any unknown vm ioctl to the irqchip (XICS implements
>> KVM_IRQCHIP_GET_SOURCES and KVM_IRQCHIP_SET_SOURCES in addition to
>> KVM_IRQ_LINE).  Obviously both ways can work; I've given my reasons
>> elsewhere in the thread for preferring something that doesn't
>> require a new ioctl for every device command.
>> 
>>>> It's not about "silliness" as that this new thing I added for other
>>>> reasons did the job just as well (again, except when it comes to
>>>> irqfd), and avoided the need for a GSI table, etc.  IRQ injection
>>>> was not the main point of the new interface.
>>> Having generic interface for device creation and then make some
>>> devices
>>> special by allowing them to be used with KVM_IRQ_LINE makes little
>>> sense,
>> 
>> Well, we'd want to document which devices tie into which generic
>> interfaces, and which device is selected if multiple such devices
>> are created (if that is allowed for a particular device class).
>> 
>> For KVM_IRQ_LINE, we could perhaps use the device id as the irqchip
>> id in kvm_irq_routing_irqchip (or, have an attribute/command that
>> retrieves the id to be used there).  Unfortunately there is no
>> irqchip field in kvm_irq_routing_msi, though since it's basically a
>> command to write to an arbitrary MMIO address, maybe it could just
>> be implemented that way?
>> 
> How MSI is delivered with MPIC? From device point of view sending an MSI
> is doing write at address X of data Y. How PPC with MPIC translates this
> into actual interrupt?

The device DMAs into an MPIC register to trigger the MSI. The MPIC then enables the "level high" bit for that particular internal interrupt line. It also triggers an interrupt on a CPU - the same way it would with level or timer interrupts. Filtering and priority also work the same way. Once the guest ACKs that interrupt, the "level high" bit for that MSI line gets cleared.

So essentially, the MPIC converts MSIs into level based interrupts.

That's why pushing an MSI into the in-kernel MPIC by using the live migration interface works. All you need to do is live migrate the "MSI interrupt line is active" bit into the device state.

> 
>>>>>> I see no need for a separate ioctl in terms of the underlying
>>>>>> infrastructure for distinguishing "attribute" from "write-only
>>>>>> command".  I'm open to improvements on what the ioctl is called.
>>>>>> It's basically like setting a register on a device, except I was
>>>>>> concerned that if we actually called it a "register" that people
>>>>>> would take it too literally and think it's only for the
>>> architected
>>>>>> register state of the emulated device.
>>>>> I agree "attribute" is better name than "register", but injecting
>>>>> interrupt is not setting an attribute.
>>>> 
>>>> It's a dynamic attribute -- the state of the input line.  Better
>>>> names are welcome.  I don't see this difference as enough to warrant
>>>> separate ioctls.
>>> As long as you use the same attribute for migration and interrupt
>>> injection
>>> purpose I do. If you use separate attributes for migration and
>>> interrupt
>>> injection then not having separate ioctl is just a hack.
>> 
>> Why is it a hack?  Is it also a hack to not use a separate ioctl to
>> reset the device, to move its address map, etc?
>> 
> It is a hack because purpose of the interface becomes unclear. If you
> see it called in a code you have no idea what semantics is expected.
> For instance getting/setting of a state should be done when vcpus are
> not running, but commands may be sent while they are. This also
> encourage bugs when incorrect attribute is used during migration or vice
> versa. In short interface does not help you, it requires you to read
> documentation of each device very carefully. Reset is one thing that
> will be implemented as a command ioctl. Moving address map depends. If
> an address is a part of device's sate and will be migrated as such then it
> is device attribute, otherwise use command to move it.

I agree here. I think it makes sense to have a separate interface to set interrupt states. This separate interface can be a separate group inside this interface too. But I agree that we shouldn't use the live migration interface to set the interrupt line of the MPIC.


Alex

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alexander Graf Feb. 25, 2013, 3:29 p.m. UTC | #24
On 25.02.2013, at 14:09, Gleb Natapov wrote:

> On Mon, Feb 25, 2013 at 12:11:19PM +1100, Paul Mackerras wrote:
>> On Mon, Feb 18, 2013 at 02:21:59PM +0200, Gleb Natapov wrote:
>>> Copying Christoffer since ARM has in kernel irq chip too.
>>> 
>>> On Wed, Feb 13, 2013 at 11:49:15PM -0600, Scott Wood wrote:
>>>> Currently, devices that are emulated inside KVM are configured in a
>>>> hardcoded manner based on an assumption that any given architecture
>>>> only has one way to do it.  If there's any need to access device state,
>>>> it is done through inflexible one-purpose-only IOCTLs (e.g.
>>>> KVM_GET/SET_LAPIC).  Defining new IOCTLs for every little thing is
>>>> cumbersome and depletes a limited numberspace.
>>>> 
>>>> This API provides a mechanism to instantiate a device of a certain
>>>> type, returning an ID that can be used to set/get attributes of the
>>>> device.  Attributes may include configuration parameters (e.g.
>>>> register base address), device state, operational commands, etc.  It
>>>> is similar to the ONE_REG API, except that it acts on devices rather
>>>> than vcpus.
>>> You are not only provide different way to create in kernel irq chip you
>>> also use an alternate way to trigger interrupt lines. Before going into
>>> interface specifics lets think about whether it is really worth it? x86
>>> obviously support old way and will have to for some, very long, time.
>>> ARM vGIC code, that is ready to go upstream, uses old way too. So it will
>>> be 2 archs against one. Christoffer do you think the proposed way it
>>> better for your needs. Are you willing to make vGIC use it?
>> 
>> In fact there have been two distinct interrupt controller emulations
>> for PPC posted recently, Scott's and mine, with quite different
>> interfaces.  In contrast to Scott's device control API, where the
>> operations you would do for different interrupt controllers are quite
>> different, mine attempted to define a much more abstract interface for
>> interrupt controller emulations, with the idea that an interrupt
>> controller subsystem connects a set of interrupt sources, each with
>> some state, to a set of interrupt delivery mechanisms, one per vcpu,
>> also with some state.
>> 
> I agree with Scott that it is futile to try and come up with generic
> irqchip configuration interface and I doubt it is needed from QEMU
> or other userspace pov. I looked at proposed KVM_IRQCHIP_SET_SOURCES
> interface and while it is possible to pass some information about
> pic/ioapic using it there will be a lot of information that will not
> fit there. For one there is global irqchips related state and proposed
> interface only talk about interrupt sources. Another is that using
> generic interface will require us to have a code that translate irqchip
> representation into this generic one and back for no apparent gain.
> Currently pic/ioapic state is very similar to what HW specification
> defines and it is not going to change.
> 
> Looking at your implementation of KVM_IRQCHIP_SET_SOURCES I wounder how
> well it work for migration though. The interface suppose to transfer
> irqchips state as is, but I see things like that in your code:
> 
>                       mutex_lock(&ics->lock);
>                       irqp->server = val & KVM_IRQ_SERVER_MASK;
>                       irqp->priority = val >> KVM_IRQ_PRIORITY_SHIFT;
>                       irqp->resend = 0;
>                       irqp->masked_pending = 0;
>                       irqp->asserted = 0;
> 
> Why it is safe to initialize those values to default values during
> migration? Wouldn't it be simpler and more correct to just transfer
> the whole content of irqp from src to dst?
> 
>> Thus my interface had:
>> 
>> - a KVM_CREATE_IRQCHIP_ARGS ioctl, with an argument structure that
>>  indicates the overall architecture of the interrupt subsystem,
>> 
>> - KVM_IRQCHIP_SET_SOURCES and KVM_IRQCHIP_GET_SOURCES ioctls, that
>>  return or modify the state of some set of interrupt sources
>> 
>> - a KVM_REG_PPC_ICP_STATE identifier in the ONE_REG register
>>  identifier space, that is used to save and restore the per-vcpu
>>  interrupt presentation state.
>> 
>> Alternatively, I could modify my code to use the existing
>> KVM_CREATE_IRQCHIP, KVM_GET_IRQCHIP, and KVM_SET_IRQCHIP ioctls.  If I
>> were to do that I would want to rename the 'pad' field in struct
>> kvm_irqchip to 'nr' and use it with 'chip_id' to identify a range of
>> interrupt sources to be affected.  I'd also want to keep the ONE_REG
>> identifier for the per-vcpu state.
> It is preferable to the interface you propose since I do not think your
> interface fits other interrupt controllers well. You can put nr field
> into dummy[] payload, or replace pad with "union {pad, nr}".
> 
>> 
>> Or I could change over to using Scott's device control API, though I
>> have some objections to doing that.
>> 
>> What is your advice about which interface to use?
>> 
> The ideal situation would be if you and Scott agree on the details about
> the interface. If you don't like something about Scott's interface we
> can discuss it and shape it to something you agree with or even like.
> I already asked Scott to introduce command interface. Scott does not
> care about migration, you do, so you can make sure that interface works
> for that.

I agree. I really want to see one style used for both devices. If we come to the conclusion that we need to go old-style ioctl payload style, then fine. I obviously prefer fine-grained individual state accessors.

But really, even if the "generic interface" only generalizes MPIC and XICS it's already a win for me :).


Alex

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Paul Mackerras March 6, 2013, 12:59 a.m. UTC | #25
On Wed, Feb 13, 2013 at 11:49:15PM -0600, Scott Wood wrote:
> Currently, devices that are emulated inside KVM are configured in a
> hardcoded manner based on an assumption that any given architecture
> only has one way to do it.  If there's any need to access device state,
> it is done through inflexible one-purpose-only IOCTLs (e.g.
> KVM_GET/SET_LAPIC).  Defining new IOCTLs for every little thing is
> cumbersome and depletes a limited numberspace.
> 
> This API provides a mechanism to instantiate a device of a certain
> type, returning an ID that can be used to set/get attributes of the
> device.  Attributes may include configuration parameters (e.g.
> register base address), device state, operational commands, etc.  It
> is similar to the ONE_REG API, except that it acts on devices rather
> than vcpus.
> 
> Both device types and individual attributes can be tested without having
> to create the device or get/set the attribute, without the need for
> separately managing enumerated capabilities.

The API looks fine to me.  Ultimately I could use a version of the
get/set attribute ioctls that get or set multiple attributes within a
group, but that can come later.

Were you thinking that the attribute codes should encode the size of
the attribute, like the one_reg register IDs do?  If so it would be
good to define the bitfield and values for that in this patch.

The one comment I have on the implementation is that it doesn't seem
to conveniently allow for multiple instances of a device type, since
there is no instance-specific pointer stored anywhere.  More comments
below...

> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 0350e0d..dbaf012 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -335,6 +335,25 @@ struct kvm_memslots {
>  	short id_to_index[KVM_MEM_SLOTS_NUM];
>  };
>  
> +/*
> + * The worst case number of simultaneous devices will likely be very low
> + * (usually zero or one) for the forseeable future.  If the worst case
> + * exceeds this, then it can be increased, or we can convert to idr.
> + */
> +#define KVM_MAX_DEVICES 4
> +
> +struct kvm_device {
> +	u32 type;
> +
> +	int (*set_attr)(struct kvm *kvm, struct kvm_device *dev,
> +			struct kvm_device_attr *attr);
> +	int (*get_attr)(struct kvm *kvm, struct kvm_device *dev,
> +			struct kvm_device_attr *attr);
> +	int (*has_attr)(struct kvm *kvm, struct kvm_device *dev,
> +			struct kvm_device_attr *attr);
> +	void (*destroy)(struct kvm *kvm, struct kvm_device *dev);
> +};

This is more of a device class definition than a device instance
definition.  There is nothing in this struct that would be different
between different instances of a given device, and in fact it would
make sense to use the one copy of this struct for all instances of a
given type.  However, the functions listed here only take the struct
kvm_device pointer, meaning that to distinguish between instances,
these functions would have to do some sort of container_of trick to
know which instance to operate on.

I think it would make more sense either to add a void * instance data
pointer to struct kvm_device, or to add a void * argument to those
functions as an instance data pointer and add another field to struct
kvm like this:

> +
>  struct kvm {
>  	spinlock_t mmu_lock;
>  	struct mutex slots_lock;
> @@ -385,6 +404,8 @@ struct kvm {
>  	long mmu_notifier_count;
>  #endif
>  	long tlbs_dirty;
> +	struct kvm_device *devices[KVM_MAX_DEVICES];

+	void *device_instance[KVM_MAX_DEVICES];

Paul.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Benjamin Herrenschmidt March 6, 2013, 2:48 a.m. UTC | #26
On Wed, 2013-02-13 at 23:49 -0600, Scott Wood wrote:
> Currently, devices that are emulated inside KVM are configured in a
> hardcoded manner based on an assumption that any given architecture
> only has one way to do it.  If there's any need to access device state,
> it is done through inflexible one-purpose-only IOCTLs (e.g.
> KVM_GET/SET_LAPIC).  Defining new IOCTLs for every little thing is
> cumbersome and depletes a limited numberspace.
> 
> This API provides a mechanism to instantiate a device of a certain
> type, returning an ID that can be used to set/get attributes of the
> device.  Attributes may include configuration parameters (e.g.
> register base address), device state, operational commands, etc.  It
> is similar to the ONE_REG API, except that it acts on devices rather
> than vcpus.

Allright guys, let's take a break for a minute :-)

What you seem to be proposing is a whole new construct / API to create
"device" objects with "attributes" as a way to avoid adding tons of
ioctls to the VM.

Then you somewhat "coerce" behaviours (ie. methods) as side effects of
setting some of those attributes, and create some kind of rigid API
through which any kind of potential device "attribute" needs to be
coerced through.

Essentially you are trying to re-invent encapsulation of kernel objects
manipulated by userspace, we already have several mechanisms for doing
just that and you are trying to add yet a new one :-)

What about instead using existing mechanisms for doing just that:

Make your "create device" return an anonymous file descriptor !

That gives you everything ... private ioctl's which can do whatever the
heck you want (attributes, methods, etc...), mmap, etc...

Guess what ? That's already what we do for various things like our
in-kernel emulated iommu tables as far as I can remember.

If your problem is to avoid the bottleneck of having to deal with
upstream maintainers for generic VM ioctls every time you add some new
platform specific kernel "object" then this is probably a much better
approach.

Cheers,
Ben.


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Gleb Natapov March 6, 2013, 10:18 a.m. UTC | #27
On Wed, Mar 06, 2013 at 01:48:33PM +1100, Benjamin Herrenschmidt wrote:
> On Wed, 2013-02-13 at 23:49 -0600, Scott Wood wrote:
> > Currently, devices that are emulated inside KVM are configured in a
> > hardcoded manner based on an assumption that any given architecture
> > only has one way to do it.  If there's any need to access device state,
> > it is done through inflexible one-purpose-only IOCTLs (e.g.
> > KVM_GET/SET_LAPIC).  Defining new IOCTLs for every little thing is
> > cumbersome and depletes a limited numberspace.
> > 
> > This API provides a mechanism to instantiate a device of a certain
> > type, returning an ID that can be used to set/get attributes of the
> > device.  Attributes may include configuration parameters (e.g.
> > register base address), device state, operational commands, etc.  It
> > is similar to the ONE_REG API, except that it acts on devices rather
> > than vcpus.
> 
> Allright guys, let's take a break for a minute :-)
> 
> What you seem to be proposing is a whole new construct / API to create
> "device" objects with "attributes" as a way to avoid adding tons of
> ioctls to the VM.
> 
> Then you somewhat "coerce" behaviours (ie. methods) as side effects of
> setting some of those attributes, and create some kind of rigid API
> through which any kind of potential device "attribute" needs to be
> coerced through.
> 
> Essentially you are trying to re-invent encapsulation of kernel objects
> manipulated by userspace, we already have several mechanisms for doing
> just that and you are trying to add yet a new one :-)
> 
> What about instead using existing mechanisms for doing just that:
> 
> Make your "create device" return an anonymous file descriptor !
> 
That was faster that I predicted! :)

http://www.spinics.net/lists/kvm/msg86687.html last paragraph.

> That gives you everything ... private ioctl's which can do whatever the
> heck you want (attributes, methods, etc...), mmap, etc...
> 
> Guess what ? That's already what we do for various things like our
> in-kernel emulated iommu tables as far as I can remember.
> 
> If your problem is to avoid the bottleneck of having to deal with
> upstream maintainers for generic VM ioctls every time you add some new
> platform specific kernel "object" then this is probably a much better
> approach.
> 
> Cheers,
> Ben.
> 

--
			Gleb.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index c2534c3..5bcdb42 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -2122,6 +2122,82 @@  header; first `n_valid' valid entries with contents from the data
 written, then `n_invalid' invalid entries, invalidating any previously
 valid entries found.
 
+4.79 KVM_CREATE_DEVICE
+
+Capability: KVM_CAP_DEVICE_CTRL
+Type: vm ioctl
+Parameters: struct kvm_create_device (in/out)
+Returns: 0 on success, -1 on error
+Errors:
+  ENODEV: The device type is unknown or unsupported
+  EEXIST: Device already created, and this type of device may not
+          be instantiated multiple times
+  ENOSPC: Too many devices have been created
+
+  Other error conditions may be defined by individual device types.
+
+Creates an emulated device in the kernel.  The returned handle
+can be used with KVM_SET/GET_DEVICE_ATTR.
+
+If the KVM_CREATE_DEVICE_TEST flag is set, only test whether the
+device type is supported (not necessarily whether it can be created
+in the current vm).
+
+Individual devices should not define flags.  Attributes should be used
+for specifying any behavior that is not implied by the device type
+number.
+
+struct kvm_create_device {
+	__u32	type;	/* in: KVM_DEV_TYPE_xxx */
+	__u32	id;	/* out: device handle */
+	__u32	flags;	/* in: KVM_CREATE_DEVICE_xxx */
+};
+
+4.80 KVM_SET_DEVICE_ATTR/KVM_GET_DEVICE_ATTR
+
+Capability: KVM_CAP_DEVICE_CTRL
+Type: vm ioctl
+Parameters: struct kvm_device_attr
+Returns: 0 on success, -1 on error
+Errors:
+  ENODEV: The device id is invalid
+  ENXIO:  The group or attribute is unknown/unsupported for this device
+  EPERM:  The attribute cannot (currently) be accessed this way
+          (e.g. read-only attribute, or attribute that only makes
+          sense when the device is in a different state)
+
+  Other error conditions may be defined by individual device types.
+
+Gets/sets a specified piece of device configuration and/or state.  The
+semantics are device-specific except for certain global attributes.  See
+individual device documentation in the "devices" directory.  As with
+ONE_REG, the size of the data transferred is defined by the particular
+attribute.
+
+Attributes in group KVM_DEV_ATTR_COMMON are not device-specific:
+   KVM_DEV_ATTR_TYPE (ro, 32-bit): the device type passed to KVM_CREATE_DEVICE
+
+struct kvm_device_attr {
+	__u32	dev;		/* id from KVM_CREATE_DEVICE */
+	__u32	group;		/* KVM_DEV_ATTR_COMMON or device-defined */
+	__u64	attr;		/* group-defined */
+	__u64	addr;		/* userspace address of attr data */
+};
+
+4.81 KVM_HAS_DEVICE_ATTR
+
+Capability: KVM_CAP_DEVICE_CTRL
+Type: vm ioctl
+Parameters: struct kvm_device_attr
+Returns: 0 on success, -1 on error
+Errors:
+  ENODEV: The device id is invalid
+  ENXIO:  The group or attribute is unknown/unsupported for this device
+
+Tests whether a device supports a particular attribute.  A successful
+return indicates the attribute is implemented.  It does not necessarily
+indicate that the attribute can be read or written in the device's
+current state.  "addr" is ignored.
 
 5. The kvm_run structure
 ------------------------
diff --git a/Documentation/virtual/kvm/devices/README b/Documentation/virtual/kvm/devices/README
new file mode 100644
index 0000000..34a6983
--- /dev/null
+++ b/Documentation/virtual/kvm/devices/README
@@ -0,0 +1 @@ 
+This directory contains specific device bindings for KVM_CAP_DEVICE_CTRL.
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 0350e0d..dbaf012 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -335,6 +335,25 @@  struct kvm_memslots {
 	short id_to_index[KVM_MEM_SLOTS_NUM];
 };
 
+/*
+ * The worst case number of simultaneous devices will likely be very low
+ * (usually zero or one) for the forseeable future.  If the worst case
+ * exceeds this, then it can be increased, or we can convert to idr.
+ */
+#define KVM_MAX_DEVICES 4
+
+struct kvm_device {
+	u32 type;
+
+	int (*set_attr)(struct kvm *kvm, struct kvm_device *dev,
+			struct kvm_device_attr *attr);
+	int (*get_attr)(struct kvm *kvm, struct kvm_device *dev,
+			struct kvm_device_attr *attr);
+	int (*has_attr)(struct kvm *kvm, struct kvm_device *dev,
+			struct kvm_device_attr *attr);
+	void (*destroy)(struct kvm *kvm, struct kvm_device *dev);
+};
+
 struct kvm {
 	spinlock_t mmu_lock;
 	struct mutex slots_lock;
@@ -385,6 +404,8 @@  struct kvm {
 	long mmu_notifier_count;
 #endif
 	long tlbs_dirty;
+	struct kvm_device *devices[KVM_MAX_DEVICES];
+	unsigned int num_devices;
 };
 
 #define kvm_err(fmt, ...) \
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 9a2db57..1f348e0 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -662,6 +662,7 @@  struct kvm_ppc_smmu_info {
 #define KVM_CAP_PPC_HTAB_FD 84
 #define KVM_CAP_S390_CSS_SUPPORT 85
 #define KVM_CAP_PPC_EPR 86
+#define KVM_CAP_DEVICE_CTRL 87
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -890,6 +891,30 @@  struct kvm_s390_ucas_mapping {
 /* Available with KVM_CAP_PPC_HTAB_FD */
 #define KVM_PPC_GET_HTAB_FD	  _IOW(KVMIO,  0xaa, struct kvm_get_htab_fd)
 
+/* Available with KVM_CAP_DEVICE_CTRL */
+#define KVM_CREATE_DEVICE_TEST		1
+
+struct kvm_create_device {
+	__u32	type;	/* in: KVM_DEV_TYPE_xxx */
+	__u32	id;	/* out: device handle */
+	__u32	flags;	/* in: KVM_CREATE_DEVICE_xxx */
+};
+
+struct kvm_device_attr {
+	__u32	dev;		/* id from KVM_CREATE_DEVICE */
+	__u32	group;		/* KVM_DEV_ATTR_COMMON or device-defined */
+	__u64	attr;		/* group-defined */
+	__u64	addr;		/* userspace address of attr data */
+};
+
+#define KVM_DEV_ATTR_COMMON		0
+#define   KVM_DEV_ATTR_TYPE		0 /* 32-bit */
+
+#define KVM_CREATE_DEVICE	  _IOWR(KVMIO,  0xac, struct kvm_create_device)
+#define KVM_SET_DEVICE_ATTR	  _IOW(KVMIO,  0xad, struct kvm_device_attr)
+#define KVM_GET_DEVICE_ATTR	  _IOW(KVMIO,  0xae, struct kvm_device_attr)
+#define KVM_HAS_DEVICE_ATTR	  _IOW(KVMIO,  0xaf, struct kvm_device_attr)
+
 /*
  * ioctls for vcpu fds
  */
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 2e93630..baf8481 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -580,6 +580,18 @@  void kvm_free_physmem(struct kvm *kvm)
 	kfree(kvm->memslots);
 }
 
+static void kvm_destroy_devices(struct kvm *kvm)
+{
+	int i;
+
+	for (i = 0; i < kvm->num_devices; i++) {
+		kvm->devices[i]->destroy(kvm, kvm->devices[i]);
+		kvm->devices[i] = NULL;
+	}
+
+	kvm->num_devices = 0;
+}
+
 static void kvm_destroy_vm(struct kvm *kvm)
 {
 	int i;
@@ -590,6 +602,7 @@  static void kvm_destroy_vm(struct kvm *kvm)
 	list_del(&kvm->vm_list);
 	raw_spin_unlock(&kvm_lock);
 	kvm_free_irq_routing(kvm);
+	kvm_destroy_devices(kvm);
 	for (i = 0; i < KVM_NR_BUSES; i++)
 		kvm_io_bus_destroy(kvm->buses[i]);
 	kvm_coalesced_mmio_free(kvm);
@@ -2178,6 +2191,86 @@  out:
 }
 #endif
 
+static int kvm_ioctl_create_device(struct kvm *kvm,
+				   struct kvm_create_device *cd)
+{
+	struct kvm_device *dev = NULL;
+	bool test = cd->flags & KVM_CREATE_DEVICE_TEST;
+	int id;
+	int r;
+
+	mutex_lock(&kvm->lock);
+
+	id = kvm->num_devices;
+	if (id >= KVM_MAX_DEVICES && !test) {
+		r = -ENOSPC;
+		goto out;
+	}
+
+	switch (cd->type) {
+	default:
+		r = -ENODEV;
+		goto out;
+	}
+
+	if (test) {
+		WARN_ON_ONCE(dev);
+		goto out;
+	}
+
+	if (r == 0) {
+		WARN_ON_ONCE(dev->type != cd->type);
+
+		kvm->devices[id] = dev;
+		cd->id = id;
+		kvm->num_devices++;
+	}
+
+out:
+	mutex_unlock(&kvm->lock);
+	return r;
+}
+
+static int kvm_ioctl_device_attr(struct kvm *kvm, int ioctl,
+				 struct kvm_device_attr *attr)
+{
+	struct kvm_device *dev;
+	int (*accessor)(struct kvm *kvm, struct kvm_device *dev,
+			struct kvm_device_attr *attr);
+
+	if (attr->dev >= KVM_MAX_DEVICES)
+		return -ENODEV;
+
+	dev = kvm->devices[attr->dev];
+	if (!dev)
+		return -ENODEV;
+
+	switch (ioctl) {
+	case KVM_SET_DEVICE_ATTR:
+		if (attr->group == KVM_DEV_ATTR_COMMON &&
+		    attr->attr == KVM_DEV_ATTR_TYPE)
+			return -EPERM;
+
+		accessor = dev->set_attr;
+		break;
+	case KVM_GET_DEVICE_ATTR:
+		if (attr->group == KVM_DEV_ATTR_COMMON &&
+		    attr->attr == KVM_DEV_ATTR_TYPE)
+			return dev->type;
+
+		accessor = dev->get_attr;
+		break;
+	case KVM_HAS_DEVICE_ATTR:
+		accessor = dev->has_attr;
+		break;
+	}
+
+	if (!accessor)
+		return -EPERM;
+
+	return accessor(kvm, dev, attr);
+}
+
 static long kvm_vm_ioctl(struct file *filp,
 			   unsigned int ioctl, unsigned long arg)
 {
@@ -2292,6 +2385,40 @@  static long kvm_vm_ioctl(struct file *filp,
 		break;
 	}
 #endif
+	case KVM_CREATE_DEVICE: {
+		struct kvm_create_device cd;
+
+		r = -EFAULT;
+		if (copy_from_user(&cd, argp, sizeof(cd)))
+			goto out;
+
+		r = kvm_ioctl_create_device(kvm, &cd);
+		if (r)
+			goto out;
+
+		r = -EFAULT;
+		if (copy_to_user(argp, &cd, sizeof(cd)))
+			goto out;
+
+		r = 0;
+		break;
+	}
+	case KVM_SET_DEVICE_ATTR:
+	case KVM_GET_DEVICE_ATTR:
+	case KVM_HAS_DEVICE_ATTR: {
+		struct kvm_device_attr attr;
+
+		r = -EFAULT;
+		if (copy_from_user(&attr, argp, sizeof(attr)))
+			goto out;
+
+		r = kvm_ioctl_device_attr(kvm, ioctl, &attr);
+		if (r)
+			goto out;
+
+		r = 0;
+		break;
+	}
 	default:
 		r = kvm_arch_vm_ioctl(filp, ioctl, arg);
 		if (r == -ENOTTY)