Message ID: 1355383780-1367-3-git-send-email-yang.z.zhang@intel.com (mailing list archive)
State: New, archived
On Thu, Dec 13, 2012 at 03:29:40PM +0800, Yang Zhang wrote:
> From: Yang Zhang <yang.z.zhang@Intel.com>
>
> Posted Interrupt allows APIC interrupts to inject into guest directly
> without any vmexit.
>
> - When delivering a interrupt to guest, if target vcpu is running,
> update Posted-interrupt requests bitmap and send a notification event
> to the vcpu. Then the vcpu will handle this interrupt automatically,
> without any software involvemnt.
>
> - If target vcpu is not running or there already a notification event
> pending in the vcpu, do nothing. The interrupt will be handled by
> next vm entry.
>
> Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
> ---

<snip>

> +static void pi_handler(void)
> +{
> +	;
> +}
> +
> +static int vmx_has_posted_interrupt(struct kvm_vcpu *vcpu)
> +{
> +	return irqchip_in_kernel(vcpu->kvm) && enable_apicv_pi;
> +}
> +
> +static int vmx_send_nv(struct kvm_vcpu *vcpu,
> +		int vector)
> +{
> +	struct vcpu_vmx *vmx = to_vmx(vcpu);
> +
> +	pi_set_pir(vector, vmx->pi);

Section 29.6 "Posted interrupt processing":

"No other agent can read or write a PIR bit (or groups of bits) between
the time it is read (to determine what to OR into VIRR) and when it is
cleared".

> +	if (!pi_test_and_set_on(vmx->pi) && (vcpu->mode == IN_GUEST_MODE)) {
> +		apic->send_IPI_mask(get_cpu_mask(vcpu->cpu), POSTED_INTR_VECTOR);
> +		return 1;
> +	}
> +	return 0;

What is the purpose of the outstanding-notification bit? At first sight, its
use as a "lock" for the PIR posted-interrupt bits is limited, because it is
cleared in step 3, before PIR is cleared. If it were cleared after step 5,
then software could do

	if (!pi_test_and_set_on(vmx->pi)) {
		pi_set_pir(vector, vmx->pi);
		apic->send_IPI_mask(get_cpu_mask(vcpu->cpu), POSTED_INTR_VECTOR);
	}

Does this mean software has to read PIR _and_ the outstanding-notification
bit to know when it is possible to set bits in PIR and send the IPI?

Or is it really cleared after step 5?
-- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Marcelo Tosatti wrote on 2013-01-23:
> On Thu, Dec 13, 2012 at 03:29:40PM +0800, Yang Zhang wrote:
>> From: Yang Zhang <yang.z.zhang@Intel.com>
>>
>> Posted Interrupt allows APIC interrupts to inject into guest directly
>> without any vmexit.
>>
>> - When delivering a interrupt to guest, if target vcpu is running,
>> update Posted-interrupt requests bitmap and send a notification event
>> to the vcpu. Then the vcpu will handle this interrupt automatically,
>> without any software involvemnt.
>> - If target vcpu is not running or there already a notification event
>> pending in the vcpu, do nothing. The interrupt will be handled by
>> next vm entry.
>> Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
>> ---
>
> <snip>
>
>> +static int vmx_send_nv(struct kvm_vcpu *vcpu,
>> +		int vector)
>> +{
>> +	struct vcpu_vmx *vmx = to_vmx(vcpu);
>> +
>> +	pi_set_pir(vector, vmx->pi);
>
> Section 29.6 "Posted interrupt processing":
>
> "No other agent can read or write a PIR bit (or groups of bits) between
> the time it is read (to determine what to OR into VIRR) and when it is
> cleared".

This means hardware ensures the read and clear operation is atomic, for
example by using a locked cmpxchg.

>> +	if (!pi_test_and_set_on(vmx->pi) && (vcpu->mode == IN_GUEST_MODE)) {
>> +		apic->send_IPI_mask(get_cpu_mask(vcpu->cpu), POSTED_INTR_VECTOR);
>> +		return 1;
>> +	}
>> +	return 0;
>
> What is the purpose of the outstanding-notification bit? At first sight, its
> use as a "lock" for the PIR posted-interrupt bits is limited, because it is
> cleared in step 3, before PIR is cleared. If it were cleared after step 5,
> then software could do
>
>	if (!pi_test_and_set_on(vmx->pi)) {
>		pi_set_pir(vector, vmx->pi);
>		apic->send_IPI_mask(get_cpu_mask(vcpu->cpu), POSTED_INTR_VECTOR);
>	}
>
> Does this mean software has to read PIR _and_ the outstanding-notification
> bit to know when it is possible to set bits in PIR and send the IPI?

There is no limitation on software setting bits in PIR; software can set
PIR unconditionally with a locked operation. Software must read the ON bit
only to decide whether the IPI is needed: if the ON bit is set, a
notification event was already sent but not yet acked by the target cpu,
so there is no need to resend it.

> Or is it really cleared after step 5?

No, it is cleared in step 3.

Best regards,
Yang
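The posting protocol described above (set the PIR bit unconditionally with a locked operation, then test-and-set ON, and send the notification IPI only when ON was previously clear) can be sketched with C11 atomics. This is a minimal illustrative model, not the patch's code: the `pi_desc` layout, the `on` word, and `post_interrupt()` are assumed names.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the posted-interrupt descriptor: a 256-bit
 * PIR (one bit per vector) plus a word whose bit 0 models the
 * outstanding-notification (ON) bit. */
struct pi_desc {
	_Atomic uint64_t pir[4];
	_Atomic uint64_t on;
};

/* Locked OR into PIR; per the SDM 29.6 guarantee quoted above, this is
 * safe against the CPU's atomic read-and-clear of PIR. */
static void pi_set_pir(int vector, struct pi_desc *pi)
{
	atomic_fetch_or(&pi->pir[vector / 64], 1ULL << (vector % 64));
}

/* Returns the previous ON value; only the caller that flips ON from
 * 0 to 1 is responsible for sending the notification IPI. */
static bool pi_test_and_set_on(struct pi_desc *pi)
{
	return atomic_fetch_or(&pi->on, 1ULL) & 1ULL;
}

/* Posting order from the thread: PIR bit first, then ON.  Returns true
 * when the caller should send the POSTED_INTR_VECTOR IPI. */
static bool post_interrupt(struct pi_desc *pi, int vector, bool in_guest_mode)
{
	pi_set_pir(vector, pi);
	return !pi_test_and_set_on(pi) && in_guest_mode;
}
```

Because the PIR bit is set before the ON test, a second poster that loses the ON race still has its vector visible in PIR when the single notification is processed.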
On Thu, Dec 13, 2012 at 03:29:40PM +0800, Yang Zhang wrote:
> From: Yang Zhang <yang.z.zhang@Intel.com>
>
> Posted Interrupt allows APIC interrupts to inject into guest directly
> without any vmexit.
>
> - When delivering a interrupt to guest, if target vcpu is running,
> update Posted-interrupt requests bitmap and send a notification event
> to the vcpu. Then the vcpu will handle this interrupt automatically,
> without any software involvemnt.
>
> - If target vcpu is not running or there already a notification event
> pending in the vcpu, do nothing. The interrupt will be handled by
> next vm entry.
>
> Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
> ---
>  arch/x86/include/asm/entry_arch.h  |    1 +
>  arch/x86/include/asm/hw_irq.h      |    1 +
>  arch/x86/include/asm/irq.h         |    1 +
>  arch/x86/include/asm/irq_vectors.h |    4 +
>  arch/x86/include/asm/kvm_host.h    |    3 +
>  arch/x86/include/asm/vmx.h         |    4 +
>  arch/x86/kernel/entry_64.S         |    2 +
>  arch/x86/kernel/irq.c              |   25 +++++++
>  arch/x86/kernel/irqinit.c          |    2 +
>  arch/x86/kvm/lapic.c               |   16 +++-
>  arch/x86/kvm/lapic.h               |    1 +
>  arch/x86/kvm/vmx.c                 |  133 +++++++++++++++++++++++++++++++++---
>  12 files changed, 180 insertions(+), 13 deletions(-)
>
> diff --git a/arch/x86/include/asm/entry_arch.h b/arch/x86/include/asm/entry_arch.h
> index 40afa00..7b0a29e 100644
> --- a/arch/x86/include/asm/entry_arch.h
> +++ b/arch/x86/include/asm/entry_arch.h
> @@ -18,6 +18,7 @@ BUILD_INTERRUPT(reboot_interrupt,REBOOT_VECTOR)
>  #endif
>
>  BUILD_INTERRUPT(x86_platform_ipi, X86_PLATFORM_IPI_VECTOR)
> +BUILD_INTERRUPT(posted_intr_ipi, POSTED_INTR_VECTOR)
>
>  /*
>   * every pentium local APIC has two 'local interrupts', with a
> diff --git a/arch/x86/include/asm/hw_irq.h b/arch/x86/include/asm/hw_irq.h
> index eb92a6e..ee61af3 100644
> --- a/arch/x86/include/asm/hw_irq.h
> +++ b/arch/x86/include/asm/hw_irq.h
> @@ -28,6 +28,7 @@
>  /* Interrupt handlers registered during init_IRQ */
>  extern void apic_timer_interrupt(void);
>  extern void x86_platform_ipi(void);
> +extern void posted_intr_ipi(void);
>  extern void error_interrupt(void);
>  extern void irq_work_interrupt(void);
>
> diff --git a/arch/x86/include/asm/irq.h b/arch/x86/include/asm/irq.h
> index ba870bb..cff9933 100644
> --- a/arch/x86/include/asm/irq.h
> +++ b/arch/x86/include/asm/irq.h
> @@ -30,6 +30,7 @@ extern void irq_force_complete_move(int);
>  #endif
>
>  extern void (*x86_platform_ipi_callback)(void);
> +extern void (*posted_intr_callback)(void);
>  extern void native_init_IRQ(void);
>  extern bool handle_irq(unsigned irq, struct pt_regs *regs);
>
> diff --git a/arch/x86/include/asm/irq_vectors.h b/arch/x86/include/asm/irq_vectors.h
> index 1508e51..8f2e383 100644
> --- a/arch/x86/include/asm/irq_vectors.h
> +++ b/arch/x86/include/asm/irq_vectors.h
> @@ -102,6 +102,10 @@
>   */
>  #define X86_PLATFORM_IPI_VECTOR		0xf7
>
> +#ifdef CONFIG_HAVE_KVM
> +#define POSTED_INTR_VECTOR		0xf2
> +#endif
> +
>  /*
>   * IRQ work vector:
>   */
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 7e26d1a..82423a8 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -700,6 +700,9 @@ struct kvm_x86_ops {
>  	int (*has_virtual_interrupt_delivery)(struct kvm_vcpu *vcpu);
>  	void (*update_irq)(struct kvm_vcpu *vcpu);
>  	void (*update_eoi_exitmap)(struct kvm_vcpu *vcpu, int vector, bool set);
> +	int (*has_posted_interrupt)(struct kvm_vcpu *vcpu);
> +	int (*send_nv)(struct kvm_vcpu *vcpu, int vector);
> +	void (*update_irr)(struct kvm_vcpu *vcpu);
>  	int (*set_tss_addr)(struct kvm *kvm, unsigned int addr);
>  	int (*get_tdp_level)(void);
>  	u64 (*get_mt_mask)(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio);
> diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
> index 1003341..7b9e1d0 100644
> --- a/arch/x86/include/asm/vmx.h
> +++ b/arch/x86/include/asm/vmx.h
> @@ -152,6 +152,7 @@
>  #define PIN_BASED_EXT_INTR_MASK		0x00000001
>  #define PIN_BASED_NMI_EXITING		0x00000008
>  #define PIN_BASED_VIRTUAL_NMIS		0x00000020
> +#define PIN_BASED_POSTED_INTR		0x00000080
>
>  #define VM_EXIT_SAVE_DEBUG_CONTROLS	0x00000002
>  #define VM_EXIT_HOST_ADDR_SPACE_SIZE	0x00000200
> @@ -174,6 +175,7 @@
>  /* VMCS Encodings */
>  enum vmcs_field {
>  	VIRTUAL_PROCESSOR_ID            = 0x00000000,
> +	POSTED_INTR_NV                  = 0x00000002,
>  	GUEST_ES_SELECTOR               = 0x00000800,
>  	GUEST_CS_SELECTOR               = 0x00000802,
>  	GUEST_SS_SELECTOR               = 0x00000804,
> @@ -208,6 +210,8 @@ enum vmcs_field {
>  	VIRTUAL_APIC_PAGE_ADDR_HIGH     = 0x00002013,
>  	APIC_ACCESS_ADDR                = 0x00002014,
>  	APIC_ACCESS_ADDR_HIGH           = 0x00002015,
> +	POSTED_INTR_DESC_ADDR           = 0x00002016,
> +	POSTED_INTR_DESC_ADDR_HIGH      = 0x00002017,
>  	EPT_POINTER                     = 0x0000201a,
>  	EPT_POINTER_HIGH                = 0x0000201b,
>  	EOI_EXIT_BITMAP0                = 0x0000201c,
> diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
> index b51b2c7..d06eea1 100644
> --- a/arch/x86/kernel/entry_64.S
> +++ b/arch/x86/kernel/entry_64.S
> @@ -1160,6 +1160,8 @@ apicinterrupt LOCAL_TIMER_VECTOR \
>  	apic_timer_interrupt smp_apic_timer_interrupt
>  apicinterrupt X86_PLATFORM_IPI_VECTOR \
>  	x86_platform_ipi smp_x86_platform_ipi
> +apicinterrupt POSTED_INTR_VECTOR \
> +	posted_intr_ipi smp_posted_intr_ipi
>
>  apicinterrupt THRESHOLD_APIC_VECTOR \
>  	threshold_interrupt smp_threshold_interrupt
> diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
> index e4595f1..781d324 100644
> --- a/arch/x86/kernel/irq.c
> +++ b/arch/x86/kernel/irq.c
> @@ -22,6 +22,9 @@ atomic_t irq_err_count;
>
>  /* Function pointer for generic interrupt vector handling */
>  void (*x86_platform_ipi_callback)(void) = NULL;
> +/* Function pointer for posted interrupt vector handling */
> +void (*posted_intr_callback)(void) = NULL;
> +EXPORT_SYMBOL_GPL(posted_intr_callback);
>
>  /*
>   * 'what should we do if we get a hw irq event on an illegal vector'.
> @@ -228,6 +231,28 @@ void smp_x86_platform_ipi(struct pt_regs *regs)
>  	set_irq_regs(old_regs);
>  }
>
> +/*
> + * Handler for POSTED_INTERRUPT_VECTOR.
> + */
> +void smp_posted_intr_ipi(struct pt_regs *regs)
> +{
> +	struct pt_regs *old_regs = set_irq_regs(regs);
> +
> +	ack_APIC_irq();
> +
> +	irq_enter();
> +
> +	exit_idle();
> +
> +	if (posted_intr_callback)
> +		posted_intr_callback();
> +
> +	irq_exit();
> +
> +	set_irq_regs(old_regs);
> +}
> +
> +
>  EXPORT_SYMBOL_GPL(vector_used_by_percpu_irq);
>
>  #ifdef CONFIG_HOTPLUG_CPU
> diff --git a/arch/x86/kernel/irqinit.c b/arch/x86/kernel/irqinit.c
> index 6e03b0d..d15ca4f 100644
> --- a/arch/x86/kernel/irqinit.c
> +++ b/arch/x86/kernel/irqinit.c
> @@ -205,6 +205,8 @@ static void __init apic_intr_init(void)
>
>  	/* IPI for X86 platform specific use */
>  	alloc_intr_gate(X86_PLATFORM_IPI_VECTOR, x86_platform_ipi);
> +	/* IPI for posted interrupt use */
> +	alloc_intr_gate(POSTED_INTR_VECTOR, posted_intr_ipi);
>
>  	/* IPI vectors for APIC spurious and error interrupts */
>  	alloc_intr_gate(SPURIOUS_APIC_VECTOR, spurious_interrupt);
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index 2109a6a..d660b9d 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -350,6 +350,7 @@ static inline int apic_find_highest_irr(struct kvm_lapic *apic)
>  	if (!apic->irr_pending)
>  		return -1;
>
> +	kvm_x86_ops->update_irr(apic->vcpu);
>  	result = apic_search_irr(apic);
>  	ASSERT(result == -1 || result >= 16);
>
> @@ -725,18 +726,25 @@ static int __apic_accept_irq(struct kvm_lapic *apic, int delivery_mode,
>  	if (trig_mode) {
>  		apic_debug("level trig mode for vector %d", vector);
>  		apic_set_vector(vector, apic->regs + APIC_TMR);
> -	} else
> +	} else {
>  		apic_clear_vector(vector, apic->regs + APIC_TMR);
> -
> +		if (kvm_x86_ops->has_posted_interrupt(vcpu)) {
> +			result = 1;
> +			apic->irr_pending = true;
> +			kvm_x86_ops->send_nv(vcpu, vector);
> +			goto out;
> +		}

Hi,

Are steps 4, 5 and 6 of section 29.6 executed in both VMX root and non-root
modes, or only in non-root mode?

If only in non-root mode, there is a problem if the target vcpu<->pcpu
vm-exits before receiving and acking the interrupt. In that case the PIR
set bits are not transferred to VIRR.

It would be necessary to read the notification bit on VM-exit and, if set,
do the PIR->VIRR transfer in software. The downside is the lack of an atomic
(VIRR |= PIR; PIR = 0) in software. So it would require synchronization
with KVM APIC injection (which ATM relies on atomic test_and_set of IRR).
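On the atomicity point raised above: while there is indeed no single atomic (VIRR |= PIR; PIR = 0), software can drain PIR one 64-bit word at a time with an atomic exchange, so a concurrently posted bit is either captured by the exchange or left in PIR for the next sync, never lost. A hedged sketch with hypothetical names; the real code would OR into the IRR registers of the virtual-APIC page, and the coordination with KVM's own test_and_set-based IRR injection that the mail mentions remains a separate problem:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Drain a 256-bit PIR into a software VIRR image.  atomic_exchange
 * atomically reads and clears each 64-bit chunk, so bits set by a
 * concurrent poster on another CPU are never dropped: they land either
 * before the exchange (ORed into VIRR now) or after it (picked up at
 * the next sync). */
static void sync_pir_to_virr(_Atomic uint64_t pir[4], uint64_t virr[4])
{
	for (int i = 0; i < 4; i++)
		virr[i] |= atomic_exchange(&pir[i], 0);
}
```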
Marcelo Tosatti wrote on 2013-01-25:
> On Thu, Dec 13, 2012 at 03:29:40PM +0800, Yang Zhang wrote:
>> From: Yang Zhang <yang.z.zhang@Intel.com>
>>
>> Posted Interrupt allows APIC interrupts to inject into guest directly
>> without any vmexit.
>>
>> - When delivering a interrupt to guest, if target vcpu is running,
>> update Posted-interrupt requests bitmap and send a notification event
>> to the vcpu. Then the vcpu will handle this interrupt automatically,
>> without any software involvemnt.
>> - If target vcpu is not running or there already a notification event
>> pending in the vcpu, do nothing. The interrupt will be handled by
>> next vm entry.
>> Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
>> ---
>
> <snip quoted patch>
>
>> @@ -725,18 +726,25 @@ static int __apic_accept_irq(struct kvm_lapic *apic, int delivery_mode,
>> 	if (trig_mode) {
>> 		apic_debug("level trig mode for vector %d", vector);
>> 		apic_set_vector(vector, apic->regs + APIC_TMR);
>> -	} else
>> +	} else {
>> 		apic_clear_vector(vector, apic->regs + APIC_TMR);
>> -
>> +		if (kvm_x86_ops->has_posted_interrupt(vcpu)) {
>> +			result = 1;
>> +			apic->irr_pending = true;
>> +			kvm_x86_ops->send_nv(vcpu, vector);
>> +			goto out;
>> +		}

> Hi,
>
> Are steps 4, 5 and 6 of section 29.6 executed in both VMX root and non-root
> modes, or only in non-root mode?

The SDM doesn't say. But software does not need to know this.

> If only in non-root mode, there is a problem if the target vcpu<->pcpu
> vm-exits before receiving and acking the interrupt. In that case the PIR
> set bits are not transferred to VIRR.
>
> It would be necessary to read the notification bit on VM-exit and, if set,
> do the PIR->VIRR transfer in software. The downside is the lack of an atomic
> (VIRR |= PIR; PIR = 0) in software. So it would require synchronization
> with KVM APIC injection (which ATM relies on atomic test_and_set of IRR).

In the current implementation, PIR is synced to VIRR before each vmentry.

Best regards,
Yang
On Fri, Jan 25, 2013 at 12:40:21AM +0000, Zhang, Yang Z wrote:
> Marcelo Tosatti wrote on 2013-01-25:
> > On Thu, Dec 13, 2012 at 03:29:40PM +0800, Yang Zhang wrote:
> > >> From: Yang Zhang <yang.z.zhang@Intel.com>
> > >>
> > >> Posted Interrupt allows APIC interrupts to inject into guest directly
> > >> without any vmexit.
> > >>
> > >> - When delivering a interrupt to guest, if target vcpu is running,
> > >> update Posted-interrupt requests bitmap and send a notification event
> > >> to the vcpu. Then the vcpu will handle this interrupt automatically,
> > >> without any software involvemnt.
> > >> - If target vcpu is not running or there already a notification event
> > >> pending in the vcpu, do nothing. The interrupt will be handled by
> > >> next vm entry.
> > >> Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
> > >> ---
> > >
> > > <snip quoted patch>
> > >
> > >> @@ -725,18 +726,25 @@ static int __apic_accept_irq(struct
kvm_lapic *apic, int delivery_mode,
> > >> 	if (trig_mode) {
> > >> 		apic_debug("level trig mode for vector %d", vector);
> > >> 		apic_set_vector(vector, apic->regs + APIC_TMR);
> > >> -	} else
> > >> +	} else {
> > >> 		apic_clear_vector(vector, apic->regs + APIC_TMR);
> > >> -
> > >> +		if (kvm_x86_ops->has_posted_interrupt(vcpu)) {
> > >> +			result = 1;
> > >> +			apic->irr_pending = true;
> > >> +			kvm_x86_ops->send_nv(vcpu, vector);
> > >> +			goto out;
> > >> +		}
> > >
> > > Hi,
> > >
> > > Steps 4, 5 and 6 of section 29.6 are executed in both VMX root/non-root
> > > modes, or only non-root mode?
> > SDM doesn't tell. But we don't need know this in software level.
> >
> > > If only non-root mode, there is a problem if target vcpu<->pcpu vm-exits
> > > before receiving and acking the interrupt. In that case PIR set bits are
> > > not transferred to VIRR.
> > >
> > > It would be necessary to read notification bit on VM-exit and, if set,
> > > do PIR->VIRR transfer in software. The downside, is lack of an atomic
> > > (VIRR |= PIR; PIR = 0) in software. So it would require synchronization
> > > to KVM APIC injection (which ATM relies on atomic test_and_set of IRR).
> > In current implementation, it will sync PIR to VIRR before vmentry.

Some comments:

Enable ack-on-exit feature patch:

1) The value of register VM_EXIT_INTR_INFO is available at
vmx->exit_intr_info. See commit 887864758580c80710947c38a4692032163777df.

Posted interrupt patch:

2) The IN_GUEST_MODE assignment must be moved after local_irq_disable, in
the vcpu_enter_guest function. Otherwise:

	cpu0				vcpu1<->cpu1

					vcpu->mode = IN_GUEST_MODE
	if IN_GUEST_MODE == true
		send IPI
					local_irq_disable

	PIR not transferred to VIRR, misses interrupt.

3) The outstanding PIR notification bit must be checked unconditionally on
every VM-entry, because:

	1. local_irq_disable
	2. vcpu->mode = IN_GUEST_MODE
	3. vmenter
	4. vmexit
	5. vcpu->mode = OUTSIDE_GUEST_MODE

If the PIR notification IPI is sent between an event which triggers a
VM-exit (for example, an external interrupt due to a device) and step 5
(the assignment of vcpu->mode), the PIR->VIRR transfer before vmentry must
be made.

4) Today, an interrupt notification is cached in IRR until it is delivered;
further interrupt injection does not generate further interrupt
notification bits. With PIR, the behaviour changes: it is possible to have
one bit in PIR and another in the IRR APIC page (if the timing is right).
Is this harmless? Why?
On Wed, Jan 30, 2013 at 09:03:11PM -0200, Marcelo Tosatti wrote: > On Fri, Jan 25, 2013 at 12:40:21AM +0000, Zhang, Yang Z wrote: > > Marcelo Tosatti wrote on 2013-01-25: > > > On Thu, Dec 13, 2012 at 03:29:40PM +0800, Yang Zhang wrote: > > >> From: Yang Zhang <yang.z.zhang@Intel.com> > > >> > > >> Posted Interrupt allows APIC interrupts to inject into guest directly > > >> without any vmexit. > > >> > > >> - When delivering a interrupt to guest, if target vcpu is running, > > >> update Posted-interrupt requests bitmap and send a notification event > > >> to the vcpu. Then the vcpu will handle this interrupt automatically, > > >> without any software involvemnt. > > >> - If target vcpu is not running or there already a notification event > > >> pending in the vcpu, do nothing. The interrupt will be handled by > > >> next vm entry. > > >> Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com> > > >> --- > > >> arch/x86/include/asm/entry_arch.h | 1 + > > >> arch/x86/include/asm/hw_irq.h | 1 + arch/x86/include/asm/irq.h > > >> | 1 + arch/x86/include/asm/irq_vectors.h | 4 + > > >> arch/x86/include/asm/kvm_host.h | 3 + arch/x86/include/asm/vmx.h > > >> | 4 + arch/x86/kernel/entry_64.S | 2 + > > >> arch/x86/kernel/irq.c | 25 +++++++ > > >> arch/x86/kernel/irqinit.c | 2 + arch/x86/kvm/lapic.c > > >> | 16 +++- arch/x86/kvm/lapic.h | 1 + > > >> arch/x86/kvm/vmx.c | 133 > > >> +++++++++++++++++++++++++++++++++--- 12 files changed, 180 > > >> insertions(+), 13 deletions(-) > > >> diff --git a/arch/x86/include/asm/entry_arch.h > > >> b/arch/x86/include/asm/entry_arch.h index 40afa00..7b0a29e 100644 --- > > >> a/arch/x86/include/asm/entry_arch.h +++ > > >> b/arch/x86/include/asm/entry_arch.h @@ -18,6 +18,7 @@ > > >> BUILD_INTERRUPT(reboot_interrupt,REBOOT_VECTOR) > > >> #endif > > >> > > >> BUILD_INTERRUPT(x86_platform_ipi, X86_PLATFORM_IPI_VECTOR) > > >> +BUILD_INTERRUPT(posted_intr_ipi, POSTED_INTR_VECTOR) > > >> > > >> /* > > >> * every pentium local APIC has two 'local 
interrupts', with a > > >> diff --git a/arch/x86/include/asm/hw_irq.h b/arch/x86/include/asm/hw_irq.h > > >> index eb92a6e..ee61af3 100644 > > >> --- a/arch/x86/include/asm/hw_irq.h > > >> +++ b/arch/x86/include/asm/hw_irq.h > > >> @@ -28,6 +28,7 @@ > > >> /* Interrupt handlers registered during init_IRQ */ extern void > > >> apic_timer_interrupt(void); extern void x86_platform_ipi(void); > > >> +extern void posted_intr_ipi(void); extern void error_interrupt(void); > > >> extern void irq_work_interrupt(void); > > >> diff --git a/arch/x86/include/asm/irq.h b/arch/x86/include/asm/irq.h > > >> index ba870bb..cff9933 100644 > > >> --- a/arch/x86/include/asm/irq.h > > >> +++ b/arch/x86/include/asm/irq.h > > >> @@ -30,6 +30,7 @@ extern void irq_force_complete_move(int); > > >> #endif > > >> > > >> extern void (*x86_platform_ipi_callback)(void); +extern void > > >> (*posted_intr_callback)(void); extern void native_init_IRQ(void); > > >> extern bool handle_irq(unsigned irq, struct pt_regs *regs); > > >> diff --git a/arch/x86/include/asm/irq_vectors.h > > >> b/arch/x86/include/asm/irq_vectors.h index 1508e51..8f2e383 100644 --- > > >> a/arch/x86/include/asm/irq_vectors.h +++ > > >> b/arch/x86/include/asm/irq_vectors.h @@ -102,6 +102,10 @@ > > >> */ > > >> #define X86_PLATFORM_IPI_VECTOR 0xf7 > > >> +#ifdef CONFIG_HAVE_KVM > > >> +#define POSTED_INTR_VECTOR 0xf2 > > >> +#endif > > >> + > > >> /* > > >> * IRQ work vector: > > >> */ > > >> diff --git a/arch/x86/include/asm/kvm_host.h > > >> b/arch/x86/include/asm/kvm_host.h index 7e26d1a..82423a8 100644 --- > > >> a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h > > >> @@ -700,6 +700,9 @@ struct kvm_x86_ops { > > >> int (*has_virtual_interrupt_delivery)(struct kvm_vcpu *vcpu); > > >> void (*update_irq)(struct kvm_vcpu *vcpu); > > >> void (*update_eoi_exitmap)(struct kvm_vcpu *vcpu, int vector, bool set); > > >> + int (*has_posted_interrupt)(struct kvm_vcpu *vcpu); > > >> + int (*send_nv)(struct kvm_vcpu 
*vcpu, int vector); > > >> + void (*update_irr)(struct kvm_vcpu *vcpu); > > >> int (*set_tss_addr)(struct kvm *kvm, unsigned int addr); > > >> int (*get_tdp_level)(void); > > >> u64 (*get_mt_mask)(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio); > > >> diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h > > >> index 1003341..7b9e1d0 100644 > > >> --- a/arch/x86/include/asm/vmx.h > > >> +++ b/arch/x86/include/asm/vmx.h > > >> @@ -152,6 +152,7 @@ > > >> #define PIN_BASED_EXT_INTR_MASK 0x00000001 > > >> #define PIN_BASED_NMI_EXITING 0x00000008 > > >> #define PIN_BASED_VIRTUAL_NMIS 0x00000020 > > >> +#define PIN_BASED_POSTED_INTR 0x00000080 > > >> > > >> #define VM_EXIT_SAVE_DEBUG_CONTROLS 0x00000002 #define > > >> VM_EXIT_HOST_ADDR_SPACE_SIZE 0x00000200 @@ -174,6 +175,7 @@ > > >> /* VMCS Encodings */ enum vmcs_field { VIRTUAL_PROCESSOR_ID > > >> = 0x00000000, + POSTED_INTR_NV = 0x00000002, > > >> GUEST_ES_SELECTOR = 0x00000800, GUEST_CS_SELECTOR > > >> = 0x00000802, GUEST_SS_SELECTOR = 0x00000804, > > >> @@ -208,6 +210,8 @@ enum vmcs_field { VIRTUAL_APIC_PAGE_ADDR_HIGH > > >> = 0x00002013, APIC_ACCESS_ADDR = 0x00002014, > > >> APIC_ACCESS_ADDR_HIGH = 0x00002015, > > >> + POSTED_INTR_DESC_ADDR = 0x00002016, > > >> + POSTED_INTR_DESC_ADDR_HIGH = 0x00002017, > > >> EPT_POINTER = 0x0000201a, > > >> EPT_POINTER_HIGH = 0x0000201b, > > >> EOI_EXIT_BITMAP0 = 0x0000201c, > > >> diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S > > >> index b51b2c7..d06eea1 100644 > > >> --- a/arch/x86/kernel/entry_64.S > > >> +++ b/arch/x86/kernel/entry_64.S > > >> @@ -1160,6 +1160,8 @@ apicinterrupt LOCAL_TIMER_VECTOR \ > > >> apic_timer_interrupt smp_apic_timer_interrupt > > >> apicinterrupt X86_PLATFORM_IPI_VECTOR \ > > >> x86_platform_ipi smp_x86_platform_ipi > > >> +apicinterrupt POSTED_INTR_VECTOR \ > > >> + posted_intr_ipi smp_posted_intr_ipi > > >> > > >> apicinterrupt THRESHOLD_APIC_VECTOR \ > > >> threshold_interrupt smp_threshold_interrupt > > >> 
diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c > > >> index e4595f1..781d324 100644 > > >> --- a/arch/x86/kernel/irq.c > > >> +++ b/arch/x86/kernel/irq.c > > >> @@ -22,6 +22,9 @@ atomic_t irq_err_count; > > >> > > >> /* Function pointer for generic interrupt vector handling */ > > >> void (*x86_platform_ipi_callback)(void) = NULL; > > >> +/* Function pointer for posted interrupt vector handling */ > > >> +void (*posted_intr_callback)(void) = NULL; > > >> +EXPORT_SYMBOL_GPL(posted_intr_callback); > > >> > > >> /* > > >> * 'what should we do if we get a hw irq event on an illegal vector'. > > >> @@ -228,6 +231,28 @@ void smp_x86_platform_ipi(struct pt_regs *regs) > > >> set_irq_regs(old_regs); > > >> } > > >> +/* > > >> + * Handler for POSTED_INTERRUPT_VECTOR. > > >> + */ > > >> +void smp_posted_intr_ipi(struct pt_regs *regs) > > >> +{ > > >> + struct pt_regs *old_regs = set_irq_regs(regs); > > >> + > > >> + ack_APIC_irq(); > > >> + > > >> + irq_enter(); > > >> + > > >> + exit_idle(); > > >> + > > >> + if (posted_intr_callback) > > >> + posted_intr_callback(); > > >> + > > >> + irq_exit(); > > >> + > > >> + set_irq_regs(old_regs); > > >> +} > > >> + > > >> + > > >> EXPORT_SYMBOL_GPL(vector_used_by_percpu_irq); > > >> > > >> #ifdef CONFIG_HOTPLUG_CPU > > >> diff --git a/arch/x86/kernel/irqinit.c b/arch/x86/kernel/irqinit.c > > >> index 6e03b0d..d15ca4f 100644 > > >> --- a/arch/x86/kernel/irqinit.c > > >> +++ b/arch/x86/kernel/irqinit.c > > >> @@ -205,6 +205,8 @@ static void __init apic_intr_init(void) > > >> > > >> /* IPI for X86 platform specific use */ > > >> alloc_intr_gate(X86_PLATFORM_IPI_VECTOR, x86_platform_ipi); > > >> + /* IPI for posted interrupt use */ > > >> + alloc_intr_gate(POSTED_INTR_VECTOR, posted_intr_ipi); > > >> > > >> /* IPI vectors for APIC spurious and error interrupts */ > > >> alloc_intr_gate(SPURIOUS_APIC_VECTOR, spurious_interrupt); > > >> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c > > >> index 2109a6a..d660b9d 
100644 > > >> --- a/arch/x86/kvm/lapic.c > > >> +++ b/arch/x86/kvm/lapic.c > > >> @@ -350,6 +350,7 @@ static inline int apic_find_highest_irr(struct kvm_lapic > > > *apic) > > >> if (!apic->irr_pending) > > >> return -1; > > >> + kvm_x86_ops->update_irr(apic->vcpu); > > >> result = apic_search_irr(apic); > > >> ASSERT(result == -1 || result >= 16); > > >> @@ -725,18 +726,25 @@ static int __apic_accept_irq(struct kvm_lapic *apic, int > > > delivery_mode, > > >> if (trig_mode) { > > >> apic_debug("level trig mode for vector %d", vector); > > >> apic_set_vector(vector, apic->regs + APIC_TMR); > > >> - } else > > >> + } else { > > >> apic_clear_vector(vector, apic->regs + APIC_TMR); > > >> - > > >> + if (kvm_x86_ops->has_posted_interrupt(vcpu)) { > > >> + result = 1; > > >> + apic->irr_pending = true; > > >> + kvm_x86_ops->send_nv(vcpu, vector); > > >> + goto out; > > >> + } > > > > > > Hi, > > > > > > Steps 4, 5 and 6 of section 29.6 are executed in both VMX root/non-root > > > modes, or only non-root mode? > > SDM doesn't tell. But we don't need know this in software level. > > > > > > > > > > > If only non-root mode, there is a problem if target vcpu<->pcpu vm-exits > > > before receiving and acking the interrupt. In that case PIR set bits are > > > not transferred to VIRR. > > > > > > It would be necessary to read notification bit on VM-exit and, if set, > > > do PIR->VIRR transfer in software. The downside, is lack of an atomic > > In current implementation, it will sync PIR to VIRR before vmentry. > > > > > (VIRR |= PIR; PIR = 0) in software. So it would require synchronization > > > to KVM APIC injection (which ATM relies on atomic test_and_set of IRR). > > Some comments: > > Enable ack-on-exit feature patch: > 1) Value of register VM_EXIT_INTR_INFO is available at > vmx->exit_intr_info. See commit > 887864758580c80710947c38a4692032163777df. 
> > Posted interrupt patch: > 2) Must move IN_GUEST_MODE assignment after local_irq_disable, in > vcpu_enter_guest function. Otherwise: > > cpu0 vcpu1<->cpu1 > > vcpu->mode = IN_GUEST_MODE > > if IN_GUEST_MODE == true > send IPI > local_irq_disable > > PIR not transferred to VIRR, misses interrupt. > > 3) Must check outstanding PIR notification bit unconditionally on > every VM-entry, because: > > 1. local_irq_disable > 2. vcpu->mode = IN_GUEST_MODE > 3. vmenter > 4. vmexit > 5. vcpu->mode = OUTSIDE_GUEST_MODE > > If PIR-IPI-interrupt is sent between an event which triggers VM-exit > (for example, an external interrupt due to a device), and step 5 > (assignment of vcpu->mode), the PIR->VIRR transfer before vmentry must > be made. The check for the outstanding PIR notification bit must be made after disabling local interrupts, in vcpu_enter_guest. If the PIR notification bit is set, KVM_REQ_EVENT should be set in vcpu->requests, so that ->run is not executed and vcpu_enter_guest is reexecuted. Then the PIR->VIRR SW transfer can be made in KVM_REQ_EVENT processing (in a separate function please, not from get_interrupt).
On Wed, Jan 30, 2013 at 09:03:11PM -0200, Marcelo Tosatti wrote: > Some comments: > > Enable ack-on-exit feature patch: > 1) Value of register VM_EXIT_INTR_INFO is available at > vmx->exit_intr_info. See commit > 887864758580c80710947c38a4692032163777df. > It is available only for the EXIT_REASON_MCE_DURING_VMENTRY and EXIT_REASON_EXCEPTION_NMI exit reasons. To use vmx->exit_intr_info it needs to be updated on EXIT_REASON_EXTERNAL_INTERRUPT too, and vmx_handle_external_intr() will have to check that vmx->exit_reason == EXIT_REASON_EXTERNAL_INTERRUPT before using it. -- Gleb.
On Thu, Dec 13, 2012 at 03:29:40PM +0800, Yang Zhang wrote: > From: Yang Zhang <yang.z.zhang@Intel.com> > > Posted Interrupt allows APIC interrupts to inject into guest directly > without any vmexit. > > - When delivering a interrupt to guest, if target vcpu is running, > update Posted-interrupt requests bitmap and send a notification event > to the vcpu. Then the vcpu will handle this interrupt automatically, > without any software involvemnt. > > - If target vcpu is not running or there already a notification event > pending in the vcpu, do nothing. The interrupt will be handled by > next vm entry. > > Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com> > --- > arch/x86/include/asm/entry_arch.h | 1 + > arch/x86/include/asm/hw_irq.h | 1 + > arch/x86/include/asm/irq.h | 1 + > arch/x86/include/asm/irq_vectors.h | 4 + > arch/x86/include/asm/kvm_host.h | 3 + > arch/x86/include/asm/vmx.h | 4 + > arch/x86/kernel/entry_64.S | 2 + > arch/x86/kernel/irq.c | 25 +++++++ > arch/x86/kernel/irqinit.c | 2 + > arch/x86/kvm/lapic.c | 16 +++- > arch/x86/kvm/lapic.h | 1 + > arch/x86/kvm/vmx.c | 133 +++++++++++++++++++++++++++++++++--- > 12 files changed, 180 insertions(+), 13 deletions(-) > > diff --git a/arch/x86/include/asm/entry_arch.h b/arch/x86/include/asm/entry_arch.h > index 40afa00..7b0a29e 100644 > --- a/arch/x86/include/asm/entry_arch.h > +++ b/arch/x86/include/asm/entry_arch.h > @@ -18,6 +18,7 @@ BUILD_INTERRUPT(reboot_interrupt,REBOOT_VECTOR) > #endif > > BUILD_INTERRUPT(x86_platform_ipi, X86_PLATFORM_IPI_VECTOR) > +BUILD_INTERRUPT(posted_intr_ipi, POSTED_INTR_VECTOR) > > /* > * every pentium local APIC has two 'local interrupts', with a > diff --git a/arch/x86/include/asm/hw_irq.h b/arch/x86/include/asm/hw_irq.h > index eb92a6e..ee61af3 100644 > --- a/arch/x86/include/asm/hw_irq.h > +++ b/arch/x86/include/asm/hw_irq.h > @@ -28,6 +28,7 @@ > /* Interrupt handlers registered during init_IRQ */ > extern void apic_timer_interrupt(void); > extern void x86_platform_ipi(void); > 
+extern void posted_intr_ipi(void); > extern void error_interrupt(void); > extern void irq_work_interrupt(void); > > diff --git a/arch/x86/include/asm/irq.h b/arch/x86/include/asm/irq.h > index ba870bb..cff9933 100644 > --- a/arch/x86/include/asm/irq.h > +++ b/arch/x86/include/asm/irq.h > @@ -30,6 +30,7 @@ extern void irq_force_complete_move(int); > #endif > > extern void (*x86_platform_ipi_callback)(void); > +extern void (*posted_intr_callback)(void); > extern void native_init_IRQ(void); > extern bool handle_irq(unsigned irq, struct pt_regs *regs); > > diff --git a/arch/x86/include/asm/irq_vectors.h b/arch/x86/include/asm/irq_vectors.h > index 1508e51..8f2e383 100644 > --- a/arch/x86/include/asm/irq_vectors.h > +++ b/arch/x86/include/asm/irq_vectors.h > @@ -102,6 +102,10 @@ > */ > #define X86_PLATFORM_IPI_VECTOR 0xf7 > > +#ifdef CONFIG_HAVE_KVM Users of POSTED_INTR_VECTOR are not under ifdef, which means compilation will fails with kvm disabled. Test it please. > +#define POSTED_INTR_VECTOR 0xf2 > +#endif > + > /* > * IRQ work vector: > */ > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h > index 7e26d1a..82423a8 100644 > --- a/arch/x86/include/asm/kvm_host.h > +++ b/arch/x86/include/asm/kvm_host.h > @@ -700,6 +700,9 @@ struct kvm_x86_ops { > int (*has_virtual_interrupt_delivery)(struct kvm_vcpu *vcpu); > void (*update_irq)(struct kvm_vcpu *vcpu); > void (*update_eoi_exitmap)(struct kvm_vcpu *vcpu, int vector, bool set); > + int (*has_posted_interrupt)(struct kvm_vcpu *vcpu); > + int (*send_nv)(struct kvm_vcpu *vcpu, int vector); > + void (*update_irr)(struct kvm_vcpu *vcpu); > int (*set_tss_addr)(struct kvm *kvm, unsigned int addr); > int (*get_tdp_level)(void); > u64 (*get_mt_mask)(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio); > diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h > index 1003341..7b9e1d0 100644 > --- a/arch/x86/include/asm/vmx.h > +++ b/arch/x86/include/asm/vmx.h > @@ -152,6 +152,7 @@ > #define 
PIN_BASED_EXT_INTR_MASK 0x00000001 > #define PIN_BASED_NMI_EXITING 0x00000008 > #define PIN_BASED_VIRTUAL_NMIS 0x00000020 > +#define PIN_BASED_POSTED_INTR 0x00000080 > > #define VM_EXIT_SAVE_DEBUG_CONTROLS 0x00000002 > #define VM_EXIT_HOST_ADDR_SPACE_SIZE 0x00000200 > @@ -174,6 +175,7 @@ > /* VMCS Encodings */ > enum vmcs_field { > VIRTUAL_PROCESSOR_ID = 0x00000000, > + POSTED_INTR_NV = 0x00000002, > GUEST_ES_SELECTOR = 0x00000800, > GUEST_CS_SELECTOR = 0x00000802, > GUEST_SS_SELECTOR = 0x00000804, > @@ -208,6 +210,8 @@ enum vmcs_field { > VIRTUAL_APIC_PAGE_ADDR_HIGH = 0x00002013, > APIC_ACCESS_ADDR = 0x00002014, > APIC_ACCESS_ADDR_HIGH = 0x00002015, > + POSTED_INTR_DESC_ADDR = 0x00002016, > + POSTED_INTR_DESC_ADDR_HIGH = 0x00002017, > EPT_POINTER = 0x0000201a, > EPT_POINTER_HIGH = 0x0000201b, > EOI_EXIT_BITMAP0 = 0x0000201c, > diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S > index b51b2c7..d06eea1 100644 > --- a/arch/x86/kernel/entry_64.S > +++ b/arch/x86/kernel/entry_64.S > @@ -1160,6 +1160,8 @@ apicinterrupt LOCAL_TIMER_VECTOR \ > apic_timer_interrupt smp_apic_timer_interrupt > apicinterrupt X86_PLATFORM_IPI_VECTOR \ > x86_platform_ipi smp_x86_platform_ipi > +apicinterrupt POSTED_INTR_VECTOR \ > + posted_intr_ipi smp_posted_intr_ipi > > apicinterrupt THRESHOLD_APIC_VECTOR \ > threshold_interrupt smp_threshold_interrupt > diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c > index e4595f1..781d324 100644 > --- a/arch/x86/kernel/irq.c > +++ b/arch/x86/kernel/irq.c > @@ -22,6 +22,9 @@ atomic_t irq_err_count; > > /* Function pointer for generic interrupt vector handling */ > void (*x86_platform_ipi_callback)(void) = NULL; > +/* Function pointer for posted interrupt vector handling */ > +void (*posted_intr_callback)(void) = NULL; > +EXPORT_SYMBOL_GPL(posted_intr_callback); > > /* > * 'what should we do if we get a hw irq event on an illegal vector'. 
> @@ -228,6 +231,28 @@ void smp_x86_platform_ipi(struct pt_regs *regs) > set_irq_regs(old_regs); > } > > +/* > + * Handler for POSTED_INTERRUPT_VECTOR. > + */ > +void smp_posted_intr_ipi(struct pt_regs *regs) > +{ > + struct pt_regs *old_regs = set_irq_regs(regs); > + > + ack_APIC_irq(); > + > + irq_enter(); > + > + exit_idle(); > + > + if (posted_intr_callback) > + posted_intr_callback(); > + The callback does nothing. Drop it. > + irq_exit(); > + > + set_irq_regs(old_regs); > +} > + > + > EXPORT_SYMBOL_GPL(vector_used_by_percpu_irq); > > #ifdef CONFIG_HOTPLUG_CPU > diff --git a/arch/x86/kernel/irqinit.c b/arch/x86/kernel/irqinit.c > index 6e03b0d..d15ca4f 100644 > --- a/arch/x86/kernel/irqinit.c > +++ b/arch/x86/kernel/irqinit.c > @@ -205,6 +205,8 @@ static void __init apic_intr_init(void) > > /* IPI for X86 platform specific use */ > alloc_intr_gate(X86_PLATFORM_IPI_VECTOR, x86_platform_ipi); > + /* IPI for posted interrupt use */ > + alloc_intr_gate(POSTED_INTR_VECTOR, posted_intr_ipi); > > /* IPI vectors for APIC spurious and error interrupts */ > alloc_intr_gate(SPURIOUS_APIC_VECTOR, spurious_interrupt); > diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c > index 2109a6a..d660b9d 100644 > --- a/arch/x86/kvm/lapic.c > +++ b/arch/x86/kvm/lapic.c > @@ -350,6 +350,7 @@ static inline int apic_find_highest_irr(struct kvm_lapic *apic) > if (!apic->irr_pending) > return -1; > > + kvm_x86_ops->update_irr(apic->vcpu); > result = apic_search_irr(apic); > ASSERT(result == -1 || result >= 16); > > @@ -725,18 +726,25 @@ static int __apic_accept_irq(struct kvm_lapic *apic, int delivery_mode, > if (trig_mode) { > apic_debug("level trig mode for vector %d", vector); > apic_set_vector(vector, apic->regs + APIC_TMR); > - } else > + } else { > apic_clear_vector(vector, apic->regs + APIC_TMR); > - Why doing pi only for edge triggered interrupts? > + if (kvm_x86_ops->has_posted_interrupt(vcpu)) { Drop has_posted_interrupt(). Just call send_nv() directly. 
And give it more descriptive name. > + result = 1; > + apic->irr_pending = true; This is always true with vid anyway. > + kvm_x86_ops->send_nv(vcpu, vector); > + goto out; > + } > + } > result = !apic_test_and_set_irr(vector, apic); > - trace_kvm_apic_accept_irq(vcpu->vcpu_id, delivery_mode, > - trig_mode, vector, !result); > if (!result) { > if (trig_mode) > apic_debug("level trig mode repeatedly for " > "vector %d", vector); > break; > } > +out: > + trace_kvm_apic_accept_irq(vcpu->vcpu_id, delivery_mode, > + trig_mode, vector, !result); No trace if !result now. > > kvm_make_request(KVM_REQ_EVENT, vcpu); > kvm_vcpu_kick(vcpu); What is the point of sending notification vector to vcpu is you kick it out of a guest mode immediately after? > diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h > index 10e3f66..0f8361e 100644 > --- a/arch/x86/kvm/lapic.h > +++ b/arch/x86/kvm/lapic.h > @@ -42,6 +42,7 @@ int kvm_get_apic_interrupt(struct kvm_vcpu *vcpu); > int kvm_cpu_has_extint(struct kvm_vcpu *v); > int kvm_cpu_get_extint(struct kvm_vcpu *v); > int kvm_apic_get_highest_irr(struct kvm_vcpu *vcpu); > +void kvm_apic_update_irr(struct kvm_vcpu *vcpu, unsigned int *pir); > void kvm_lapic_reset(struct kvm_vcpu *vcpu); > u64 kvm_lapic_get_cr8(struct kvm_vcpu *vcpu); > void kvm_lapic_set_tpr(struct kvm_vcpu *vcpu, unsigned long cr8); > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c > index 6b6bd03..07dbde6 100644 > --- a/arch/x86/kvm/vmx.c > +++ b/arch/x86/kvm/vmx.c > @@ -31,6 +31,7 @@ > #include <linux/ftrace_event.h> > #include <linux/slab.h> > #include <linux/tboot.h> > +#include <linux/interrupt.h> > #include "kvm_cache_regs.h" > #include "x86.h" > > @@ -86,6 +87,8 @@ module_param(fasteoi, bool, S_IRUGO); > static bool __read_mostly enable_apicv_reg_vid = 1; > module_param(enable_apicv_reg_vid, bool, S_IRUGO); > > +static bool __read_mostly enable_apicv_pi = 1; > +module_param(enable_apicv_pi, bool, S_IRUGO); > /* > * If nested=1, nested virtualization is 
supported, i.e., guests may use > * VMX and be a hypervisor for its own guests. If nested=0, guests may not > @@ -369,6 +372,35 @@ struct nested_vmx { > struct page *apic_access_page; > }; > > +#define POSTED_INTR_ON 0 > +/* Posted-Interrupt Descriptor */ > +struct pi_desc { > + u32 pir[8]; /* Posted interrupt requested */ > + union { > + struct { > + u8 on:1, > + rsvd:7; > + } control; > + u32 rsvd[8]; > + } u; > +} __aligned(64); > + > +static void pi_clear_on(struct pi_desc *pi_desc) > +{ > + clear_bit(POSTED_INTR_ON, (unsigned long *)&pi_desc->u.control); > +} > + > +static u8 pi_test_and_set_on(struct pi_desc *pi_desc) > +{ > + return test_and_set_bit(POSTED_INTR_ON, > + (unsigned long *)&pi_desc->u.control); > +} > + > +static void pi_set_pir(int vector, struct pi_desc *pi_desc) > +{ > + set_bit(vector, (unsigned long *)pi_desc->pir); > +} > + > struct vcpu_vmx { > struct kvm_vcpu vcpu; > unsigned long host_rsp; > @@ -435,6 +467,9 @@ struct vcpu_vmx { > u8 eoi_exitmap_changed; > u32 eoi_exit_bitmap[8]; > > + /* Posted interrupt descriptor */ > + struct pi_desc *pi; > + I am not convinced we should try to save 46 bytes here when !pi. 
> /* Support for a guest hypervisor (nested VMX) */ > struct nested_vmx nested; > }; > @@ -779,6 +814,11 @@ static inline bool cpu_has_vmx_virtual_intr_delivery(void) > SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY; > } > > +static inline bool cpu_has_vmx_posted_intr(void) > +{ > + return vmcs_config.pin_based_exec_ctrl & PIN_BASED_POSTED_INTR; > +} > + > static inline bool cpu_has_vmx_flexpriority(void) > { > return cpu_has_vmx_tpr_shadow() && > @@ -2475,12 +2515,6 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf) > u32 _vmexit_control = 0; > u32 _vmentry_control = 0; > > - min = PIN_BASED_EXT_INTR_MASK | PIN_BASED_NMI_EXITING; > - opt = PIN_BASED_VIRTUAL_NMIS; > - if (adjust_vmx_controls(min, opt, MSR_IA32_VMX_PINBASED_CTLS, > - &_pin_based_exec_control) < 0) > - return -EIO; > - > min = CPU_BASED_HLT_EXITING | > #ifdef CONFIG_X86_64 > CPU_BASED_CR8_LOAD_EXITING | > @@ -2554,6 +2588,17 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf) > &_vmexit_control) < 0) > return -EIO; > > + min = PIN_BASED_EXT_INTR_MASK | PIN_BASED_NMI_EXITING; > + opt = PIN_BASED_VIRTUAL_NMIS | PIN_BASED_POSTED_INTR; > + if (adjust_vmx_controls(min, opt, MSR_IA32_VMX_PINBASED_CTLS, > + &_pin_based_exec_control) < 0) > + return -EIO; > + > + if (!(_cpu_based_2nd_exec_control & > + SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY) || > + !(_vmexit_control & VM_EXIT_ACK_INTR_ON_EXIT)) > + _pin_based_exec_control &= ~PIN_BASED_POSTED_INTR; > + > min = 0; > opt = VM_ENTRY_LOAD_IA32_PAT; > if (adjust_vmx_controls(min, opt, MSR_IA32_VMX_ENTRY_CTLS, > @@ -2739,6 +2784,9 @@ static __init int hardware_setup(void) > if (enable_apicv_reg_vid) > kvm_x86_ops->update_cr8_intercept = NULL; > > + if (!cpu_has_vmx_posted_intr() || !enable_apicv_reg_vid) > + enable_apicv_pi = 0; > + > if (nested) > nested_vmx_setup_ctls_msrs(); > > @@ -3904,6 +3952,57 @@ static void ept_set_mmio_spte_mask(void) > kvm_mmu_set_mmio_spte_mask(0xffull << 49 | 0x6ull); > } > > +static void pi_handler(void) > +{ 
> + ; > +} > + > +static int vmx_has_posted_interrupt(struct kvm_vcpu *vcpu) > +{ > + return irqchip_in_kernel(vcpu->kvm) && enable_apicv_pi; > +} > + > +static int vmx_send_nv(struct kvm_vcpu *vcpu, > + int vector) > +{ > + struct vcpu_vmx *vmx = to_vmx(vcpu); > + > + pi_set_pir(vector, vmx->pi); > + if (!pi_test_and_set_on(vmx->pi) && (vcpu->mode == IN_GUEST_MODE)) { > + apic->send_IPI_mask(get_cpu_mask(vcpu->cpu), POSTED_INTR_VECTOR); > + return 1; > + } > + return 0; > +} Return value is not checked by the caller. > + > +static void vmx_update_irr(struct kvm_vcpu *vcpu) > +{ > + struct vcpu_vmx *vmx = to_vmx(vcpu); > + struct kvm_lapic *apic = vcpu->arch.apic; > + unsigned int i, old, new, val, irr_off; Spaces instead of tabs. Not only here. Run through checkpatch. > + > + if (!enable_apicv_pi) > + return; > + Why no check for pit->on before syncing? > + for (i = 0; i <= 7; i++) { > + if (vmx->pi->pir[i]) { > + irr_off = APIC_IRR + i * 0x10; > + do { > + old = kvm_apic_get_reg(apic, irr_off); > + new = old | vmx->pi->pir[i]; > + val = cmpxchg((u32 *)(apic->regs + irr_off), old, new); > + } while (unlikely (val != old)); > + vmx->pi->pir[i] = 0; > + } > + } > +} > + > +static void free_pi(struct vcpu_vmx *vmx) > +{ > + if (enable_apicv_pi) > + kfree(vmx->pi); > +} > + > /* > * Sets up the vmcs for emulated real mode. 
> */ > @@ -3913,6 +4012,7 @@ static int vmx_vcpu_setup(struct vcpu_vmx *vmx) > unsigned long a; > #endif > int i; > + u32 pin_based_exec_ctrl = vmcs_config.pin_based_exec_ctrl; > u32 vmexit_ctrl = vmcs_config.vmexit_ctrl; > > /* I/O */ > @@ -3925,8 +4025,10 @@ static int vmx_vcpu_setup(struct vcpu_vmx *vmx) > vmcs_write64(VMCS_LINK_POINTER, -1ull); /* 22.3.1.5 */ > > /* Control */ > - vmcs_write32(PIN_BASED_VM_EXEC_CONTROL, > - vmcs_config.pin_based_exec_ctrl); > + if (!enable_apicv_pi) > + pin_based_exec_ctrl &= ~PIN_BASED_POSTED_INTR; > + > + vmcs_write32(PIN_BASED_VM_EXEC_CONTROL, pin_based_exec_ctrl); > > vmcs_write32(CPU_BASED_VM_EXEC_CONTROL, vmx_exec_control(vmx)); > > @@ -3944,6 +4046,13 @@ static int vmx_vcpu_setup(struct vcpu_vmx *vmx) > vmcs_write16(GUEST_INTR_STATUS, 0); > } > > + if (enable_apicv_pi) { > + vmx->pi = kmalloc(sizeof(struct pi_desc), > + GFP_KERNEL | __GFP_ZERO); kzalloc(). > + vmcs_write64(POSTED_INTR_NV, POSTED_INTR_VECTOR); > + vmcs_write64(POSTED_INTR_DESC_ADDR, __pa((vmx->pi))); > + } > + > if (ple_gap) { > vmcs_write32(PLE_GAP, ple_gap); > vmcs_write32(PLE_WINDOW, ple_window); > @@ -6220,6 +6329,8 @@ static void vmx_update_irq(struct kvm_vcpu *vcpu) > vmx->eoi_exit_bitmap[index]); > vmx->eoi_exitmap_changed = 0; > } > + if (enable_apicv_pi) > + pi_clear_on(vmx->pi); > } > > static void vmx_update_eoi_exitmap(struct kvm_vcpu *vcpu, > @@ -6626,6 +6737,7 @@ static void vmx_free_vcpu(struct kvm_vcpu *vcpu) > > free_vpid(vmx); > free_nested(vmx); > + free_pi(vmx); > free_loaded_vmcs(vmx->loaded_vmcs); > kfree(vmx->guest_msrs); > kvm_vcpu_uninit(vcpu); > @@ -7520,8 +7632,11 @@ static struct kvm_x86_ops vmx_x86_ops = { > .enable_irq_window = enable_irq_window, > .update_cr8_intercept = update_cr8_intercept, > .has_virtual_interrupt_delivery = vmx_has_virtual_interrupt_delivery, > + .has_posted_interrupt = vmx_has_posted_interrupt, > .update_irq = vmx_update_irq, > .update_eoi_exitmap = vmx_update_eoi_exitmap, > + .send_nv = vmx_send_nv, > 
+ .update_irr = vmx_update_irr, > SVM? > .set_tss_addr = vmx_set_tss_addr, > .get_tdp_level = get_ept_level, > @@ -7618,7 +7733,7 @@ static int __init vmx_init(void) > /* SELF-IPI */ > vmx_disable_intercept_for_msr_write(0x83f, false); > } > - > + posted_intr_callback = pi_handler; > if (enable_ept) { > kvm_mmu_set_mask_ptes(0ull, > (enable_ept_ad_bits) ? VMX_EPT_ACCESS_BIT : 0ull, > -- > 1.7.1 -- Gleb. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Jan 30, 2013 at 09:03:11PM -0200, Marcelo Tosatti wrote: > Posted interrupt patch: > 2) Must move IN_GUEST_MODE assignment after local_irq_disable, in > vcpu_enter_guest function. Otherwise: > > cpu0 vcpu1<->cpu1 > > vcpu->mode = IN_GUEST_MODE > > if IN_GUEST_MODE == true > send IPI > local_irq_disable > > PIR not transferred to VIRR, misses interrupt. > cpu0 will set KVM_REQ_EVENT, so vmentry will be aborted after local_irq_disable() by ->requests check. > 3) Must check outstanding PIR notification bit unconditionally on > every VM-entry, because: > > 1. local_irq_disable > 2. vcpu->mode = IN_GUEST_MODE > 3. vmenter > 4. vmexit > 5. vcpu->mode = OUTSIDE_GUEST_MODE > > If PIR-IPI-interrupt is sent between an event which triggers VM-exit > (for example, an external interrupt due to a device), and step 5 > (assignment of vcpu->mode), the PIR->VIRR transfer before vmentry must > be made. Not sure I understand, but I think KVM_REQ_EVENT will cover that too. > > 4) Today, an interrupt notification is cached on IRR until its delivered - further > interrupt injection is not generating further interrupt notification > bits. With PIR, behaviour changes: Its possible to have one bit in PIR and another > on IRR APIC page (if timing is right). Is this harmless? Why? > > -- Gleb.
On Thu, Jan 31, 2013 at 11:43:48AM +0200, Gleb Natapov wrote: > On Wed, Jan 30, 2013 at 09:03:11PM -0200, Marcelo Tosatti wrote: > > Posted interrupt patch: > > 2) Must move IN_GUEST_MODE assignment after local_irq_disable, in > > vcpu_enter_guest function. Otherwise: > > > > cpu0 vcpu1<->cpu1 > > > > vcpu->mode = IN_GUEST_MODE > > > > if IN_GUEST_MODE == true > > send IPI > > local_irq_disable > > > > PIR not transferred to VIRR, misses interrupt. > > > cpu0 will set KVM_REQ_EVENT, so vmentry will be aborted after > local_irq_disable() by ->requests check. Yes, but you don't want KVM_REQ_EVENT+kick. It defeats the purpose of posted interrupts. You want if vcpu in guest mode send posted interrupt IPI else KVM_REQ_EVENT+kick > > 3) Must check outstanding PIR notification bit unconditionally on > > every VM-entry, because: > > > > 1. local_irq_disable > > 2. vcpu->mode = IN_GUEST_MODE > > 3. vmenter > > 4. vmexit > > 5. vcpu->mode = OUTSIDE_GUEST_MODE > > > > If PIR-IPI-interrupt is sent between an event which triggers VM-exit > > (for example, an external interrupt due to a device), and step 5 > > (assignment of vcpu->mode), the PIR->VIRR transfer before vmentry must > > be made. > Not sure I understand, but I think KVM_REQ_EVENT will cover that too. See above. > > > > 4) Today, an interrupt notification is cached on IRR until its delivered - further > > interrupt injection is not generating further interrupt notification > > bits. With PIR, behaviour changes: Its possible to have one bit in PIR and another > > on IRR APIC page (if timing is right). Is this harmless? Why? > > > > > > -- > Gleb.
On Thu, Jan 31, 2013 at 11:32:45AM -0200, Marcelo Tosatti wrote: > On Thu, Jan 31, 2013 at 11:43:48AM +0200, Gleb Natapov wrote: > > On Wed, Jan 30, 2013 at 09:03:11PM -0200, Marcelo Tosatti wrote: > > > Posted interrupt patch: > > > 2) Must move IN_GUEST_MODE assignment after local_irq_disable, in > > > vcpu_enter_guest function. Otherwise: > > > > > > cpu0 vcpu1<->cpu1 > > > > > > vcpu->mode = IN_GUEST_MODE > > > > > > if IN_GUEST_MODE == true > > > send IPI > > > local_irq_disable > > > > > > PIR not transferred to VIRR, misses interrupt. > > > > > cpu0 will set KVM_REQ_EVENT, so vmentry will be aborted after > > local_irq_disable() by ->requests check. > > Yes, but you don't want KVM_REQ_EVENT+kick. It defeats the purpose > of posted interrupts. You want > > if vcpu in guest mode > send posted interrupt IPI > else > KVM_REQ_EVENT+kick > I am thinking: set KVM_REQ_EVENT if pi is enabled send posted interrupt IPI else kick > > > 3) Must check outstanding PIR notification bit unconditionally on > > > every VM-entry, because: > > > > > > 1. local_irq_disable > > > 2. vcpu->mode = IN_GUEST_MODE > > > 3. vmenter > > > 4. vmexit > > > 5. vcpu->mode = OUTSIDE_GUEST_MODE > > > > > > If PIR-IPI-interrupt is sent between an event which triggers VM-exit > > > (for example, an external interrupt due to a device), and step 5 > > > (assignment of vcpu->mode), the PIR->VIRR transfer before vmentry must > > > be made. > > Not sure I understand, but I think KVM_REQ_EVENT will cover that too. > > See above. > > > > > > > 4) Today, an interrupt notification is cached on IRR until its delivered - further > > > interrupt injection is not generating further interrupt notification > > > bits. With PIR, behaviour changes: Its possible to have one bit in PIR and another > > > on IRR APIC page (if timing is right). Is this harmless? Why? > > > > > > > > > > -- > > Gleb. -- Gleb. 
On Thu, Jan 31, 2013 at 03:38:37PM +0200, Gleb Natapov wrote: > On Thu, Jan 31, 2013 at 11:32:45AM -0200, Marcelo Tosatti wrote: > > On Thu, Jan 31, 2013 at 11:43:48AM +0200, Gleb Natapov wrote: > > > On Wed, Jan 30, 2013 at 09:03:11PM -0200, Marcelo Tosatti wrote: > > > > Posted interrupt patch: > > > > 2) Must move IN_GUEST_MODE assignment after local_irq_disable, in > > > > vcpu_enter_guest function. Otherwise: > > > > > > > > cpu0 vcpu1<->cpu1 > > > > > > > > vcpu->mode = IN_GUEST_MODE > > > > > > > > if IN_GUEST_MODE == true > > > > send IPI > > > > local_irq_disable > > > > > > > > PIR not transferred to VIRR, misses interrupt. > > > > > > > cpu0 will set KVM_REQ_EVENT, so vmentry will be aborted after > > > local_irq_disable() by ->requests check. > > > > Yes, but you don't want KVM_REQ_EVENT+kick. It defeats the purpose > > of posted interrupts. You want > > > > if vcpu in guest mode > > send posted interrupt IPI > > else > > KVM_REQ_EVENT+kick > > > I am thinking: > > set KVM_REQ_EVENT > if pi is enabled > send posted interrupt IPI > else > kick KVM_REQ_EVENT must be after sending posted interrupt IPI. Otherwise on the vcpu entry side test_and_clear(KVM_REQ_EVENT) { No bits set in PIR } What about item 4 below? > > > > 3) Must check outstanding PIR notification bit unconditionally on > > > > every VM-entry, because: > > > > > > > > 1. local_irq_disable > > > > 2. vcpu->mode = IN_GUEST_MODE > > > > 3. vmenter > > > > 4. vmexit > > > > 5. vcpu->mode = OUTSIDE_GUEST_MODE > > > > > > > > If PIR-IPI-interrupt is sent between an event which triggers VM-exit > > > > (for example, an external interrupt due to a device), and step 5 > > > > (assignment of vcpu->mode), the PIR->VIRR transfer before vmentry must > > > > be made. > > > Not sure I understand, but I think KVM_REQ_EVENT will cover that too. > > > > See above. 
> > > > > > > > > > 4) Today, an interrupt notification is cached on IRR until its delivered - further > > > > interrupt injection is not generating further interrupt notification > > > > bits. With PIR, behaviour changes: Its possible to have one bit in PIR and another > > > > on IRR APIC page (if timing is right). Is this harmless? Why? > > > > > > > > > > > > > > -- > > > Gleb. > > -- > Gleb.
On Thu, Jan 31, 2013 at 11:44:43AM -0200, Marcelo Tosatti wrote: > On Thu, Jan 31, 2013 at 03:38:37PM +0200, Gleb Natapov wrote: > > On Thu, Jan 31, 2013 at 11:32:45AM -0200, Marcelo Tosatti wrote: > > > On Thu, Jan 31, 2013 at 11:43:48AM +0200, Gleb Natapov wrote: > > > > On Wed, Jan 30, 2013 at 09:03:11PM -0200, Marcelo Tosatti wrote: > > > > > Posted interrupt patch: > > > > > 2) Must move IN_GUEST_MODE assignment after local_irq_disable, in > > > > > vcpu_enter_guest function. Otherwise: > > > > > > > > > > cpu0 vcpu1<->cpu1 > > > > > > > > > > vcpu->mode = IN_GUEST_MODE > > > > > > > > > > if IN_GUEST_MODE == true > > > > > send IPI > > > > > local_irq_disable > > > > > > > > > > PIR not transferred to VIRR, misses interrupt. > > > > > > > > > cpu0 will set KVM_REQ_EVENT, so vmentry will be aborted after > > > > local_irq_disable() by ->requests check. > > > > > > Yes, but you don't want KVM_REQ_EVENT+kick. It defeats the purpose > > > of posted interrupts. You want > > > > > > if vcpu in guest mode > > > send posted interrupt IPI > > > else > > > KVM_REQ_EVENT+kick > > > > > I am thinking: > > > > set KVM_REQ_EVENT > > if pi is enabled > > send posted interrupt IPI > > else > > kick > > KVM_REQ_EVENT must be after sending posted interrupt IPI. Otherwise on > the vcpu entry side > > test_and_clear(KVM_REQ_EVENT) { > No bits set in PIR > } > It should be after updating PIR, but before sending posted interrupt IPI. Otherwise: cpu0 cpu1/vcpu KVM_REQ_EVENT is not set set pir send IPI irq_disable() ->request is empty. set KVM_REQ_EVENT That's the same sequence as with IRR update, KVM_REQ_EVENT and kick today. > What about item 4 below? > That's for Yang to answer :) > > > > > 3) Must check outstanding PIR notification bit unconditionally on > > > > > every VM-entry, because: > > > > > > > > > > 1. local_irq_disable > > > > > 2. vcpu->mode = IN_GUEST_MODE > > > > > 3. vmenter > > > > > 4. vmexit > > > > > 5. 
vcpu->mode = OUTSIDE_GUEST_MODE > > > > > > > > > > If PIR-IPI-interrupt is sent between an event which triggers VM-exit > > > > > (for example, an external interrupt due to a device), and step 5 > > > > > (assignment of vcpu->mode), the PIR->VIRR transfer before vmentry must > > > > > be made. > > > > Not sure I understand, but I think KVM_REQ_EVENT will cover that too. > > > > > > See above. > > > > > > > > > > > > > 4) Today, an interrupt notification is cached on IRR until its delivered - further > > > > > interrupt injection is not generating further interrupt notification > > > > > bits. With PIR, behaviour changes: Its possible to have one bit in PIR and another > > > > > on IRR APIC page (if timing is right). Is this harmless? Why? > > > > > > > > > > > > > > > > > > -- > > > > Gleb. > > > > -- > > Gleb. -- Gleb. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Jan 31, 2013 at 03:55:56PM +0200, Gleb Natapov wrote: > On Thu, Jan 31, 2013 at 11:44:43AM -0200, Marcelo Tosatti wrote: > > On Thu, Jan 31, 2013 at 03:38:37PM +0200, Gleb Natapov wrote: > > > On Thu, Jan 31, 2013 at 11:32:45AM -0200, Marcelo Tosatti wrote: > > > > On Thu, Jan 31, 2013 at 11:43:48AM +0200, Gleb Natapov wrote: > > > > > On Wed, Jan 30, 2013 at 09:03:11PM -0200, Marcelo Tosatti wrote: > > > > > > Posted interrupt patch: > > > > > > 2) Must move IN_GUEST_MODE assignment after local_irq_disable, in > > > > > > vcpu_enter_guest function. Otherwise: > > > > > > > > > > > > cpu0 vcpu1<->cpu1 > > > > > > > > > > > > vcpu->mode = IN_GUEST_MODE > > > > > > > > > > > > if IN_GUEST_MODE == true > > > > > > send IPI > > > > > > local_irq_disable > > > > > > > > > > > > PIR not transferred to VIRR, misses interrupt. > > > > > > > > > > > cpu0 will set KVM_REQ_EVENT, so vmentry will be aborted after > > > > > local_irq_disable() by ->requests check. > > > > > > > > Yes, but you don't want KVM_REQ_EVENT+kick. It defeats the purpose > > > > of posted interrupts. You want > > > > > > > > if vcpu in guest mode > > > > send posted interrupt IPI > > > > else > > > > KVM_REQ_EVENT+kick > > > > > > > I am thinking: > > > > > > set KVM_REQ_EVENT > > > if pi is enabled > > > send posted interrupt IPI > > > else > > > kick > > > > KVM_REQ_EVENT must be after sending posted interrupt IPI. Otherwise on > > the vcpu entry side > > > > test_and_clear(KVM_REQ_EVENT) { > > No bits set in PIR > > } > > > It should be after updating PIR, but before sending posted interrupt > IPI. Otherwise: > > cpu0 cpu1/vcpu > > KVM_REQ_EVENT is not set > set pir > send IPI > irq_disable() > ->request is empty. > set KVM_REQ_EVENT > > That's the same sequence as with IRR update, KVM_REQ_EVENT and kick > today. Can only send IPI if vcpu->mode == IN_GUEST_MODE, which must be set after interrupt flag is cleared as noted. 
Also KVM_REQ_EVENT is processed outside of interrupt disabled region today. Or maybe i don't get what you say... write a complete description? > > What about item 4 below? > > > That's for Yang to answer :) "If more than one interrupt is generated with the same vector number, the local APIC can set the bit for the vector both in the IRR and ISR. This means that for the Pentium 4 and Intel Xeon processors, the IRR and ISR can queue two interrupts for each interrupt vector: one in the IRR and one in the ISR. Any additional interrupts issued for the same interrupt vector are collapsed into the single bit in IRR" Which would mean KVM emulation differs... "Eventually 3 interrupts can be queued: one in IRR, one in ISR, one in PIR". Any example how software relies on such two-interrupts-queued-in-IRR/ISR behaviour? -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Sorry for the late response. Marcelo Tosatti wrote on 2013-02-04: > On Thu, Jan 31, 2013 at 03:55:56PM +0200, Gleb Natapov wrote: >> On Thu, Jan 31, 2013 at 11:44:43AM -0200, Marcelo Tosatti wrote: >>> On Thu, Jan 31, 2013 at 03:38:37PM +0200, Gleb Natapov wrote: >>>> On Thu, Jan 31, 2013 at 11:32:45AM -0200, Marcelo Tosatti wrote: >>>>> On Thu, Jan 31, 2013 at 11:43:48AM +0200, Gleb Natapov wrote: >>>>>> On Wed, Jan 30, 2013 at 09:03:11PM -0200, Marcelo Tosatti wrote: >>>>>>> Posted interrupt patch: >>>>>>> 2) Must move IN_GUEST_MODE assignment after local_irq_disable, in >>>>>>> vcpu_enter_guest function. Otherwise: >>>>>>> >>>>>>> cpu0 vcpu1<->cpu1 >>>>>>> >>>>>>> vcpu->mode = IN_GUEST_MODE >>>>>>> >>>>>>> if IN_GUEST_MODE == true >>>>>>> send IPI >>>>>>> local_irq_disable >>>>>>> >>>>>>> PIR not transferred to VIRR, misses interrupt. >>>>>>> >>>>>> cpu0 will set KVM_REQ_EVENT, so vmentry will be aborted after >>>>>> local_irq_disable() by ->requests check. >>>>> >>>>> Yes, but you don't want KVM_REQ_EVENT+kick. It defeats the purpose >>>>> of posted interrupts. You want >>>>> >>>>> if vcpu in guest mode >>>>> send posted interrupt IPI >>>>> else >>>>> KVM_REQ_EVENT+kick >>>>> >>>> I am thinking: >>>> >>>> set KVM_REQ_EVENT >>>> if pi is enabled >>>> send posted interrupt IPI else kick >>> >>> KVM_REQ_EVENT must be after sending posted interrupt IPI. Otherwise on >>> the vcpu entry side >>> >>> test_and_clear(KVM_REQ_EVENT) { >>> No bits set in PIR >>> } >> It should be after updating PIR, but before sending posted interrupt >> IPI. Otherwise: >> >> cpu0 cpu1/vcpu >> KVM_REQ_EVENT is not set >> set pir >> send IPI >> irq_disable() >> ->request is empty. >> set KVM_REQ_EVENT >> >> That's the same sequence as with IRR update, KVM_REQ_EVENT and kick >> today. > > Can only send IPI if vcpu->mode == IN_GUEST_MODE, which must be set > after interrupt flag is cleared as noted. > > Also KVM_REQ_EVENT is processed outside of interrupt disabled region today. 
> > Or maybe i don't get what you say... write a complete description? This version patch is too old. And I already noticed this problem and the current solution is like this: +static bool vmx_send_notification_event(struct kvm_vcpu *vcpu, + int vector, int *result) +{ + struct vcpu_vmx *vmx = to_vmx(vcpu); + + *result = !pi_test_and_set_pir(vector, vmx->pi); + if (!pi_test_and_set_on(vmx->pi) && (vcpu->mode == IN_GUEST_MODE)) { + kvm_make_request(KVM_REQ_PENDING_PIR, vcpu); + apic->send_IPI_mask(get_cpu_mask(vcpu->cpu), + POSTED_INTR_VECTOR); + if (!pi_test_on(vmx->pi)) + clear_bit(KVM_REQ_PENDING_PIR, &vcpu->requests) ; + return true; + } + return false; +} For detail, please check the v2 patches. >>> What about item 4 below? >>> >> That's for Yang to answer :) > > "If more than one interrupt is generated with the same vector number, > the local APIC can set the bit for the vector both in the IRR and ISR. > This means that for the Pentium 4 and Intel Xeon processors, the IRR > and ISR can queue two interrupts for each interrupt vector: one in the > IRR and one in the ISR. Any additional interrupts issued for the same > interrupt vector are collapsed into the single bit in IRR" > > Which would mean KVM emulation differs... "Eventually 3 interrupts can > be queued: one in IRR, one in ISR, one in PIR". > > Any example how software relies on such two-interrupts-queued-in-IRR/ISR > behaviour? Before PI, the interrupt handle logic is IRR->ISR. And now it is : PIR->IRR->ISR. So, if we can ensure PIR sync to IRR before each access to IRR, I think there should no problem. Best regards, Yang -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
This patch is too old. Some issues you point out already fixed in v2 patch. Please review v2 patch and give some comments. Gleb Natapov wrote on 2013-01-31: > On Thu, Dec 13, 2012 at 03:29:40PM +0800, Yang Zhang wrote: >> From: Yang Zhang <yang.z.zhang@Intel.com> >> >> Posted Interrupt allows APIC interrupts to inject into guest directly >> without any vmexit. >> >> - When delivering a interrupt to guest, if target vcpu is running, >> update Posted-interrupt requests bitmap and send a notification event >> to the vcpu. Then the vcpu will handle this interrupt automatically, >> without any software involvemnt. >> - If target vcpu is not running or there already a notification event >> pending in the vcpu, do nothing. The interrupt will be handled by >> next vm entry. >> Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com> >> --- >> arch/x86/include/asm/entry_arch.h | 1 + >> arch/x86/include/asm/hw_irq.h | 1 + arch/x86/include/asm/irq.h >> | 1 + arch/x86/include/asm/irq_vectors.h | 4 + >> arch/x86/include/asm/kvm_host.h | 3 + arch/x86/include/asm/vmx.h >> | 4 + arch/x86/kernel/entry_64.S | 2 + >> arch/x86/kernel/irq.c | 25 +++++++ >> arch/x86/kernel/irqinit.c | 2 + arch/x86/kvm/lapic.c >> | 16 +++- arch/x86/kvm/lapic.h | 1 + >> arch/x86/kvm/vmx.c | 133 >> +++++++++++++++++++++++++++++++++--- 12 files changed, 180 >> insertions(+), 13 deletions(-) >> diff --git a/arch/x86/include/asm/entry_arch.h >> b/arch/x86/include/asm/entry_arch.h index 40afa00..7b0a29e 100644 --- >> a/arch/x86/include/asm/entry_arch.h +++ >> b/arch/x86/include/asm/entry_arch.h @@ -18,6 +18,7 @@ >> BUILD_INTERRUPT(reboot_interrupt,REBOOT_VECTOR) >> #endif >> >> BUILD_INTERRUPT(x86_platform_ipi, X86_PLATFORM_IPI_VECTOR) >> +BUILD_INTERRUPT(posted_intr_ipi, POSTED_INTR_VECTOR) >> >> /* >> * every pentium local APIC has two 'local interrupts', with a >> diff --git a/arch/x86/include/asm/hw_irq.h b/arch/x86/include/asm/hw_irq.h >> index eb92a6e..ee61af3 100644 >> --- a/arch/x86/include/asm/hw_irq.h >> 
+++ b/arch/x86/include/asm/hw_irq.h >> @@ -28,6 +28,7 @@ >> /* Interrupt handlers registered during init_IRQ */ extern void >> apic_timer_interrupt(void); extern void x86_platform_ipi(void); >> +extern void posted_intr_ipi(void); extern void error_interrupt(void); >> extern void irq_work_interrupt(void); >> diff --git a/arch/x86/include/asm/irq.h b/arch/x86/include/asm/irq.h >> index ba870bb..cff9933 100644 >> --- a/arch/x86/include/asm/irq.h >> +++ b/arch/x86/include/asm/irq.h >> @@ -30,6 +30,7 @@ extern void irq_force_complete_move(int); >> #endif >> >> extern void (*x86_platform_ipi_callback)(void); +extern void >> (*posted_intr_callback)(void); extern void native_init_IRQ(void); >> extern bool handle_irq(unsigned irq, struct pt_regs *regs); >> diff --git a/arch/x86/include/asm/irq_vectors.h >> b/arch/x86/include/asm/irq_vectors.h index 1508e51..8f2e383 100644 --- >> a/arch/x86/include/asm/irq_vectors.h +++ >> b/arch/x86/include/asm/irq_vectors.h @@ -102,6 +102,10 @@ >> */ >> #define X86_PLATFORM_IPI_VECTOR 0xf7 >> +#ifdef CONFIG_HAVE_KVM > Users of POSTED_INTR_VECTOR are not under ifdef, which means compilation > will fails with kvm disabled. Test it please. 
> >> +#define POSTED_INTR_VECTOR 0xf2 >> +#endif >> + >> /* >> * IRQ work vector: >> */ >> diff --git a/arch/x86/include/asm/kvm_host.h >> b/arch/x86/include/asm/kvm_host.h index 7e26d1a..82423a8 100644 --- >> a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h >> @@ -700,6 +700,9 @@ struct kvm_x86_ops { >> int (*has_virtual_interrupt_delivery)(struct kvm_vcpu *vcpu); >> void (*update_irq)(struct kvm_vcpu *vcpu); >> void (*update_eoi_exitmap)(struct kvm_vcpu *vcpu, int vector, bool set); >> + int (*has_posted_interrupt)(struct kvm_vcpu *vcpu); >> + int (*send_nv)(struct kvm_vcpu *vcpu, int vector); >> + void (*update_irr)(struct kvm_vcpu *vcpu); >> int (*set_tss_addr)(struct kvm *kvm, unsigned int addr); >> int (*get_tdp_level)(void); >> u64 (*get_mt_mask)(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio); >> diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h >> index 1003341..7b9e1d0 100644 >> --- a/arch/x86/include/asm/vmx.h >> +++ b/arch/x86/include/asm/vmx.h >> @@ -152,6 +152,7 @@ >> #define PIN_BASED_EXT_INTR_MASK 0x00000001 >> #define PIN_BASED_NMI_EXITING 0x00000008 >> #define PIN_BASED_VIRTUAL_NMIS 0x00000020 >> +#define PIN_BASED_POSTED_INTR 0x00000080 >> >> #define VM_EXIT_SAVE_DEBUG_CONTROLS 0x00000002 #define >> VM_EXIT_HOST_ADDR_SPACE_SIZE 0x00000200 @@ -174,6 +175,7 @@ >> /* VMCS Encodings */ enum vmcs_field { VIRTUAL_PROCESSOR_ID >> = 0x00000000, + POSTED_INTR_NV = 0x00000002, >> GUEST_ES_SELECTOR = 0x00000800, GUEST_CS_SELECTOR >> = 0x00000802, GUEST_SS_SELECTOR = 0x00000804, >> @@ -208,6 +210,8 @@ enum vmcs_field { VIRTUAL_APIC_PAGE_ADDR_HIGH >> = 0x00002013, APIC_ACCESS_ADDR = 0x00002014, >> APIC_ACCESS_ADDR_HIGH = 0x00002015, >> + POSTED_INTR_DESC_ADDR = 0x00002016, >> + POSTED_INTR_DESC_ADDR_HIGH = 0x00002017, >> EPT_POINTER = 0x0000201a, >> EPT_POINTER_HIGH = 0x0000201b, >> EOI_EXIT_BITMAP0 = 0x0000201c, >> diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S >> index b51b2c7..d06eea1 100644 
>> --- a/arch/x86/kernel/entry_64.S >> +++ b/arch/x86/kernel/entry_64.S >> @@ -1160,6 +1160,8 @@ apicinterrupt LOCAL_TIMER_VECTOR \ >> apic_timer_interrupt smp_apic_timer_interrupt >> apicinterrupt X86_PLATFORM_IPI_VECTOR \ >> x86_platform_ipi smp_x86_platform_ipi >> +apicinterrupt POSTED_INTR_VECTOR \ >> + posted_intr_ipi smp_posted_intr_ipi >> >> apicinterrupt THRESHOLD_APIC_VECTOR \ >> threshold_interrupt smp_threshold_interrupt >> diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c >> index e4595f1..781d324 100644 >> --- a/arch/x86/kernel/irq.c >> +++ b/arch/x86/kernel/irq.c >> @@ -22,6 +22,9 @@ atomic_t irq_err_count; >> >> /* Function pointer for generic interrupt vector handling */ >> void (*x86_platform_ipi_callback)(void) = NULL; >> +/* Function pointer for posted interrupt vector handling */ >> +void (*posted_intr_callback)(void) = NULL; >> +EXPORT_SYMBOL_GPL(posted_intr_callback); >> >> /* >> * 'what should we do if we get a hw irq event on an illegal vector'. >> @@ -228,6 +231,28 @@ void smp_x86_platform_ipi(struct pt_regs *regs) >> set_irq_regs(old_regs); >> } >> +/* >> + * Handler for POSTED_INTERRUPT_VECTOR. >> + */ >> +void smp_posted_intr_ipi(struct pt_regs *regs) >> +{ >> + struct pt_regs *old_regs = set_irq_regs(regs); >> + >> + ack_APIC_irq(); >> + >> + irq_enter(); >> + >> + exit_idle(); >> + >> + if (posted_intr_callback) >> + posted_intr_callback(); >> + > The callback does nothing. Drop it. 
> >> + irq_exit(); >> + >> + set_irq_regs(old_regs); >> +} >> + >> + >> EXPORT_SYMBOL_GPL(vector_used_by_percpu_irq); >> >> #ifdef CONFIG_HOTPLUG_CPU >> diff --git a/arch/x86/kernel/irqinit.c b/arch/x86/kernel/irqinit.c >> index 6e03b0d..d15ca4f 100644 >> --- a/arch/x86/kernel/irqinit.c >> +++ b/arch/x86/kernel/irqinit.c >> @@ -205,6 +205,8 @@ static void __init apic_intr_init(void) >> >> /* IPI for X86 platform specific use */ >> alloc_intr_gate(X86_PLATFORM_IPI_VECTOR, x86_platform_ipi); >> + /* IPI for posted interrupt use */ >> + alloc_intr_gate(POSTED_INTR_VECTOR, posted_intr_ipi); >> >> /* IPI vectors for APIC spurious and error interrupts */ >> alloc_intr_gate(SPURIOUS_APIC_VECTOR, spurious_interrupt); >> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c >> index 2109a6a..d660b9d 100644 >> --- a/arch/x86/kvm/lapic.c >> +++ b/arch/x86/kvm/lapic.c >> @@ -350,6 +350,7 @@ static inline int apic_find_highest_irr(struct kvm_lapic > *apic) >> if (!apic->irr_pending) >> return -1; >> + kvm_x86_ops->update_irr(apic->vcpu); >> result = apic_search_irr(apic); >> ASSERT(result == -1 || result >= 16); >> @@ -725,18 +726,25 @@ static int __apic_accept_irq(struct kvm_lapic *apic, int > delivery_mode, >> if (trig_mode) { >> apic_debug("level trig mode for vector %d", vector); >> apic_set_vector(vector, apic->regs + APIC_TMR); >> - } else >> + } else { >> apic_clear_vector(vector, apic->regs + APIC_TMR); >> - > Why doing pi only for edge triggered interrupts? > >> + if (kvm_x86_ops->has_posted_interrupt(vcpu)) { > Drop has_posted_interrupt(). Just call send_nv() directly. And give it > more descriptive name. > > >> + result = 1; >> + apic->irr_pending = true; > This is always true with vid anyway. 
> >> + kvm_x86_ops->send_nv(vcpu, vector); >> + goto out; >> + } >> + } >> result = !apic_test_and_set_irr(vector, apic); >> - trace_kvm_apic_accept_irq(vcpu->vcpu_id, delivery_mode, >> - trig_mode, vector, !result); >> if (!result) { >> if (trig_mode) >> apic_debug("level trig mode repeatedly for " >> "vector %d", vector); >> break; >> } >> +out: >> + trace_kvm_apic_accept_irq(vcpu->vcpu_id, delivery_mode, >> + trig_mode, vector, !result); > No trace if !result now. > >> >> kvm_make_request(KVM_REQ_EVENT, vcpu); >> kvm_vcpu_kick(vcpu); > What is the point of sending notification vector to vcpu is you kick it > out of a guest mode immediately after? > >> diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h >> index 10e3f66..0f8361e 100644 >> --- a/arch/x86/kvm/lapic.h >> +++ b/arch/x86/kvm/lapic.h >> @@ -42,6 +42,7 @@ int kvm_get_apic_interrupt(struct kvm_vcpu *vcpu); >> int kvm_cpu_has_extint(struct kvm_vcpu *v); int >> kvm_cpu_get_extint(struct kvm_vcpu *v); int >> kvm_apic_get_highest_irr(struct kvm_vcpu *vcpu); +void >> kvm_apic_update_irr(struct kvm_vcpu *vcpu, unsigned int *pir); void >> kvm_lapic_reset(struct kvm_vcpu *vcpu); u64 kvm_lapic_get_cr8(struct >> kvm_vcpu *vcpu); void kvm_lapic_set_tpr(struct kvm_vcpu *vcpu, >> unsigned long cr8); >> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c >> index 6b6bd03..07dbde6 100644 >> --- a/arch/x86/kvm/vmx.c >> +++ b/arch/x86/kvm/vmx.c >> @@ -31,6 +31,7 @@ >> #include <linux/ftrace_event.h> #include <linux/slab.h> #include >> <linux/tboot.h> +#include <linux/interrupt.h> #include >> "kvm_cache_regs.h" #include "x86.h" >> @@ -86,6 +87,8 @@ module_param(fasteoi, bool, S_IRUGO); >> static bool __read_mostly enable_apicv_reg_vid = 1; >> module_param(enable_apicv_reg_vid, bool, S_IRUGO); >> +static bool __read_mostly enable_apicv_pi = 1; >> +module_param(enable_apicv_pi, bool, S_IRUGO); >> /* >> * If nested=1, nested virtualization is supported, i.e., guests may use >> * VMX and be a hypervisor for its own 
guests. If nested=0, guests may not >> @@ -369,6 +372,35 @@ struct nested_vmx { >> struct page *apic_access_page; >> }; >> +#define POSTED_INTR_ON 0 >> +/* Posted-Interrupt Descriptor */ >> +struct pi_desc { >> + u32 pir[8]; /* Posted interrupt requested */ >> + union { >> + struct { >> + u8 on:1, >> + rsvd:7; >> + } control; >> + u32 rsvd[8]; >> + } u; >> +} __aligned(64); >> + >> +static void pi_clear_on(struct pi_desc *pi_desc) >> +{ >> + clear_bit(POSTED_INTR_ON, (unsigned long *)&pi_desc->u.control); >> +} >> + >> +static u8 pi_test_and_set_on(struct pi_desc *pi_desc) >> +{ >> + return test_and_set_bit(POSTED_INTR_ON, >> + (unsigned long *)&pi_desc->u.control); >> +} >> + >> +static void pi_set_pir(int vector, struct pi_desc *pi_desc) >> +{ >> + set_bit(vector, (unsigned long *)pi_desc->pir); >> +} >> + >> struct vcpu_vmx { struct kvm_vcpu vcpu; unsigned long >> host_rsp; @@ -435,6 +467,9 @@ struct vcpu_vmx { u8 >> eoi_exitmap_changed; u32 eoi_exit_bitmap[8]; >> + /* Posted interrupt descriptor */ >> + struct pi_desc *pi; >> + > I am not convinced we should try to save 46 bytes here when !pi. 
> >> /* Support for a guest hypervisor (nested VMX) */ >> struct nested_vmx nested; >> }; >> @@ -779,6 +814,11 @@ static inline bool > cpu_has_vmx_virtual_intr_delivery(void) >> SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY; >> } >> +static inline bool cpu_has_vmx_posted_intr(void) >> +{ >> + return vmcs_config.pin_based_exec_ctrl & PIN_BASED_POSTED_INTR; >> +} >> + >> static inline bool cpu_has_vmx_flexpriority(void) >> { >> return cpu_has_vmx_tpr_shadow() && >> @@ -2475,12 +2515,6 @@ static __init int setup_vmcs_config(struct > vmcs_config *vmcs_conf) >> u32 _vmexit_control = 0; >> u32 _vmentry_control = 0; >> - min = PIN_BASED_EXT_INTR_MASK | PIN_BASED_NMI_EXITING; >> - opt = PIN_BASED_VIRTUAL_NMIS; >> - if (adjust_vmx_controls(min, opt, MSR_IA32_VMX_PINBASED_CTLS, >> - &_pin_based_exec_control) < 0) >> - return -EIO; >> - >> min = CPU_BASED_HLT_EXITING | >> #ifdef CONFIG_X86_64 >> CPU_BASED_CR8_LOAD_EXITING | >> @@ -2554,6 +2588,17 @@ static __init int setup_vmcs_config(struct > vmcs_config *vmcs_conf) >> &_vmexit_control) < 0) >> return -EIO; >> + min = PIN_BASED_EXT_INTR_MASK | PIN_BASED_NMI_EXITING; >> + opt = PIN_BASED_VIRTUAL_NMIS | PIN_BASED_POSTED_INTR; >> + if (adjust_vmx_controls(min, opt, MSR_IA32_VMX_PINBASED_CTLS, >> + &_pin_based_exec_control) < 0) >> + return -EIO; >> + >> + if (!(_cpu_based_2nd_exec_control & >> + SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY) || >> + !(_vmexit_control & VM_EXIT_ACK_INTR_ON_EXIT)) >> + _pin_based_exec_control &= ~PIN_BASED_POSTED_INTR; >> + >> min = 0; opt = VM_ENTRY_LOAD_IA32_PAT; if (adjust_vmx_controls(min, >> opt, MSR_IA32_VMX_ENTRY_CTLS, @@ -2739,6 +2784,9 @@ static __init int >> hardware_setup(void) if (enable_apicv_reg_vid) >> kvm_x86_ops->update_cr8_intercept = NULL; >> + if (!cpu_has_vmx_posted_intr() || !enable_apicv_reg_vid) >> + enable_apicv_pi = 0; >> + >> if (nested) >> nested_vmx_setup_ctls_msrs(); >> @@ -3904,6 +3952,57 @@ static void ept_set_mmio_spte_mask(void) >> kvm_mmu_set_mmio_spte_mask(0xffull << 49 | 
0x6ull); >> } >> +static void pi_handler(void) +{ + ; +} + +static int >> vmx_has_posted_interrupt(struct kvm_vcpu *vcpu) +{ + return >> irqchip_in_kernel(vcpu->kvm) && enable_apicv_pi; +} + +static int >> vmx_send_nv(struct kvm_vcpu *vcpu, + int vector) +{ + struct vcpu_vmx >> *vmx = to_vmx(vcpu); + + pi_set_pir(vector, vmx->pi); + if >> (!pi_test_and_set_on(vmx->pi) && (vcpu->mode == IN_GUEST_MODE)) { >> + apic->send_IPI_mask(get_cpu_mask(vcpu->cpu), POSTED_INTR_VECTOR); >> + return 1; + } + return 0; +} > Return value is not checked by the caller. > >> + >> +static void vmx_update_irr(struct kvm_vcpu *vcpu) >> +{ >> + struct vcpu_vmx *vmx = to_vmx(vcpu); >> + struct kvm_lapic *apic = vcpu->arch.apic; >> + unsigned int i, old, new, val, irr_off; > Spaces instead of tabs. Not only here. Run through checkpatch. > >> + >> + if (!enable_apicv_pi) >> + return; >> + > Why no check for pit->on before syncing? > >> + for (i = 0; i <= 7; i++) { >> + if (vmx->pi->pir[i]) { >> + irr_off = APIC_IRR + i * 0x10; >> + do { >> + old = kvm_apic_get_reg(apic, irr_off); >> + new = old | vmx->pi->pir[i]; >> + val = cmpxchg((u32 *)(apic->regs + irr_off), old, new); >> + } while (unlikely (val != old)); >> + vmx->pi->pir[i] = 0; >> + } >> + } >> +} >> + >> +static void free_pi(struct vcpu_vmx *vmx) >> +{ >> + if (enable_apicv_pi) >> + kfree(vmx->pi); >> +} >> + >> /* >> * Sets up the vmcs for emulated real mode. 
>> */ >> @@ -3913,6 +4012,7 @@ static int vmx_vcpu_setup(struct vcpu_vmx *vmx) >> unsigned long a; #endif int i; + u32 pin_based_exec_ctrl = >> vmcs_config.pin_based_exec_ctrl; u32 vmexit_ctrl = >> vmcs_config.vmexit_ctrl; >> >> /* I/O */ @@ -3925,8 +4025,10 @@ static int vmx_vcpu_setup(struct >> vcpu_vmx *vmx) vmcs_write64(VMCS_LINK_POINTER, -1ull); /* 22.3.1.5 */ >> >> /* Control */ >> - vmcs_write32(PIN_BASED_VM_EXEC_CONTROL, >> - vmcs_config.pin_based_exec_ctrl); >> + if (!enable_apicv_pi) >> + pin_based_exec_ctrl &= ~PIN_BASED_POSTED_INTR; >> + >> + vmcs_write32(PIN_BASED_VM_EXEC_CONTROL, pin_based_exec_ctrl); >> >> vmcs_write32(CPU_BASED_VM_EXEC_CONTROL, > vmx_exec_control(vmx)); >> >> @@ -3944,6 +4046,13 @@ static int vmx_vcpu_setup(struct vcpu_vmx *vmx) >> vmcs_write16(GUEST_INTR_STATUS, 0); >> } >> + if (enable_apicv_pi) { >> + vmx->pi = kmalloc(sizeof(struct pi_desc), >> + GFP_KERNEL | __GFP_ZERO); > kzalloc(). > >> + vmcs_write64(POSTED_INTR_NV, POSTED_INTR_VECTOR); >> + vmcs_write64(POSTED_INTR_DESC_ADDR, __pa((vmx->pi))); >> + } >> + >> if (ple_gap) { vmcs_write32(PLE_GAP, ple_gap); >> vmcs_write32(PLE_WINDOW, ple_window); @@ -6220,6 +6329,8 @@ static >> void vmx_update_irq(struct kvm_vcpu *vcpu) >> vmx->eoi_exit_bitmap[index]); vmx->eoi_exitmap_changed = 0; } >> + if (enable_apicv_pi) >> + pi_clear_on(vmx->pi); >> } >> >> static void vmx_update_eoi_exitmap(struct kvm_vcpu *vcpu, >> @@ -6626,6 +6737,7 @@ static void vmx_free_vcpu(struct kvm_vcpu *vcpu) >> >> free_vpid(vmx); free_nested(vmx); + free_pi(vmx); >> free_loaded_vmcs(vmx->loaded_vmcs); kfree(vmx->guest_msrs); >> kvm_vcpu_uninit(vcpu); @@ -7520,8 +7632,11 @@ static struct >> kvm_x86_ops vmx_x86_ops = { .enable_irq_window = enable_irq_window, >> .update_cr8_intercept = update_cr8_intercept, >> .has_virtual_interrupt_delivery = vmx_has_virtual_interrupt_delivery, >> + .has_posted_interrupt = vmx_has_posted_interrupt, .update_irq = >> vmx_update_irq, .update_eoi_exitmap = vmx_update_eoi_exitmap, 
>> + .send_nv = vmx_send_nv, >> + .update_irr = vmx_update_irr, >> > SVM? > >> .set_tss_addr = vmx_set_tss_addr, .get_tdp_level = get_ept_level, @@ >> -7618,7 +7733,7 @@ static int __init vmx_init(void) /* SELF-IPI */ >> vmx_disable_intercept_for_msr_write(0x83f, false); } >> - >> + posted_intr_callback = pi_handler; >> if (enable_ept) { >> kvm_mmu_set_mask_ptes(0ull, >> (enable_ept_ad_bits) ? VMX_EPT_ACCESS_BIT : 0ull, >> -- >> 1.7.1 > > -- > Gleb. Best regards, Yang -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Sun, Feb 03, 2013 at 10:57:00PM -0200, Marcelo Tosatti wrote: > On Thu, Jan 31, 2013 at 03:55:56PM +0200, Gleb Natapov wrote: > > On Thu, Jan 31, 2013 at 11:44:43AM -0200, Marcelo Tosatti wrote: > > > On Thu, Jan 31, 2013 at 03:38:37PM +0200, Gleb Natapov wrote: > > > > On Thu, Jan 31, 2013 at 11:32:45AM -0200, Marcelo Tosatti wrote: > > > > > On Thu, Jan 31, 2013 at 11:43:48AM +0200, Gleb Natapov wrote: > > > > > > On Wed, Jan 30, 2013 at 09:03:11PM -0200, Marcelo Tosatti wrote: > > > > > > > Posted interrupt patch: > > > > > > > 2) Must move IN_GUEST_MODE assignment after local_irq_disable, in > > > > > > > vcpu_enter_guest function. Otherwise: > > > > > > > > > > > > > > cpu0 vcpu1<->cpu1 > > > > > > > > > > > > > > vcpu->mode = IN_GUEST_MODE > > > > > > > > > > > > > > if IN_GUEST_MODE == true > > > > > > > send IPI > > > > > > > local_irq_disable > > > > > > > > > > > > > > PIR not transferred to VIRR, misses interrupt. > > > > > > > > > > > > > cpu0 will set KVM_REQ_EVENT, so vmentry will be aborted after > > > > > > local_irq_disable() by ->requests check. > > > > > > > > > > Yes, but you don't want KVM_REQ_EVENT+kick. It defeats the purpose > > > > > of posted interrupts. You want > > > > > > > > > > if vcpu in guest mode > > > > > send posted interrupt IPI > > > > > else > > > > > KVM_REQ_EVENT+kick > > > > > > > > > I am thinking: > > > > > > > > set KVM_REQ_EVENT > > > > if pi is enabled > > > > send posted interrupt IPI > > > > else > > > > kick > > > > > > KVM_REQ_EVENT must be after sending posted interrupt IPI. Otherwise on > > > the vcpu entry side > > > > > > test_and_clear(KVM_REQ_EVENT) { > > > No bits set in PIR > > > } > > > > > It should be after updating PIR, but before sending posted interrupt > > IPI. Otherwise: > > > > cpu0 cpu1/vcpu > > > > KVM_REQ_EVENT is not set > > set pir > > send IPI > > irq_disable() > > ->request is empty. 
> > set KVM_REQ_EVENT
> >
> > That's the same sequence as with IRR update, KVM_REQ_EVENT and kick
> > today.
>
> Can only send IPI if vcpu->mode == IN_GUEST_MODE, which must be set
> after interrupt flag is cleared as noted.
>
> Also KVM_REQ_EVENT is processed outside of interrupt disabled region today.
But it is checked in the interrupt disabled section and vcpu entry is
aborted if an event is pending.

> Or maybe i don't get what you say... write a complete
> description?
>
I am saying that we do not need to move vcpu->mode = IN_GUEST_MODE to the
irq_disable() section to make things work. Just do:

  set bit in pir
  set KVM_REQ_EVENT
  if in guest mode do IPI

> > > What about item 4 below?
> >
> > That's for Yang to answer :)
>
> "If more than one interrupt is generated with the same vector number,
> the local APIC can set the bit for the vector both in the IRR and ISR.
> This means that for the Pentium 4 and Intel Xeon processors, the IRR
> and ISR can queue two interrupts for each interrupt vector: one in the
> IRR and one in the ISR. Any additional interrupts issued for the same
> interrupt vector are collapsed into the single bit in IRR"
>
> Which would mean KVM emulation differs... "Eventually 3 interrupts can
> be queued: one in IRR, one in ISR, one in PIR".
I do not think this is the case though. PIR will be folded into IRR
either during a guest entry or by the vcpu itself on receipt of the
notification vector IPI.

> Any example how software relies on such two-interrupts-queued-in-IRR/ISR behaviour?
Don't know about guests, but KVM relies on it to detect interrupt
coalescing. So if an interrupt is set in IRR but not in PIR, the
interrupt will not be reported as coalesced, but it will be coalesced
during the PIR->IRR merge.

--
	Gleb.
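The ordering being argued for here can be sketched as a toy single-threaded model (structure and function names below are hypothetical, for illustration only; real KVM uses atomic request bits and memory barriers, not plain booleans):

```c
#include <stdbool.h>
#include <assert.h>

/* Toy model of the proposed delivery ordering (hypothetical names). */
struct vcpu_model {
    bool pir;            /* posted-interrupt request bit */
    bool req_event;      /* stands in for KVM_REQ_EVENT */
    bool in_guest_mode;  /* stands in for vcpu->mode == IN_GUEST_MODE */
    bool ipi_sent;
};

/* Injector side: set bit in pir, set KVM_REQ_EVENT, IPI only if the
 * vcpu is already running in guest mode. */
static void deliver(struct vcpu_model *v)
{
    v->pir = true;
    v->req_event = true;
    if (v->in_guest_mode)
        v->ipi_sent = true;
}

/* Entry side, after local_irq_disable(): abort entry if a request is
 * pending, so a PIR bit posted before the request is never missed. */
static bool try_enter_guest(struct vcpu_model *v)
{
    if (v->req_event)
        return false;    /* entry aborted; PIR merged in software */
    v->in_guest_mode = true;
    return true;
}
```

The point of the ordering is that whichever side loses the race, either the IPI or the aborted entry picks up the posted bit.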
On Mon, Feb 04, 2013 at 11:55:53AM +0200, Gleb Natapov wrote: > On Sun, Feb 03, 2013 at 10:57:00PM -0200, Marcelo Tosatti wrote: > > On Thu, Jan 31, 2013 at 03:55:56PM +0200, Gleb Natapov wrote: > > > On Thu, Jan 31, 2013 at 11:44:43AM -0200, Marcelo Tosatti wrote: > > > > On Thu, Jan 31, 2013 at 03:38:37PM +0200, Gleb Natapov wrote: > > > > > On Thu, Jan 31, 2013 at 11:32:45AM -0200, Marcelo Tosatti wrote: > > > > > > On Thu, Jan 31, 2013 at 11:43:48AM +0200, Gleb Natapov wrote: > > > > > > > On Wed, Jan 30, 2013 at 09:03:11PM -0200, Marcelo Tosatti wrote: > > > > > > > > Posted interrupt patch: > > > > > > > > 2) Must move IN_GUEST_MODE assignment after local_irq_disable, in > > > > > > > > vcpu_enter_guest function. Otherwise: > > > > > > > > > > > > > > > > cpu0 vcpu1<->cpu1 > > > > > > > > > > > > > > > > vcpu->mode = IN_GUEST_MODE > > > > > > > > > > > > > > > > if IN_GUEST_MODE == true > > > > > > > > send IPI > > > > > > > > local_irq_disable > > > > > > > > > > > > > > > > PIR not transferred to VIRR, misses interrupt. > > > > > > > > > > > > > > > cpu0 will set KVM_REQ_EVENT, so vmentry will be aborted after > > > > > > > local_irq_disable() by ->requests check. > > > > > > > > > > > > Yes, but you don't want KVM_REQ_EVENT+kick. It defeats the purpose > > > > > > of posted interrupts. You want > > > > > > > > > > > > if vcpu in guest mode > > > > > > send posted interrupt IPI > > > > > > else > > > > > > KVM_REQ_EVENT+kick > > > > > > > > > > > I am thinking: > > > > > > > > > > set KVM_REQ_EVENT > > > > > if pi is enabled > > > > > send posted interrupt IPI > > > > > else > > > > > kick > > > > > > > > KVM_REQ_EVENT must be after sending posted interrupt IPI. Otherwise on > > > > the vcpu entry side > > > > > > > > test_and_clear(KVM_REQ_EVENT) { > > > > No bits set in PIR > > > > } > > > > > > > It should be after updating PIR, but before sending posted interrupt > > > IPI. 
Otherwise: > > > > > > cpu0 cpu1/vcpu > > > > > > KVM_REQ_EVENT is not set > > > set pir > > > send IPI > > > irq_disable() > > > ->request is empty. > > > set KVM_REQ_EVENT > > > > > > That's the same sequence as with IRR update, KVM_REQ_EVENT and kick > > > today. > > > > Can only send IPI if vcpu->mode == IN_GUEST_MODE, which must be set > > after interrupt flag is cleared as noted. > > > > Also KVM_REQ_EVENT is processed outside of interrupt disabled region today. > But it is checked in interrupt disabled section and vcpu entry is > aborted if event is pending. > > > > > Or maybe i don't get what you say... write a complete > > description? > > > I am saying that we do not need to move vcpu->mode = IN_GUEST_MODE to > irq_disable() section to make things work. Just do: > > set bit in pir > set KVM_REQ_EVENT > if in guest mode do IPI I see. Yeah, probably. > > > > What about item 4 below? > > > > > > > That's for Yang to answer :) > > > > "If more than one interrupt is generated with the same vector number, > > the local APIC can set the bit for the vector both in the IRR and ISR. > > This means that for the Pentium 4 and Intel Xeon processors, the IRR > > and ISR can queue two interrupts for each interrupt vector: one in the > > IRR and one in the ISR. Any additional interrupts issued for the same > > interrupt vector are collapsed into the single bit in IRR" > > > > Which would mean KVM emulation differs... "Eventually 3 interrupts can > > be queued: one in IRR, one in ISR, one in PIR". > I do not think this is the case though. PIR will be folded into IRR > either during a guest entry or by vcpu itself on receiving of > notification vector IPI. > > > > > Any example how software relies on such two-interrupts-queued-in-IRR/ISR behaviour? > Don't know about guests, but KVM relies on it to detect interrupt > coalescing. So if interrupt is set in IRR but not in PIR interrupt will > not be reported as coalesced, but it will be coalesced during PIR->IRR > merge. 
Yes, so:

1. IRR=1, ISR=0, PIR=0. Event: set_irq, coalesced=no.
2. IRR=0, ISR=1, PIR=0. Event: IRR->ISR transfer.
3. vcpu outside of guest mode.
4. IRR=1, ISR=1, PIR=0. Event: set_irq, coalesced=no.
5. vcpu enters guest mode.
6. IRR=1, ISR=1, PIR=1. Event: set_irq, coalesced=no.
7. HW transfers PIR into IRR.

set_irq return value at 7 is incorrect, interrupt event was _not_
queued.
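This sequence can be replayed with a toy one-vector model (hypothetical names; it only tracks the three bits under discussion, not real APIC behaviour) to show how the accounting goes wrong once PIR is involved:

```c
#include <stdbool.h>
#include <assert.h>

/* Toy model of one vector's IRR/ISR/PIR bits (illustration only). */
struct vec_state { bool irr, isr, pir; };

/* set_irq as discussed: outside guest mode it queues to IRR, in guest
 * mode it posts to PIR; the return value claims "queued" whenever the
 * bit it touched was previously clear. */
static bool set_irq(struct vec_state *s, bool in_guest_mode)
{
    bool *bit = in_guest_mode ? &s->pir : &s->irr;
    bool was_set = *bit;
    *bit = true;
    return !was_set;    /* true = reported as queued (not coalesced) */
}

/* Step 7: HW transfers PIR into IRR by ORing, collapsing duplicates. */
static void merge_pir(struct vec_state *s)
{
    if (s->pir) { s->irr = true; s->pir = false; }
}
```

Replaying steps 1-7 gives two "queued" reports after the IRR->ISR transfer, but only one IRR bit survives the merge, so one of the two was in fact coalesced.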
On Mon, Feb 04, 2013 at 12:43:45PM -0200, Marcelo Tosatti wrote:
> > > Any example how software relies on such two-interrupts-queued-in-IRR/ISR behaviour?
> > Don't know about guests, but KVM relies on it to detect interrupt
> > coalescing. So if interrupt is set in IRR but not in PIR interrupt will
> > not be reported as coalesced, but it will be coalesced during PIR->IRR
> > merge.
>
> Yes, so:
>
> 1. IRR=1, ISR=0, PIR=0. Event: set_irq, coalesced=no.
> 2. IRR=0, ISR=1, PIR=0. Event: IRR->ISR transfer.
> 3. vcpu outside of guest mode.
> 4. IRR=1, ISR=1, PIR=0. Event: set_irq, coalesced=no.
> 5. vcpu enters guest mode.
> 6. IRR=1, ISR=1, PIR=1. Event: set_irq, coalesced=no.
> 7. HW transfers PIR into IRR.
>
> set_irq return value at 7 is incorrect, interrupt event was _not_
> queued.
Not sure I understand the flow of events in your description correctly.
As I understand it, at 4 set_irq() will return an incorrect result.
Basically, when PIR is set to 1 while IRR already has 1 for the vector,
the value returned by set_irq() will be incorrect. Frankly, I do not see
how it can be fixed without a race given the present HW PIR design.

--
	Gleb.
On Mon, Feb 04, 2013 at 07:13:01PM +0200, Gleb Natapov wrote:
> On Mon, Feb 04, 2013 at 12:43:45PM -0200, Marcelo Tosatti wrote:
> > > > Any example how software relies on such two-interrupts-queued-in-IRR/ISR behaviour?
> > > Don't know about guests, but KVM relies on it to detect interrupt
> > > coalescing. So if interrupt is set in IRR but not in PIR interrupt will
> > > not be reported as coalesced, but it will be coalesced during PIR->IRR
> > > merge.
> >
> > Yes, so:
> >
> > 1. IRR=1, ISR=0, PIR=0. Event: set_irq, coalesced=no.
> > 2. IRR=0, ISR=1, PIR=0. Event: IRR->ISR transfer.
> > 3. vcpu outside of guest mode.
> > 4. IRR=1, ISR=1, PIR=0. Event: set_irq, coalesced=no.
> > 5. vcpu enters guest mode.
> > 6. IRR=1, ISR=1, PIR=1. Event: set_irq, coalesced=no.
> > 7. HW transfers PIR into IRR.
> >
> > set_irq return value at 7 is incorrect, interrupt event was _not_
> > queued.
> Not sure I understand the flow of events in your description correctly. As I
> understand it at 4 set_irq() will return incorrect result. Basically
> when PIR is set to 1 while IRR has 1 for the vector the value of
> set_irq() will be incorrect.

At 4 it has not been coalesced: it has been queued to IRR.
At 6 it has been coalesced: the PIR bit is merged into the IRR bit.

> Frankly I do not see how it can be fixed
> without any race with present HW PIR design.

At kvm_accept_apic_interrupt, check IRR before setting the PIR bit; if
IRR is already set, don't set PIR.
On Mon, Feb 04, 2013 at 05:59:52PM -0200, Marcelo Tosatti wrote: > On Mon, Feb 04, 2013 at 07:13:01PM +0200, Gleb Natapov wrote: > > On Mon, Feb 04, 2013 at 12:43:45PM -0200, Marcelo Tosatti wrote: > > > > > Any example how software relies on such two-interrupts-queued-in-IRR/ISR behaviour? > > > > Don't know about guests, but KVM relies on it to detect interrupt > > > > coalescing. So if interrupt is set in IRR but not in PIR interrupt will > > > > not be reported as coalesced, but it will be coalesced during PIR->IRR > > > > merge. > > > > > > Yes, so: > > > > > > 1. IRR=1, ISR=0, PIR=0. Event: set_irq, coalesced=no. > > > 2. IRR=0, ISR=1, PIR=0. Event: IRR->ISR transfer. > > > 3. vcpu outside of guest mode. > > > 4. IRR=1, ISR=1, PIR=0. Event: set_irq, coalesced=no. > > > 5. vcpu enters guest mode. > > > 6. IRR=1, ISR=1, PIR=1. Event: set_irq, coalesced=no. > > > 7. HW transfers PIR into IRR. > > > > > > set_irq return value at 7 is incorrect, interrupt event was _not_ > > > queued. > > Not sure I understand the flow of events in your description correctly. As I > > understand it at 4 set_irq() will return incorrect result. Basically > > when PIR is set to 1 while IRR has 1 for the vector the value of > > set_irq() will be incorrect. > > At 4 it has not been coalesced: it has been queued to IRR. > At 6 it has been coalesced: PIR bit merged into IRR bit. > > > Frankly I do not see how it can be fixed > > without any race with present HW PIR design. > > At kvm_accept_apic_interrupt, check IRR before setting PIR bit, if IRR > already set, don't set PIR. Or: apic_accept_interrupt() { 1. Read ORIG_PIR=PIR, ORIG_IRR=IRR. Never set IRR when HWAPIC enabled, even if outside of guest mode. 2. Set PIR and let HW or SW VM-entry transfer it to IRR. 3. set_irq return value: (ORIG_PIR or ORIG_IRR set). } Two or more concurrent set_irq can race with each other, though. Can either document the race or add a lock. 
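The return-value rule in this proposal can be sketched minimally (hypothetical names; this ignores the atomicity the real locked PIR accesses would need, which is exactly the concurrent-set_irq race noted above):

```c
#include <stdbool.h>
#include <assert.h>

struct apic_bits { bool irr, pir; };

/* Proposed accept path: snapshot both bits, always post to PIR (never
 * touch IRR directly), and report coalesced iff either snapshot bit
 * was already set. */
static bool apic_accept_interrupt(struct apic_bits *s)
{
    bool orig_pir = s->pir;
    bool orig_irr = s->irr;
    s->pir = true;      /* HW or SW VM-entry transfers this to IRR */
    return !(orig_pir || orig_irr);   /* true = queued, false = coalesced */
}
```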
Marcelo Tosatti wrote on 2013-02-05: > On Mon, Feb 04, 2013 at 05:59:52PM -0200, Marcelo Tosatti wrote: >> On Mon, Feb 04, 2013 at 07:13:01PM +0200, Gleb Natapov wrote: >>> On Mon, Feb 04, 2013 at 12:43:45PM -0200, Marcelo Tosatti wrote: >>>>>> Any example how software relies on such > two-interrupts-queued-in-IRR/ISR behaviour? >>>>> Don't know about guests, but KVM relies on it to detect interrupt >>>>> coalescing. So if interrupt is set in IRR but not in PIR interrupt will >>>>> not be reported as coalesced, but it will be coalesced during PIR->IRR >>>>> merge. >>>> >>>> Yes, so: >>>> >>>> 1. IRR=1, ISR=0, PIR=0. Event: set_irq, coalesced=no. >>>> 2. IRR=0, ISR=1, PIR=0. Event: IRR->ISR transfer. >>>> 3. vcpu outside of guest mode. >>>> 4. IRR=1, ISR=1, PIR=0. Event: set_irq, coalesced=no. >>>> 5. vcpu enters guest mode. >>>> 6. IRR=1, ISR=1, PIR=1. Event: set_irq, coalesced=no. >>>> 7. HW transfers PIR into IRR. >>>> >>>> set_irq return value at 7 is incorrect, interrupt event was _not_ >>>> queued. >>> Not sure I understand the flow of events in your description correctly. As I >>> understand it at 4 set_irq() will return incorrect result. Basically >>> when PIR is set to 1 while IRR has 1 for the vector the value of >>> set_irq() will be incorrect. >> >> At 4 it has not been coalesced: it has been queued to IRR. >> At 6 it has been coalesced: PIR bit merged into IRR bit. >> >>> Frankly I do not see how it can be fixed >>> without any race with present HW PIR design. >> >> At kvm_accept_apic_interrupt, check IRR before setting PIR bit, if IRR >> already set, don't set PIR. > > Or: > > apic_accept_interrupt() { > > 1. Read ORIG_PIR=PIR, ORIG_IRR=IRR. > Never set IRR when HWAPIC enabled, even if outside of guest mode. > 2. Set PIR and let HW or SW VM-entry transfer it to IRR. > 3. set_irq return value: (ORIG_PIR or ORIG_IRR set). > } > > Two or more concurrent set_irq can race with each other, though. Can > either document the race or add a lock. 
According to the SDM, software should not touch the IRR when the target
vcpu is running; instead, it should use locked accesses to the PIR. So
your solution may be wrong.

The only problem is step 6, but at that point there is already an
interrupt pending in IRR, which means the interrupt will be handled, not
lost. This case exists even on real hardware, so I think it should not
be a problem.

Best regards,
Yang
On Mon, Feb 04, 2013 at 06:47:30PM -0200, Marcelo Tosatti wrote: > On Mon, Feb 04, 2013 at 05:59:52PM -0200, Marcelo Tosatti wrote: > > On Mon, Feb 04, 2013 at 07:13:01PM +0200, Gleb Natapov wrote: > > > On Mon, Feb 04, 2013 at 12:43:45PM -0200, Marcelo Tosatti wrote: > > > > > > Any example how software relies on such two-interrupts-queued-in-IRR/ISR behaviour? > > > > > Don't know about guests, but KVM relies on it to detect interrupt > > > > > coalescing. So if interrupt is set in IRR but not in PIR interrupt will > > > > > not be reported as coalesced, but it will be coalesced during PIR->IRR > > > > > merge. > > > > > > > > Yes, so: > > > > > > > > 1. IRR=1, ISR=0, PIR=0. Event: set_irq, coalesced=no. > > > > 2. IRR=0, ISR=1, PIR=0. Event: IRR->ISR transfer. > > > > 3. vcpu outside of guest mode. > > > > 4. IRR=1, ISR=1, PIR=0. Event: set_irq, coalesced=no. > > > > 5. vcpu enters guest mode. > > > > 6. IRR=1, ISR=1, PIR=1. Event: set_irq, coalesced=no. > > > > 7. HW transfers PIR into IRR. > > > > > > > > set_irq return value at 7 is incorrect, interrupt event was _not_ > > > > queued. > > > Not sure I understand the flow of events in your description correctly. As I > > > understand it at 4 set_irq() will return incorrect result. Basically > > > when PIR is set to 1 while IRR has 1 for the vector the value of > > > set_irq() will be incorrect. > > > > At 4 it has not been coalesced: it has been queued to IRR. > > At 6 it has been coalesced: PIR bit merged into IRR bit. > > Yes, that's the case. > > > Frankly I do not see how it can be fixed > > > without any race with present HW PIR design. > > > > At kvm_accept_apic_interrupt, check IRR before setting PIR bit, if IRR > > already set, don't set PIR. Need to check both IRR and PIR. Something like that: apic_accept_interrupt() { if (PIR || IRR) return coalesced; else set PIR; } This has two problems. 
First is that an interrupt that can be delivered will not be (IRR is
cleared just after it was tested), but it will be reported as coalesced,
so this is a benign race. Second is that an interrupt may be reported as
delivered, but it will be coalesced (possible only with a self IPI with
the same vector):

Starting condition: PIR=0, IRR=0, vcpu is in guest mode

 io thread                |  vcpu
 accept_apic_interrupt()  |
 PIR and IRR are zero     |
 set PIR                  |
 return delivered         |
                          |  self IPI
                          |  set IRR
                          |  merge PIR to IRR (*)

At (*) the interrupt that was reported as delivered is coalesced.

> Or:
>
> apic_accept_interrupt() {
>
> 1. Read ORIG_PIR=PIR, ORIG_IRR=IRR.
>    Never set IRR when HWAPIC enabled, even if outside of guest mode.
> 2. Set PIR and let HW or SW VM-entry transfer it to IRR.
> 3. set_irq return value: (ORIG_PIR or ORIG_IRR set).
> }
>
This can report an interrupt as coalesced, but it will eventually be
delivered as a separate interrupt:

Starting condition: PIR=0, IRR=1, vcpu is in guest mode

 io thread                |  vcpu
 accept_apic_interrupt()  |
 ORIG_PIR=0, ORIG_IRR=1   |
                          |  EOI
                          |  clear IRR, set ISR
 set PIR                  |
 return coalesced         |
                          |  clear PIR, set IRR
                          |  EOI
                          |  clear IRR, set ISR (*)

At (*) the interrupt that was reported as coalesced is delivered.

So still no perfect solution, but the first one has much less serious
problems for our practical needs.

> Two or more concurrent set_irq can race with each other, though. Can
> either document the race or add a lock.
>

--
	Gleb.
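The second (self-IPI) case above can be replayed in a toy model (hypothetical names; a single-threaded replay of that interleaving, not real KVM code):

```c
#include <stdbool.h>
#include <assert.h>

struct pi_state { bool irr, pir; };

/* First approach from the discussion: report coalesced if either bit
 * is already set, otherwise post to PIR only. */
static bool accept_irq(struct pi_state *s)
{
    if (s->pir || s->irr)
        return false;   /* reported coalesced */
    s->pir = true;
    return true;        /* reported delivered */
}

/* PIR->IRR merge done by hardware or on the next VM entry. */
static void fold_pir(struct pi_state *s)
{
    if (s->pir) { s->irr = true; s->pir = false; }
}
```

Two interrupts are accepted (one via accept_irq, one via the guest's self IPI), yet only one IRR bit survives the fold, so the one reported as delivered was in fact coalesced.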
On Tue, Feb 05, 2013 at 05:57:14AM +0000, Zhang, Yang Z wrote: > Marcelo Tosatti wrote on 2013-02-05: > > On Mon, Feb 04, 2013 at 05:59:52PM -0200, Marcelo Tosatti wrote: > >> On Mon, Feb 04, 2013 at 07:13:01PM +0200, Gleb Natapov wrote: > >>> On Mon, Feb 04, 2013 at 12:43:45PM -0200, Marcelo Tosatti wrote: > >>>>>> Any example how software relies on such > > two-interrupts-queued-in-IRR/ISR behaviour? > >>>>> Don't know about guests, but KVM relies on it to detect interrupt > >>>>> coalescing. So if interrupt is set in IRR but not in PIR interrupt will > >>>>> not be reported as coalesced, but it will be coalesced during PIR->IRR > >>>>> merge. > >>>> > >>>> Yes, so: > >>>> > >>>> 1. IRR=1, ISR=0, PIR=0. Event: set_irq, coalesced=no. > >>>> 2. IRR=0, ISR=1, PIR=0. Event: IRR->ISR transfer. > >>>> 3. vcpu outside of guest mode. > >>>> 4. IRR=1, ISR=1, PIR=0. Event: set_irq, coalesced=no. > >>>> 5. vcpu enters guest mode. > >>>> 6. IRR=1, ISR=1, PIR=1. Event: set_irq, coalesced=no. > >>>> 7. HW transfers PIR into IRR. > >>>> > >>>> set_irq return value at 7 is incorrect, interrupt event was _not_ > >>>> queued. > >>> Not sure I understand the flow of events in your description correctly. As I > >>> understand it at 4 set_irq() will return incorrect result. Basically > >>> when PIR is set to 1 while IRR has 1 for the vector the value of > >>> set_irq() will be incorrect. > >> > >> At 4 it has not been coalesced: it has been queued to IRR. > >> At 6 it has been coalesced: PIR bit merged into IRR bit. > >> > >>> Frankly I do not see how it can be fixed > >>> without any race with present HW PIR design. > >> > >> At kvm_accept_apic_interrupt, check IRR before setting PIR bit, if IRR > >> already set, don't set PIR. > > > > Or: > > > > apic_accept_interrupt() { > > > > 1. Read ORIG_PIR=PIR, ORIG_IRR=IRR. > > Never set IRR when HWAPIC enabled, even if outside of guest mode. > > 2. Set PIR and let HW or SW VM-entry transfer it to IRR. > > 3. 
set_irq return value: (ORIG_PIR or ORIG_IRR set).
> > }
> >
> > Two or more concurrent set_irq can race with each other, though. Can
> > either document the race or add a lock.
> According the SDM, software should not touch the IRR when target vcpu is
> running. Instead, use locked way to access PIR. So your solution may wrong.
Then your apicv patches are broken, because they do exactly that.

> The only problem is the step 6, but at that point, there already an
> interrupt pending in IRR. This means the interrupt will be handled not
> lost. And even in real hardware, this case do exist. So I think it
> should not be a problem.
>
This is not the problem we are trying to fix. Sometimes we need to make
sure that each interrupt a device generates results in an interrupt
handler invocation in the guest. If an interrupt is coalesced (meaning
it will not correspond to a separate invocation of the guest interrupt
handler) it needs to be re-injected. With PIR, detection of such a
condition is broken.

--
	Gleb.
Gleb Natapov wrote on 2013-02-05: > On Tue, Feb 05, 2013 at 05:57:14AM +0000, Zhang, Yang Z wrote: >> Marcelo Tosatti wrote on 2013-02-05: >>> On Mon, Feb 04, 2013 at 05:59:52PM -0200, Marcelo Tosatti wrote: >>>> On Mon, Feb 04, 2013 at 07:13:01PM +0200, Gleb Natapov wrote: >>>>> On Mon, Feb 04, 2013 at 12:43:45PM -0200, Marcelo Tosatti wrote: >>>>>>>> Any example how software relies on such >>> two-interrupts-queued-in-IRR/ISR behaviour? >>>>>>> Don't know about guests, but KVM relies on it to detect interrupt >>>>>>> coalescing. So if interrupt is set in IRR but not in PIR interrupt will >>>>>>> not be reported as coalesced, but it will be coalesced during PIR->IRR >>>>>>> merge. >>>>>> >>>>>> Yes, so: >>>>>> >>>>>> 1. IRR=1, ISR=0, PIR=0. Event: set_irq, coalesced=no. >>>>>> 2. IRR=0, ISR=1, PIR=0. Event: IRR->ISR transfer. >>>>>> 3. vcpu outside of guest mode. >>>>>> 4. IRR=1, ISR=1, PIR=0. Event: set_irq, coalesced=no. >>>>>> 5. vcpu enters guest mode. >>>>>> 6. IRR=1, ISR=1, PIR=1. Event: set_irq, coalesced=no. >>>>>> 7. HW transfers PIR into IRR. >>>>>> >>>>>> set_irq return value at 7 is incorrect, interrupt event was _not_ >>>>>> queued. >>>>> Not sure I understand the flow of events in your description correctly. As I >>>>> understand it at 4 set_irq() will return incorrect result. Basically >>>>> when PIR is set to 1 while IRR has 1 for the vector the value of >>>>> set_irq() will be incorrect. >>>> >>>> At 4 it has not been coalesced: it has been queued to IRR. >>>> At 6 it has been coalesced: PIR bit merged into IRR bit. >>>> >>>>> Frankly I do not see how it can be fixed >>>>> without any race with present HW PIR design. >>>> >>>> At kvm_accept_apic_interrupt, check IRR before setting PIR bit, if IRR >>>> already set, don't set PIR. >>> >>> Or: >>> >>> apic_accept_interrupt() { >>> >>> 1. Read ORIG_PIR=PIR, ORIG_IRR=IRR. >>> Never set IRR when HWAPIC enabled, even if outside of guest mode. >>> 2. Set PIR and let HW or SW VM-entry transfer it to IRR. 
>>> 3. set_irq return value: (ORIG_PIR or ORIG_IRR set). >>> } >>> >>> Two or more concurrent set_irq can race with each other, though. Can >>> either document the race or add a lock. >> According the SDM, software should not touch the IRR when target vcpu is > running. Instead, use locked way to access PIR. So your solution may wrong. > Then your apicv patches are broken, because they do exactly that. Which code is broken? >> The only problem is the step 6, but at that point, there already an interrupt > pending in IRR. This means the interrupt will be handled not lost. And even in real > hardware, this case do exist. So I think it should not be a problem. >> > This is not the problem we are trying to fix. Sometimes we need to make > sure that each interrupt device generates result in an interrupt handler > invocation in a guest. If interrupt is coalesced (meaning it will not > correspond to separate invocation of a guest interrupt handler) it needs > to be re-injected. With PIR detection of such condition is broken. Best regards, Yang -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, Feb 05, 2013 at 10:35:55AM +0000, Zhang, Yang Z wrote: > Gleb Natapov wrote on 2013-02-05: > > On Tue, Feb 05, 2013 at 05:57:14AM +0000, Zhang, Yang Z wrote: > >> Marcelo Tosatti wrote on 2013-02-05: > >>> On Mon, Feb 04, 2013 at 05:59:52PM -0200, Marcelo Tosatti wrote: > >>>> On Mon, Feb 04, 2013 at 07:13:01PM +0200, Gleb Natapov wrote: > >>>>> On Mon, Feb 04, 2013 at 12:43:45PM -0200, Marcelo Tosatti wrote: > >>>>>>>> Any example how software relies on such > >>> two-interrupts-queued-in-IRR/ISR behaviour? > >>>>>>> Don't know about guests, but KVM relies on it to detect interrupt > >>>>>>> coalescing. So if interrupt is set in IRR but not in PIR interrupt will > >>>>>>> not be reported as coalesced, but it will be coalesced during PIR->IRR > >>>>>>> merge. > >>>>>> > >>>>>> Yes, so: > >>>>>> > >>>>>> 1. IRR=1, ISR=0, PIR=0. Event: set_irq, coalesced=no. > >>>>>> 2. IRR=0, ISR=1, PIR=0. Event: IRR->ISR transfer. > >>>>>> 3. vcpu outside of guest mode. > >>>>>> 4. IRR=1, ISR=1, PIR=0. Event: set_irq, coalesced=no. > >>>>>> 5. vcpu enters guest mode. > >>>>>> 6. IRR=1, ISR=1, PIR=1. Event: set_irq, coalesced=no. > >>>>>> 7. HW transfers PIR into IRR. > >>>>>> > >>>>>> set_irq return value at 7 is incorrect, interrupt event was _not_ > >>>>>> queued. > >>>>> Not sure I understand the flow of events in your description correctly. As I > >>>>> understand it at 4 set_irq() will return incorrect result. Basically > >>>>> when PIR is set to 1 while IRR has 1 for the vector the value of > >>>>> set_irq() will be incorrect. > >>>> > >>>> At 4 it has not been coalesced: it has been queued to IRR. > >>>> At 6 it has been coalesced: PIR bit merged into IRR bit. > >>>> > >>>>> Frankly I do not see how it can be fixed > >>>>> without any race with present HW PIR design. > >>>> > >>>> At kvm_accept_apic_interrupt, check IRR before setting PIR bit, if IRR > >>>> already set, don't set PIR. > >>> > >>> Or: > >>> > >>> apic_accept_interrupt() { > >>> > >>> 1. 
Read ORIG_PIR=PIR, ORIG_IRR=IRR. > >>> Never set IRR when HWAPIC enabled, even if outside of guest mode. > >>> 2. Set PIR and let HW or SW VM-entry transfer it to IRR. > >>> 3. set_irq return value: (ORIG_PIR or ORIG_IRR set). > >>> } > >>> > >>> Two or more concurrent set_irq can race with each other, though. Can > >>> either document the race or add a lock. > >> According the SDM, software should not touch the IRR when target vcpu is > > running. Instead, use locked way to access PIR. So your solution may wrong. > > Then your apicv patches are broken, because they do exactly that. > Which code is broken? > The one that updates IRR directly on the apic page. > >> The only problem is the step 6, but at that point, there already an interrupt > > pending in IRR. This means the interrupt will be handled not lost. And even in real > > hardware, this case do exist. So I think it should not be a problem. > >> > > This is not the problem we are trying to fix. Sometimes we need to make > > sure that each interrupt device generates result in an interrupt handler > > invocation in a guest. If interrupt is coalesced (meaning it will not > > correspond to separate invocation of a guest interrupt handler) it needs > > to be re-injected. With PIR detection of such condition is broken. > > > Best regards, > Yang -- Gleb. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Gleb Natapov wrote on 2013-02-05: > On Tue, Feb 05, 2013 at 10:35:55AM +0000, Zhang, Yang Z wrote: >> Gleb Natapov wrote on 2013-02-05: >>> On Tue, Feb 05, 2013 at 05:57:14AM +0000, Zhang, Yang Z wrote: >>>> Marcelo Tosatti wrote on 2013-02-05: >>>>> On Mon, Feb 04, 2013 at 05:59:52PM -0200, Marcelo Tosatti wrote: >>>>>> On Mon, Feb 04, 2013 at 07:13:01PM +0200, Gleb Natapov wrote: >>>>>>> On Mon, Feb 04, 2013 at 12:43:45PM -0200, Marcelo Tosatti wrote: >>>>>>>>>> Any example how software relies on such >>>>> two-interrupts-queued-in-IRR/ISR behaviour? >>>>>>>>> Don't know about guests, but KVM relies on it to detect interrupt >>>>>>>>> coalescing. So if interrupt is set in IRR but not in PIR interrupt will >>>>>>>>> not be reported as coalesced, but it will be coalesced during PIR->IRR >>>>>>>>> merge. >>>>>>>> >>>>>>>> Yes, so: >>>>>>>> >>>>>>>> 1. IRR=1, ISR=0, PIR=0. Event: set_irq, coalesced=no. >>>>>>>> 2. IRR=0, ISR=1, PIR=0. Event: IRR->ISR transfer. >>>>>>>> 3. vcpu outside of guest mode. >>>>>>>> 4. IRR=1, ISR=1, PIR=0. Event: set_irq, coalesced=no. >>>>>>>> 5. vcpu enters guest mode. >>>>>>>> 6. IRR=1, ISR=1, PIR=1. Event: set_irq, coalesced=no. >>>>>>>> 7. HW transfers PIR into IRR. >>>>>>>> >>>>>>>> set_irq return value at 7 is incorrect, interrupt event was _not_ >>>>>>>> queued. >>>>>>> Not sure I understand the flow of events in your description >>>>>>> correctly. As I understand it at 4 set_irq() will return incorrect >>>>>>> result. Basically when PIR is set to 1 while IRR has 1 for the >>>>>>> vector the value of set_irq() will be incorrect. >>>>>> >>>>>> At 4 it has not been coalesced: it has been queued to IRR. >>>>>> At 6 it has been coalesced: PIR bit merged into IRR bit. >>>>>> >>>>>>> Frankly I do not see how it can be fixed >>>>>>> without any race with present HW PIR design. >>>>>> >>>>>> At kvm_accept_apic_interrupt, check IRR before setting PIR bit, if IRR >>>>>> already set, don't set PIR. 
>>>>> >>>>> Or: >>>>> >>>>> apic_accept_interrupt() { >>>>> >>>>> 1. Read ORIG_PIR=PIR, ORIG_IRR=IRR. >>>>> Never set IRR when HWAPIC enabled, even if outside of guest mode. >>>>> 2. Set PIR and let HW or SW VM-entry transfer it to IRR. >>>>> 3. set_irq return value: (ORIG_PIR or ORIG_IRR set). >>>>> } >>>>> >>>>> Two or more concurrent set_irq can race with each other, though. Can >>>>> either document the race or add a lock. >>>> According the SDM, software should not touch the IRR when target vcpu is >>> running. Instead, use locked way to access PIR. So your solution may wrong. >>> Then your apicv patches are broken, because they do exactly that. >> Which code is broken? >> > The one that updates IRR directly on the apic page. No, all the updates are ensuring the target vcpu is not running. So it's safe to touch IRR. >>>> The only problem is the step 6, but at that point, there already an interrupt >>> pending in IRR. This means the interrupt will be handled not lost. And >>> even in real hardware, this case do exist. So I think it should not be >>> a problem. >>>> >>> This is not the problem we are trying to fix. Sometimes we need to make >>> sure that each interrupt device generates result in an interrupt handler >>> invocation in a guest. If interrupt is coalesced (meaning it will not >>> correspond to separate invocation of a guest interrupt handler) it needs >>> to be re-injected. With PIR detection of such condition is broken. >> >> >> Best regards, >> Yang > > -- > Gleb. Best regards, Yang -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, Feb 05, 2013 at 10:58:28AM +0000, Zhang, Yang Z wrote: > Gleb Natapov wrote on 2013-02-05: > > On Tue, Feb 05, 2013 at 10:35:55AM +0000, Zhang, Yang Z wrote: > >> Gleb Natapov wrote on 2013-02-05: > >>> On Tue, Feb 05, 2013 at 05:57:14AM +0000, Zhang, Yang Z wrote: > >>>> Marcelo Tosatti wrote on 2013-02-05: > >>>>> On Mon, Feb 04, 2013 at 05:59:52PM -0200, Marcelo Tosatti wrote: > >>>>>> On Mon, Feb 04, 2013 at 07:13:01PM +0200, Gleb Natapov wrote: > >>>>>>> On Mon, Feb 04, 2013 at 12:43:45PM -0200, Marcelo Tosatti wrote: > >>>>>>>>>> Any example how software relies on such > >>>>> two-interrupts-queued-in-IRR/ISR behaviour? > >>>>>>>>> Don't know about guests, but KVM relies on it to detect interrupt > >>>>>>>>> coalescing. So if interrupt is set in IRR but not in PIR interrupt will > >>>>>>>>> not be reported as coalesced, but it will be coalesced during PIR->IRR > >>>>>>>>> merge. > >>>>>>>> > >>>>>>>> Yes, so: > >>>>>>>> > >>>>>>>> 1. IRR=1, ISR=0, PIR=0. Event: set_irq, coalesced=no. > >>>>>>>> 2. IRR=0, ISR=1, PIR=0. Event: IRR->ISR transfer. > >>>>>>>> 3. vcpu outside of guest mode. > >>>>>>>> 4. IRR=1, ISR=1, PIR=0. Event: set_irq, coalesced=no. > >>>>>>>> 5. vcpu enters guest mode. > >>>>>>>> 6. IRR=1, ISR=1, PIR=1. Event: set_irq, coalesced=no. > >>>>>>>> 7. HW transfers PIR into IRR. > >>>>>>>> > >>>>>>>> set_irq return value at 7 is incorrect, interrupt event was _not_ > >>>>>>>> queued. > >>>>>>> Not sure I understand the flow of events in your description > >>>>>>> correctly. As I understand it at 4 set_irq() will return incorrect > >>>>>>> result. Basically when PIR is set to 1 while IRR has 1 for the > >>>>>>> vector the value of set_irq() will be incorrect. > >>>>>> > >>>>>> At 4 it has not been coalesced: it has been queued to IRR. > >>>>>> At 6 it has been coalesced: PIR bit merged into IRR bit. > >>>>>> > >>>>>>> Frankly I do not see how it can be fixed > >>>>>>> without any race with present HW PIR design. 
> >>>>>> > >>>>>> At kvm_accept_apic_interrupt, check IRR before setting PIR bit, if IRR > >>>>>> already set, don't set PIR. > >>>>> > >>>>> Or: > >>>>> > >>>>> apic_accept_interrupt() { > >>>>> > >>>>> 1. Read ORIG_PIR=PIR, ORIG_IRR=IRR. > >>>>> Never set IRR when HWAPIC enabled, even if outside of guest mode. > >>>>> 2. Set PIR and let HW or SW VM-entry transfer it to IRR. > >>>>> 3. set_irq return value: (ORIG_PIR or ORIG_IRR set). > >>>>> } > >>>>> > >>>>> Two or more concurrent set_irq can race with each other, though. Can > >>>>> either document the race or add a lock. > >>>> According the SDM, software should not touch the IRR when target vcpu is > >>> running. Instead, use locked way to access PIR. So your solution may wrong. > >>> Then your apicv patches are broken, because they do exactly that. > >> Which code is broken? > >> > > The one that updates IRR directly on the apic page. > No, all the updates are ensuring the target vcpu is not running. So it's safe to touch IRR. > Not at all. Read the code. -- Gleb. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Gleb Natapov wrote on 2013-02-05: > On Tue, Feb 05, 2013 at 10:58:28AM +0000, Zhang, Yang Z wrote: >> Gleb Natapov wrote on 2013-02-05: >>> On Tue, Feb 05, 2013 at 10:35:55AM +0000, Zhang, Yang Z wrote: >>>> Gleb Natapov wrote on 2013-02-05: >>>>> On Tue, Feb 05, 2013 at 05:57:14AM +0000, Zhang, Yang Z wrote: >>>>>> Marcelo Tosatti wrote on 2013-02-05: >>>>>>> On Mon, Feb 04, 2013 at 05:59:52PM -0200, Marcelo Tosatti wrote: >>>>>>>> On Mon, Feb 04, 2013 at 07:13:01PM +0200, Gleb Natapov wrote: >>>>>>>>> On Mon, Feb 04, 2013 at 12:43:45PM -0200, Marcelo Tosatti wrote: >>>>>>>>>>>> Any example how software relies on such >>>>>>> two-interrupts-queued-in-IRR/ISR behaviour? >>>>>>>>>>> Don't know about guests, but KVM relies on it to detect >>>>>>>>>>> interrupt coalescing. So if interrupt is set in IRR but not in >>>>>>>>>>> PIR interrupt will not be reported as coalesced, but it will >>>>>>>>>>> be coalesced during PIR->IRR merge. >>>>>>>>>> >>>>>>>>>> Yes, so: >>>>>>>>>> >>>>>>>>>> 1. IRR=1, ISR=0, PIR=0. Event: set_irq, coalesced=no. >>>>>>>>>> 2. IRR=0, ISR=1, PIR=0. Event: IRR->ISR transfer. >>>>>>>>>> 3. vcpu outside of guest mode. >>>>>>>>>> 4. IRR=1, ISR=1, PIR=0. Event: set_irq, coalesced=no. >>>>>>>>>> 5. vcpu enters guest mode. >>>>>>>>>> 6. IRR=1, ISR=1, PIR=1. Event: set_irq, coalesced=no. >>>>>>>>>> 7. HW transfers PIR into IRR. >>>>>>>>>> >>>>>>>>>> set_irq return value at 7 is incorrect, interrupt event was _not_ >>>>>>>>>> queued. >>>>>>>>> Not sure I understand the flow of events in your description >>>>>>>>> correctly. As I understand it at 4 set_irq() will return incorrect >>>>>>>>> result. Basically when PIR is set to 1 while IRR has 1 for the >>>>>>>>> vector the value of set_irq() will be incorrect. >>>>>>>> >>>>>>>> At 4 it has not been coalesced: it has been queued to IRR. >>>>>>>> At 6 it has been coalesced: PIR bit merged into IRR bit. 
>>>>>>>> >>>>>>>>> Frankly I do not see how it can be fixed >>>>>>>>> without any race with present HW PIR design. >>>>>>>> >>>>>>>> At kvm_accept_apic_interrupt, check IRR before setting PIR bit, >>>>>>>> if IRR already set, don't set PIR. >>>>>>>> >>>>>>>> Or: >>>>>>>> >>>>>>>> apic_accept_interrupt() { >>>>>>>> >>>>>>>> 1. Read ORIG_PIR=PIR, ORIG_IRR=IRR. >>>>>>>> Never set IRR when HWAPIC enabled, even if outside of guest mode. >>>>>>>> 2. Set PIR and let HW or SW VM-entry transfer it to IRR. >>>>>>>> 3. set_irq return value: (ORIG_PIR or ORIG_IRR set). >>>>>>>> } >>>>>>>> >>>>>>>> Two or more concurrent set_irq can race with each other, though. Can >>>>>>>> either document the race or add a lock. >>>>>>> According the SDM, software should not touch the IRR when target vcpu > is >>>>> running. Instead, use locked way to access PIR. So your solution may >>>>> wrong. Then your apicv patches are broken, because they do exactly >>>>> that. >>>> Which code is broken? >>>> >>> The one that updates IRR directly on the apic page. >> No, all the updates are ensuring the target vcpu is not running. So >> it's safe to touch IRR. >> > Not at all. Read the code. Sorry, I still cannot figure out which code is wrong. All the places that call sync_pir_to_irr() are on the target vcpu. Can you point out the code? Thanks. Best regards, Yang
On Tue, Feb 05, 2013 at 01:26:42PM +0000, Zhang, Yang Z wrote: > Gleb Natapov wrote on 2013-02-05: > > On Tue, Feb 05, 2013 at 10:58:28AM +0000, Zhang, Yang Z wrote: > >> Gleb Natapov wrote on 2013-02-05: > >>> On Tue, Feb 05, 2013 at 10:35:55AM +0000, Zhang, Yang Z wrote: > >>>> Gleb Natapov wrote on 2013-02-05: > >>>>> On Tue, Feb 05, 2013 at 05:57:14AM +0000, Zhang, Yang Z wrote: > >>>>>> Marcelo Tosatti wrote on 2013-02-05: > >>>>>>> On Mon, Feb 04, 2013 at 05:59:52PM -0200, Marcelo Tosatti wrote: > >>>>>>>> On Mon, Feb 04, 2013 at 07:13:01PM +0200, Gleb Natapov wrote: > >>>>>>>>> On Mon, Feb 04, 2013 at 12:43:45PM -0200, Marcelo Tosatti wrote: > >>>>>>>>>>>> Any example how software relies on such > >>>>>>> two-interrupts-queued-in-IRR/ISR behaviour? > >>>>>>>>>>> Don't know about guests, but KVM relies on it to detect > >>>>>>>>>>> interrupt coalescing. So if interrupt is set in IRR but not in > >>>>>>>>>>> PIR interrupt will not be reported as coalesced, but it will > >>>>>>>>>>> be coalesced during PIR->IRR merge. > >>>>>>>>>> > >>>>>>>>>> Yes, so: > >>>>>>>>>> > >>>>>>>>>> 1. IRR=1, ISR=0, PIR=0. Event: set_irq, coalesced=no. > >>>>>>>>>> 2. IRR=0, ISR=1, PIR=0. Event: IRR->ISR transfer. > >>>>>>>>>> 3. vcpu outside of guest mode. > >>>>>>>>>> 4. IRR=1, ISR=1, PIR=0. Event: set_irq, coalesced=no. > >>>>>>>>>> 5. vcpu enters guest mode. > >>>>>>>>>> 6. IRR=1, ISR=1, PIR=1. Event: set_irq, coalesced=no. > >>>>>>>>>> 7. HW transfers PIR into IRR. > >>>>>>>>>> > >>>>>>>>>> set_irq return value at 7 is incorrect, interrupt event was _not_ > >>>>>>>>>> queued. > >>>>>>>>> Not sure I understand the flow of events in your description > >>>>>>>>> correctly. As I understand it at 4 set_irq() will return incorrect > >>>>>>>>> result. Basically when PIR is set to 1 while IRR has 1 for the > >>>>>>>>> vector the value of set_irq() will be incorrect. > >>>>>>>> > >>>>>>>> At 4 it has not been coalesced: it has been queued to IRR. 
> >>>>>>>> At 6 it has been coalesced: PIR bit merged into IRR bit. > >>>>>>>> > >>>>>>>>> Frankly I do not see how it can be fixed > >>>>>>>>> without any race with present HW PIR design. > >>>>>>>> > >>>>>>>> At kvm_accept_apic_interrupt, check IRR before setting PIR bit, if IRR > >>>>>>>> already set, don't set PIR. > >>>>>>> > >>>>>>> Or: > >>>>>>> > >>>>>>> apic_accept_interrupt() { > >>>>>>> > >>>>>>> 1. Read ORIG_PIR=PIR, ORIG_IRR=IRR. > >>>>>>> Never set IRR when HWAPIC enabled, even if outside of guest mode. > >>>>>>> 2. Set PIR and let HW or SW VM-entry transfer it to IRR. > >>>>>>> 3. set_irq return value: (ORIG_PIR or ORIG_IRR set). > >>>>>>> } > >>>>>>> > >>>>>>> Two or more concurrent set_irq can race with each other, though. Can > >>>>>>> either document the race or add a lock. > >>>>>> According the SDM, software should not touch the IRR when target > > vcpu > >>> is > >>>>> running. Instead, use locked way to access PIR. So your solution may > >>>>> wrong. Then your apicv patches are broken, because they do exactly > >>>>> that. > >>>> Which code is broken? > >>>> > >>> The one that updates IRR directly on the apic page. > >> No, all the updates are ensuring the target vcpu is not running. So > >> it's safe to touch IRR. > >> > > Not at all. Read the code. > Sorry. I still cannot figure out which code is wrong. All the places call sync_pir_to_irr() are on target vcpu. > Can you point out the code? Thanks. > I am talking about the vapic patches which are already in, not the pir patches. -- Gleb.
Gleb Natapov wrote on 2013-02-05: > On Tue, Feb 05, 2013 at 01:26:42PM +0000, Zhang, Yang Z wrote: >> Gleb Natapov wrote on 2013-02-05: >>> On Tue, Feb 05, 2013 at 10:58:28AM +0000, Zhang, Yang Z wrote: >>>> Gleb Natapov wrote on 2013-02-05: >>>>> On Tue, Feb 05, 2013 at 10:35:55AM +0000, Zhang, Yang Z wrote: >>>>>> Gleb Natapov wrote on 2013-02-05: >>>>>>> On Tue, Feb 05, 2013 at 05:57:14AM +0000, Zhang, Yang Z wrote: >>>>>>>> Marcelo Tosatti wrote on 2013-02-05: >>>>>>>>> On Mon, Feb 04, 2013 at 05:59:52PM -0200, Marcelo Tosatti wrote: >>>>>>>>>> On Mon, Feb 04, 2013 at 07:13:01PM +0200, Gleb Natapov wrote: >>>>>>>>>>> On Mon, Feb 04, 2013 at 12:43:45PM -0200, Marcelo Tosatti wrote: >>>>>>>>>>>>>> Any example how software relies on such >>>>>>>>> two-interrupts-queued-in-IRR/ISR behaviour? >>>>>>>>>>>>> Don't know about guests, but KVM relies on it to detect >>>>>>>>>>>>> interrupt coalescing. So if interrupt is set in IRR but not in >>>>>>>>>>>>> PIR interrupt will not be reported as coalesced, but it will >>>>>>>>>>>>> be coalesced during PIR->IRR merge. >>>>>>>>>>>> >>>>>>>>>>>> Yes, so: >>>>>>>>>>>> >>>>>>>>>>>> 1. IRR=1, ISR=0, PIR=0. Event: set_irq, coalesced=no. >>>>>>>>>>>> 2. IRR=0, ISR=1, PIR=0. Event: IRR->ISR transfer. >>>>>>>>>>>> 3. vcpu outside of guest mode. >>>>>>>>>>>> 4. IRR=1, ISR=1, PIR=0. Event: set_irq, coalesced=no. >>>>>>>>>>>> 5. vcpu enters guest mode. >>>>>>>>>>>> 6. IRR=1, ISR=1, PIR=1. Event: set_irq, coalesced=no. >>>>>>>>>>>> 7. HW transfers PIR into IRR. >>>>>>>>>>>> >>>>>>>>>>>> set_irq return value at 7 is incorrect, interrupt event was _not_ >>>>>>>>>>>> queued. >>>>>>>>>>> Not sure I understand the flow of events in your description >>>>>>>>>>> correctly. As I understand it at 4 set_irq() will return incorrect >>>>>>>>>>> result. Basically when PIR is set to 1 while IRR has 1 for the >>>>>>>>>>> vector the value of set_irq() will be incorrect. >>>>>>>>>> >>>>>>>>>> At 4 it has not been coalesced: it has been queued to IRR. 
>>>>>>>>>> At 6 it has been coalesced: PIR bit merged into IRR bit. >>>>>>>>>> >>>>>>>>>>> Frankly I do not see how it can be fixed >>>>>>>>>>> without any race with present HW PIR design. >>>>>>>>>> >>>>>>>>>> At kvm_accept_apic_interrupt, check IRR before setting PIR bit, >>>>>>>>>> if IRR already set, don't set PIR. >>>>>>>>> >>>>>>>>> Or: >>>>>>>>> >>>>>>>>> apic_accept_interrupt() { >>>>>>>>> >>>>>>>>> 1. Read ORIG_PIR=PIR, ORIG_IRR=IRR. >>>>>>>>> Never set IRR when HWAPIC enabled, even if outside of guest mode. >>>>>>>>> 2. Set PIR and let HW or SW VM-entry transfer it to IRR. >>>>>>>>> 3. set_irq return value: (ORIG_PIR or ORIG_IRR set). >>>>>>>>> } >>>>>>>>> >>>>>>>>> Two or more concurrent set_irq can race with each other, though. Can >>>>>>>>> either document the race or add a lock. >>>>>>>> According the SDM, software should not touch the IRR when target > vcpu >>> is >>>>>>> running. Instead, use locked way to access PIR. So your solution may >>>>>>> wrong. Then your apicv patches are broken, because they do exactly >>>>>>> that. >>>>>> Which code is broken? >>>>>> >>>>> The one that updates IRR directly on the apic page. >>>> No, all the updates are ensuring the target vcpu is not running. So >>>> it's safe to touch IRR. >>>> >>> Not at all. Read the code. >> Sorry. I still cannot figure out which code is wrong. All the places >> call sync_pir_to_irr() are on target vcpu. Can you point out the code? >> Thanks. >> > I am talking about the vapic patches which are already in, not the pir patches. Yes, but the issue will be fixed with the pir patches. With posted interrupts, it will touch the PIR instead of the IRR, and accessing the PIR is allowed by HW. Best regards, Yang
On Tue, Feb 05, 2013 at 01:40:44PM +0000, Zhang, Yang Z wrote: > Gleb Natapov wrote on 2013-02-05: > > On Tue, Feb 05, 2013 at 01:26:42PM +0000, Zhang, Yang Z wrote: > >> Gleb Natapov wrote on 2013-02-05: > >>> On Tue, Feb 05, 2013 at 10:58:28AM +0000, Zhang, Yang Z wrote: > >>>> Gleb Natapov wrote on 2013-02-05: > >>>>> On Tue, Feb 05, 2013 at 10:35:55AM +0000, Zhang, Yang Z wrote: > >>>>>> Gleb Natapov wrote on 2013-02-05: > >>>>>>> On Tue, Feb 05, 2013 at 05:57:14AM +0000, Zhang, Yang Z wrote: > >>>>>>>> Marcelo Tosatti wrote on 2013-02-05: > >>>>>>>>> On Mon, Feb 04, 2013 at 05:59:52PM -0200, Marcelo Tosatti wrote: > >>>>>>>>>> On Mon, Feb 04, 2013 at 07:13:01PM +0200, Gleb Natapov wrote: > >>>>>>>>>>> On Mon, Feb 04, 2013 at 12:43:45PM -0200, Marcelo Tosatti wrote: > >>>>>>>>>>>>>> Any example how software relies on such > >>>>>>>>> two-interrupts-queued-in-IRR/ISR behaviour? > >>>>>>>>>>>>> Don't know about guests, but KVM relies on it to detect > >>>>>>>>>>>>> interrupt coalescing. So if interrupt is set in IRR but not in > >>>>>>>>>>>>> PIR interrupt will not be reported as coalesced, but it will > >>>>>>>>>>>>> be coalesced during PIR->IRR merge. > >>>>>>>>>>>> > >>>>>>>>>>>> Yes, so: > >>>>>>>>>>>> > >>>>>>>>>>>> 1. IRR=1, ISR=0, PIR=0. Event: set_irq, coalesced=no. > >>>>>>>>>>>> 2. IRR=0, ISR=1, PIR=0. Event: IRR->ISR transfer. > >>>>>>>>>>>> 3. vcpu outside of guest mode. > >>>>>>>>>>>> 4. IRR=1, ISR=1, PIR=0. Event: set_irq, coalesced=no. > >>>>>>>>>>>> 5. vcpu enters guest mode. > >>>>>>>>>>>> 6. IRR=1, ISR=1, PIR=1. Event: set_irq, coalesced=no. > >>>>>>>>>>>> 7. HW transfers PIR into IRR. > >>>>>>>>>>>> > >>>>>>>>>>>> set_irq return value at 7 is incorrect, interrupt event was _not_ > >>>>>>>>>>>> queued. > >>>>>>>>>>> Not sure I understand the flow of events in your description > >>>>>>>>>>> correctly. As I understand it at 4 set_irq() will return incorrect > >>>>>>>>>>> result. 
Basically when PIR is set to 1 while IRR has 1 for the > >>>>>>>>>>> vector the value of set_irq() will be incorrect. > >>>>>>>>>> > >>>>>>>>>> At 4 it has not been coalesced: it has been queued to IRR. > >>>>>>>>>> At 6 it has been coalesced: PIR bit merged into IRR bit. > >>>>>>>>>> > >>>>>>>>>>> Frankly I do not see how it can be fixed > >>>>>>>>>>> without any race with present HW PIR design. > >>>>>>>>>> > >>>>>>>>>> At kvm_accept_apic_interrupt, check IRR before setting PIR bit, > >>>>>>>>>> if IRR already set, don't set PIR. > >>>>>>>>> > >>>>>>>>> Or: > >>>>>>>>> > >>>>>>>>> apic_accept_interrupt() { > >>>>>>>>> > >>>>>>>>> 1. Read ORIG_PIR=PIR, ORIG_IRR=IRR. > >>>>>>>>> Never set IRR when HWAPIC enabled, even if outside of guest mode. > >>>>>>>>> 2. Set PIR and let HW or SW VM-entry transfer it to IRR. > >>>>>>>>> 3. set_irq return value: (ORIG_PIR or ORIG_IRR set). > >>>>>>>>> } > >>>>>>>>> > >>>>>>>>> Two or more concurrent set_irq can race with each other, though. Can > >>>>>>>>> either document the race or add a lock. > >>>>>>>> According the SDM, software should not touch the IRR when target > > vcpu > >>> is > >>>>>>> running. Instead, use locked way to access PIR. So your solution may > >>>>>>> wrong. Then your apicv patches are broken, because they do exactly > >>>>>>> that. > >>>>>> Which code is broken? > >>>>>> > >>>>> The one that updates IRR directly on the apic page. > >>>> No, all the updates are ensuring the target vcpu is not running. So > >>>> it's safe to touch IRR. > >>>> > >>> Not at all. Read the code. > >> Sorry. I still cannot figure out which code is wrong. All the places > >> call sync_pir_to_irr() are on target vcpu. Can you point out the code? > >> Thanks. > >> > > I am taking about vapic patches which are already in, not pir patches. > Yes, but the issue will be fixed with pir patches. With posted interrupt, it will touch PIR instead IRR and access PIR is allowed by HW. > So the current code is broken? -- Gleb. 
On Tue, Feb 05, 2013 at 09:32:50AM +0200, Gleb Natapov wrote: > On Mon, Feb 04, 2013 at 06:47:30PM -0200, Marcelo Tosatti wrote: > > On Mon, Feb 04, 2013 at 05:59:52PM -0200, Marcelo Tosatti wrote: > > > On Mon, Feb 04, 2013 at 07:13:01PM +0200, Gleb Natapov wrote: > > > > On Mon, Feb 04, 2013 at 12:43:45PM -0200, Marcelo Tosatti wrote: > > > > > > > Any example how software relies on such two-interrupts-queued-in-IRR/ISR behaviour? > > > > > > Don't know about guests, but KVM relies on it to detect interrupt > > > > > > coalescing. So if interrupt is set in IRR but not in PIR interrupt will > > > > > > not be reported as coalesced, but it will be coalesced during PIR->IRR > > > > > > merge. > > > > > > > > > > Yes, so: > > > > > > > > > > 1. IRR=1, ISR=0, PIR=0. Event: set_irq, coalesced=no. > > > > > 2. IRR=0, ISR=1, PIR=0. Event: IRR->ISR transfer. > > > > > 3. vcpu outside of guest mode. > > > > > 4. IRR=1, ISR=1, PIR=0. Event: set_irq, coalesced=no. > > > > > 5. vcpu enters guest mode. > > > > > 6. IRR=1, ISR=1, PIR=1. Event: set_irq, coalesced=no. > > > > > 7. HW transfers PIR into IRR. > > > > > > > > > > set_irq return value at 7 is incorrect, interrupt event was _not_ > > > > > queued. > > > > Not sure I understand the flow of events in your description correctly. As I > > > > understand it at 4 set_irq() will return incorrect result. Basically > > > > when PIR is set to 1 while IRR has 1 for the vector the value of > > > > set_irq() will be incorrect. > > > > > > At 4 it has not been coalesced: it has been queued to IRR. > > > At 6 it has been coalesced: PIR bit merged into IRR bit. > > > > Yes, that's the case. > > > > > Frankly I do not see how it can be fixed > > > > without any race with present HW PIR design. > > > > > > At kvm_accept_apic_interrupt, check IRR before setting PIR bit, if IRR > > > already set, don't set PIR. > Need to check both IRR and PIR. 
Something like that: > > apic_accept_interrupt() { > if (PIR || IRR) > return coalesced; > else > set PIR; > } > > This has two problems. Firs is that interrupt that can be delivered will > be not (IRR is cleared just after it was tested), but it will be reported > as coalesced, so this is benign race. Yes, and the same condition exists today with IRR, its fine. > Second is that interrupt may be > reported as delivered, but it will be coalesced (possible only with the self > IPI with the same vector): > > Starting condition: PIR=0, IRR=0 vcpu is in a guest mode > > io thread | vcpu > accept_apic_interrupt() | > PIR and IRR is zero | > set PIR | > return delivered | > | self IPI > | set IRR > | merge PIR to IRR (*) > > At (*) interrupt that was reported as delivered is coalesced. Only vcpu itself should send self-IPI, so its fine. > > Or: > > > > apic_accept_interrupt() { > > > > 1. Read ORIG_PIR=PIR, ORIG_IRR=IRR. > > Never set IRR when HWAPIC enabled, even if outside of guest mode. > > 2. Set PIR and let HW or SW VM-entry transfer it to IRR. > > 3. set_irq return value: (ORIG_PIR or ORIG_IRR set). > > } > > > This can report interrupt as coalesced, but it will be eventually delivered > as separate interrupt: > > Starting condition: PIR=0, IRR=1 vcpu is in a guest mode > > io thread | vcpu > | > accept_apic_interrupt() | > ORIG_PIR=0, ORIG_IRR=1 | > | EOI > | clear IRR, set ISR > set PIR | > return coalesced | > | clear PIR, set IRR > | EOI > | clear IRR, set ISR (*) > > At (*) interrupt that was reported as coalesced is delivered. > > > So still no perfect solution. But first one has much less serious > problems for our practical needs. > > > Two or more concurrent set_irq can race with each other, though. Can > > either document the race or add a lock. > > > > -- > Gleb. Ok, then: accept_apic_irq: 1. coalesced = test_and_set_bit(PIR) 2. set KVM_REQ_EVENT bit (*) 3. if (vcpu->in_guest_mode) 4. if (test_and_set_bit(pir notification bit)) 5. send PIR IPI 6. 
return coalesced Other sites: A: On VM-entry, after disabling interrupts, but before the last check for ->requests, clear pir notification bit (unconditionally). (*) This is _necessary_ also because during VM-exit a PIR IPI interrupt can be missed, so the KVM_REQ_EVENT indicates that SW is responsible for PIR->IRR transfer. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Feb 06, 2013 at 08:49:23PM -0200, Marcelo Tosatti wrote: > On Tue, Feb 05, 2013 at 09:32:50AM +0200, Gleb Natapov wrote: > > On Mon, Feb 04, 2013 at 06:47:30PM -0200, Marcelo Tosatti wrote: > > > On Mon, Feb 04, 2013 at 05:59:52PM -0200, Marcelo Tosatti wrote: > > > > On Mon, Feb 04, 2013 at 07:13:01PM +0200, Gleb Natapov wrote: > > > > > On Mon, Feb 04, 2013 at 12:43:45PM -0200, Marcelo Tosatti wrote: > > > > > > > > Any example how software relies on such two-interrupts-queued-in-IRR/ISR behaviour? > > > > > > > Don't know about guests, but KVM relies on it to detect interrupt > > > > > > > coalescing. So if interrupt is set in IRR but not in PIR interrupt will > > > > > > > not be reported as coalesced, but it will be coalesced during PIR->IRR > > > > > > > merge. > > > > > > > > > > > > Yes, so: > > > > > > > > > > > > 1. IRR=1, ISR=0, PIR=0. Event: set_irq, coalesced=no. > > > > > > 2. IRR=0, ISR=1, PIR=0. Event: IRR->ISR transfer. > > > > > > 3. vcpu outside of guest mode. > > > > > > 4. IRR=1, ISR=1, PIR=0. Event: set_irq, coalesced=no. > > > > > > 5. vcpu enters guest mode. > > > > > > 6. IRR=1, ISR=1, PIR=1. Event: set_irq, coalesced=no. > > > > > > 7. HW transfers PIR into IRR. > > > > > > > > > > > > set_irq return value at 7 is incorrect, interrupt event was _not_ > > > > > > queued. > > > > > Not sure I understand the flow of events in your description correctly. As I > > > > > understand it at 4 set_irq() will return incorrect result. Basically > > > > > when PIR is set to 1 while IRR has 1 for the vector the value of > > > > > set_irq() will be incorrect. > > > > > > > > At 4 it has not been coalesced: it has been queued to IRR. > > > > At 6 it has been coalesced: PIR bit merged into IRR bit. > > > > > > Yes, that's the case. > > > > > > > Frankly I do not see how it can be fixed > > > > > without any race with present HW PIR design. 
> > > > > > > > At kvm_accept_apic_interrupt, check IRR before setting PIR bit, if IRR > > > > already set, don't set PIR. > > Need to check both IRR and PIR. Something like that: > > > > apic_accept_interrupt() { > > if (PIR || IRR) > > return coalesced; > > else > > set PIR; > > } > > > > This has two problems. Firs is that interrupt that can be delivered will > > be not (IRR is cleared just after it was tested), but it will be reported > > as coalesced, so this is benign race. > > Yes, and the same condition exists today with IRR, its fine. > > > Second is that interrupt may be > > reported as delivered, but it will be coalesced (possible only with the self > > IPI with the same vector): > > > > Starting condition: PIR=0, IRR=0 vcpu is in a guest mode > > > > io thread | vcpu > > accept_apic_interrupt() | > > PIR and IRR is zero | > > set PIR | > > return delivered | > > | self IPI > > | set IRR > > | merge PIR to IRR (*) > > > > At (*) interrupt that was reported as delivered is coalesced. > > Only vcpu itself should send self-IPI, so its fine. > > > > Or: > > > > > > apic_accept_interrupt() { > > > > > > 1. Read ORIG_PIR=PIR, ORIG_IRR=IRR. > > > Never set IRR when HWAPIC enabled, even if outside of guest mode. > > > 2. Set PIR and let HW or SW VM-entry transfer it to IRR. > > > 3. set_irq return value: (ORIG_PIR or ORIG_IRR set). > > > } > > > > > This can report interrupt as coalesced, but it will be eventually delivered > > as separate interrupt: > > > > Starting condition: PIR=0, IRR=1 vcpu is in a guest mode > > > > io thread | vcpu > > | > > accept_apic_interrupt() | > > ORIG_PIR=0, ORIG_IRR=1 | > > | EOI > > | clear IRR, set ISR > > set PIR | > > return coalesced | > > | clear PIR, set IRR > > | EOI > > | clear IRR, set ISR (*) > > > > At (*) interrupt that was reported as coalesced is delivered. > > > > > > So still no perfect solution. But first one has much less serious > > problems for our practical needs. 
> > > > > Two or more concurrent set_irq can race with each other, though. Can > > > either document the race or add a lock. > > > > > > > -- > > Gleb. > > Ok, then: > > accept_apic_irq: > 1. coalesced = test_and_set_bit(PIR) > 2. set KVM_REQ_EVENT bit (*) > 3. if (vcpu->in_guest_mode) > 4. if (test_and_set_bit(pir notification bit)) > 5. send PIR IPI > 6. return coalesced > > Other sites: > A: On VM-entry, after disabling interrupts, but before > the last check for ->requests, clear pir notification bit > (unconditionally). > > (*) This is _necessary_ also because during VM-exit a PIR IPI interrupt can > be missed, so the KVM_REQ_EVENT indicates that SW is responsible for > PIR->IRR transfer. Its not a bad idea to have a new KVM_REQ_ bit for PIR processing (just as the current patches do). -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
> >>>>>>>> According the SDM, software should not touch the IRR when target > > vcpu > >>> is > >>>>>>> running. Instead, use locked way to access PIR. So your solution may > >>>>>>> wrong. Then your apicv patches are broken, because they do exactly > >>>>>>> that. > >>>>>> Which code is broken? > >>>>>> > >>>>> The one that updates IRR directly on the apic page. > >>>> No, all the updates are ensuring the target vcpu is not running. So > >>>> it's safe to touch IRR. > >>>> > >>> Not at all. Read the code. > >> Sorry. I still cannot figure out which code is wrong. All the places > >> call sync_pir_to_irr() are on target vcpu. Can you point out the code? > >> Thanks. > >> > > I am taking about vapic patches which are already in, not pir patches. > Yes, but the issue will be fixed with pir patches. With posted interrupt, it will touch PIR instead IRR and access PIR is allowed by HW. > > Best regards, > Yang > From http://www.mail-archive.com/kvm@vger.kernel.org/msg82824.html: " > 2. Section 29.6 mentions that "Use of the posted-interrupt descriptor > differs from that of other data structures that are referenced by > pointers in a VMCS. There is a general requirement that software > ensure > that each such data structure is modified only when no logical > processor > with a current VMCS that references it is in VMX non-root operation. > That requirement does not apply to the posted-interrupt descriptor. > There is a requirement, however, that such modifications be done using > locked read-modify-write instructions." > > The APIC virtual page is being modified by a CPU while a logical > processor with current VMCS that references it is in VMX non-root > operation, in fact even modifying the APIC virtual page with EOI > virtualizaton, virtual interrupt delivery, etc. What are the > requirements in this case? It should be same with posted interrupt. Software must ensure to use atomic access to virtual apic page. " Can this point be clarified? 
Can software access the virtual APIC page while a VMCS that references it is in VMX non-root operation, or can it not? Because if it cannot, then it means the current code is broken and VID usage without PIR should not be allowed.
On Wed, Feb 06, 2013 at 10:24:06PM -0200, Marcelo Tosatti wrote: > On Wed, Feb 06, 2013 at 08:49:23PM -0200, Marcelo Tosatti wrote: > > On Tue, Feb 05, 2013 at 09:32:50AM +0200, Gleb Natapov wrote: > > > On Mon, Feb 04, 2013 at 06:47:30PM -0200, Marcelo Tosatti wrote: > > > > On Mon, Feb 04, 2013 at 05:59:52PM -0200, Marcelo Tosatti wrote: > > > > > On Mon, Feb 04, 2013 at 07:13:01PM +0200, Gleb Natapov wrote: > > > > > > On Mon, Feb 04, 2013 at 12:43:45PM -0200, Marcelo Tosatti wrote: > > > > > > > > > Any example how software relies on such two-interrupts-queued-in-IRR/ISR behaviour? > > > > > > > > Don't know about guests, but KVM relies on it to detect interrupt > > > > > > > > coalescing. So if interrupt is set in IRR but not in PIR interrupt will > > > > > > > > not be reported as coalesced, but it will be coalesced during PIR->IRR > > > > > > > > merge. > > > > > > > > > > > > > > Yes, so: > > > > > > > > > > > > > > 1. IRR=1, ISR=0, PIR=0. Event: set_irq, coalesced=no. > > > > > > > 2. IRR=0, ISR=1, PIR=0. Event: IRR->ISR transfer. > > > > > > > 3. vcpu outside of guest mode. > > > > > > > 4. IRR=1, ISR=1, PIR=0. Event: set_irq, coalesced=no. > > > > > > > 5. vcpu enters guest mode. > > > > > > > 6. IRR=1, ISR=1, PIR=1. Event: set_irq, coalesced=no. > > > > > > > 7. HW transfers PIR into IRR. > > > > > > > > > > > > > > set_irq return value at 7 is incorrect, interrupt event was _not_ > > > > > > > queued. > > > > > > Not sure I understand the flow of events in your description correctly. As I > > > > > > understand it at 4 set_irq() will return incorrect result. Basically > > > > > > when PIR is set to 1 while IRR has 1 for the vector the value of > > > > > > set_irq() will be incorrect. > > > > > > > > > > At 4 it has not been coalesced: it has been queued to IRR. > > > > > At 6 it has been coalesced: PIR bit merged into IRR bit. > > > > > > > > Yes, that's the case. 
> > > > > > > > > Frankly I do not see how it can be fixed > > > > > > without any race with present HW PIR design. > > > > > > > > > > At kvm_accept_apic_interrupt, check IRR before setting PIR bit, if IRR > > > > > already set, don't set PIR. > > > Need to check both IRR and PIR. Something like that: > > > > > > apic_accept_interrupt() { > > > if (PIR || IRR) > > > return coalesced; > > > else > > > set PIR; > > > } > > > > > > This has two problems. Firs is that interrupt that can be delivered will > > > be not (IRR is cleared just after it was tested), but it will be reported > > > as coalesced, so this is benign race. > > > > Yes, and the same condition exists today with IRR, its fine. > > > > > Second is that interrupt may be > > > reported as delivered, but it will be coalesced (possible only with the self > > > IPI with the same vector): > > > > > > Starting condition: PIR=0, IRR=0 vcpu is in a guest mode > > > > > > io thread | vcpu > > > accept_apic_interrupt() | > > > PIR and IRR is zero | > > > set PIR | > > > return delivered | > > > | self IPI > > > | set IRR > > > | merge PIR to IRR (*) > > > > > > At (*) interrupt that was reported as delivered is coalesced. > > > > Only vcpu itself should send self-IPI, so its fine. > > > > > > Or: > > > > > > > > apic_accept_interrupt() { > > > > > > > > 1. Read ORIG_PIR=PIR, ORIG_IRR=IRR. > > > > Never set IRR when HWAPIC enabled, even if outside of guest mode. > > > > 2. Set PIR and let HW or SW VM-entry transfer it to IRR. > > > > 3. set_irq return value: (ORIG_PIR or ORIG_IRR set). 
> > > > } > > > > > > > This can report interrupt as coalesced, but it will be eventually delivered > > > as separate interrupt: > > > > > > Starting condition: PIR=0, IRR=1 vcpu is in a guest mode > > > > > > io thread | vcpu > > > | > > > accept_apic_interrupt() | > > > ORIG_PIR=0, ORIG_IRR=1 | > > > | EOI > > > | clear IRR, set ISR > > > set PIR | > > > return coalesced | > > > | clear PIR, set IRR > > > | EOI > > > | clear IRR, set ISR (*) > > > > > > At (*) interrupt that was reported as coalesced is delivered. > > > > > > > > > So still no perfect solution. But first one has much less serious > > > problems for our practical needs. > > > > > > > Two or more concurrent set_irq can race with each other, though. Can > > > > either document the race or add a lock. > > > > > > > > > > -- > > > Gleb. > > > > Ok, then: > > > > accept_apic_irq: > > 1. coalesced = test_and_set_bit(PIR) > > 2. set KVM_REQ_EVENT bit (*) > > 3. if (vcpu->in_guest_mode) > > 4. if (test_and_set_bit(pir notification bit)) > > 5. send PIR IPI > > 6. return coalesced > > > > Other sites: > > A: On VM-entry, after disabling interrupts, but before > > the last check for ->requests, clear pir notification bit > > (unconditionally). > > > > (*) This is _necessary_ also because during VM-exit a PIR IPI interrupt can > > be missed, so the KVM_REQ_EVENT indicates that SW is responsible for > > PIR->IRR transfer. > > Its not a bad idea to have a new KVM_REQ_ bit for PIR processing (just > as the current patches do). Without the numbers I do not see why. -- Gleb. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Feb 06, 2013 at 08:49:23PM -0200, Marcelo Tosatti wrote: > > Second is that interrupt may be > > reported as delivered, but it will be coalesced (possible only with the self > > IPI with the same vector): > > > > Starting condition: PIR=0, IRR=0 vcpu is in a guest mode > > > > io thread | vcpu > > accept_apic_interrupt() | > > PIR and IRR is zero | > > set PIR | > > return delivered | > > | self IPI > > | set IRR > > | merge PIR to IRR (*) > > > > At (*) interrupt that was reported as delivered is coalesced. > > Only vcpu itself should send self-IPI, so its fine. > It is fine only because this is not happening in practice (I hope) for single interrupt we care about. Otherwise this is serious problem. > > > Or: > > > > > > apic_accept_interrupt() { > > > > > > 1. Read ORIG_PIR=PIR, ORIG_IRR=IRR. > > > Never set IRR when HWAPIC enabled, even if outside of guest mode. > > > 2. Set PIR and let HW or SW VM-entry transfer it to IRR. > > > 3. set_irq return value: (ORIG_PIR or ORIG_IRR set). > > > } > > > > > This can report interrupt as coalesced, but it will be eventually delivered > > as separate interrupt: > > > > Starting condition: PIR=0, IRR=1 vcpu is in a guest mode > > > > io thread | vcpu > > | > > accept_apic_interrupt() | > > ORIG_PIR=0, ORIG_IRR=1 | > > | EOI > > | clear IRR, set ISR > > set PIR | > > return coalesced | > > | clear PIR, set IRR > > | EOI > > | clear IRR, set ISR (*) > > > > At (*) interrupt that was reported as coalesced is delivered. > > > > > > So still no perfect solution. But first one has much less serious > > problems for our practical needs. > > > > > Two or more concurrent set_irq can race with each other, though. Can > > > either document the race or add a lock. > > > > > > > -- > > Gleb. > > Ok, then: > > accept_apic_irq: > 1. coalesced = test_and_set_bit(PIR) > 2. set KVM_REQ_EVENT bit (*) > 3. if (vcpu->in_guest_mode) > 4. if (test_and_set_bit(pir notification bit)) > 5. send PIR IPI > 6. 
return coalesced

Do not see how it will help.

Starting condition: PIR=0, IRR=1 vcpu is in a guest mode

io thread               | vcpu
accept_apic_interrupt() |
coalesced = 0, PIR=1    |
vcpu in a guest mode:   |
send PIR IPI            |
                        | receive PIR IPI
                        | merge PIR to IRR (*)
return not coalesced    |

At (*) interrupt that was reported as delivered is coalesced.

The point is that we need to check PIR and IRR atomically and this is
impossible.

> Other sites:
> A: On VM-entry, after disabling interrupts, but before
> the last check for ->requests, clear pir notification bit
> (unconditionally).
>
> (*) This is _necessary_ also because during VM-exit a PIR IPI interrupt can
> be missed, so the KVM_REQ_EVENT indicates that SW is responsible for
> PIR->IRR transfer.

--
Gleb.
On Thu, Feb 07, 2013 at 04:01:11PM +0200, Gleb Natapov wrote: > On Wed, Feb 06, 2013 at 08:49:23PM -0200, Marcelo Tosatti wrote: > > > Second is that interrupt may be > > > reported as delivered, but it will be coalesced (possible only with the self > > > IPI with the same vector): > > > > > > Starting condition: PIR=0, IRR=0 vcpu is in a guest mode > > > > > > io thread | vcpu > > > accept_apic_interrupt() | > > > PIR and IRR is zero | > > > set PIR | > > > return delivered | > > > | self IPI > > > | set IRR > > > | merge PIR to IRR (*) > > > > > > At (*) interrupt that was reported as delivered is coalesced. > > > > Only vcpu itself should send self-IPI, so its fine. > > > It is fine only because this is not happening in practice (I hope) for single interrupt > we care about. Otherwise this is serious problem. Coalesced information is only interesting for non IPI cases, that is, device emulation (at the moment, at least). The above cause can happen when loading APIC registers, but delivered is not interesting in that case. Good to document, however. > > > > Or: > > > > > > > > apic_accept_interrupt() { > > > > > > > > 1. Read ORIG_PIR=PIR, ORIG_IRR=IRR. > > > > Never set IRR when HWAPIC enabled, even if outside of guest mode. > > > > 2. Set PIR and let HW or SW VM-entry transfer it to IRR. > > > > 3. set_irq return value: (ORIG_PIR or ORIG_IRR set). > > > > } > > > > > > > This can report interrupt as coalesced, but it will be eventually delivered > > > as separate interrupt: > > > > > > Starting condition: PIR=0, IRR=1 vcpu is in a guest mode > > > > > > io thread | vcpu > > > | > > > accept_apic_interrupt() | > > > ORIG_PIR=0, ORIG_IRR=1 | > > > | EOI > > > | clear IRR, set ISR > > > set PIR | > > > return coalesced | > > > | clear PIR, set IRR > > > | EOI > > > | clear IRR, set ISR (*) > > > > > > At (*) interrupt that was reported as coalesced is delivered. > > > > > > > > > So still no perfect solution. 
But first one has much less serious
> > > problems for our practical needs.
> > >
> > > > Two or more concurrent set_irq can race with each other, though. Can
> > > > either document the race or add a lock.
> > >
> > > --
> > > Gleb.
> >
> > Ok, then:
> >
> > accept_apic_irq:
> > 1. coalesced = test_and_set_bit(PIR)
> > 2. set KVM_REQ_EVENT bit (*)
> > 3. if (vcpu->in_guest_mode)
> > 4. if (test_and_set_bit(pir notification bit))
> > 5. send PIR IPI
> > 6. return coalesced
>
> Do not see how it will help.
>
> Starting condition: PIR=0, IRR=1 vcpu is in a guest mode
>
> io thread               | vcpu
> accept_apic_interrupt() |
> coalesced = 0, PIR=1    |
> vcpu in a guest mode:   |
> send PIR IPI            |
>                         | receive PIR IPI
>                         | merge PIR to IRR (*)
> return not coalesced    |
>
> At (*) interrupt that was reported as delivered is coalesced.

Of course!

> The point is that we need to check PIR and IRR atomically and this is
> impossible.

Ok, next try:

1. orig_irr = read irr from vapic page
2. if (orig_irr == 0)
3.     return test_and_set_bit(pir)
4. return 0

To summarize, given irr and pir (the bit for the vector in question),
the possibilities are:

irr=0, pir=0: inject, return coalesced=0.
irr=0, pir=1: not injected, return coalesced=1.
irr=1, pir=0: not injected, return coalesced=1.
irr=1, pir=1: not injected, return coalesced=1.

Note the behaviour regarding whether to inject or not is the same as
today: it depends on IRR being set. If PIR=1 and IRR=0, IRR will
eventually be set.

> > Other sites:
> > A: On VM-entry, after disabling interrupts, but before
> > the last check for ->requests, clear pir notification bit
> > (unconditionally).
> >
> > (*) This is _necessary_ also because during VM-exit a PIR IPI interrupt can
> > be missed, so the KVM_REQ_EVENT indicates that SW is responsible for
> > PIR->IRR transfer.
>
> --
> Gleb.
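Marcelo's "next try" above and its summary table can be checked with a one-bit model (illustrative, not KVM code). Note that step 4's "return 0" reads like a slip, since the summary table reports coalesced=1 whenever IRR is already set; the sketch follows the table:

```c
/*
 * One-bit model of the "read IRR first, then test_and_set_bit(PIR)"
 * scheme discussed above. NOT KVM code; names are illustrative.
 * Returns true when the interrupt is reported as coalesced.
 */
#include <assert.h>
#include <stdbool.h>

static bool accept(bool *irr, bool *pir)
{
    if (!*irr) {              /* steps 1-2: orig_irr == 0 */
        bool old = *pir;      /* step 3: test_and_set_bit(pir) */
        *pir = true;
        return old;           /* coalesced iff PIR was already set */
    }
    return true;              /* IRR already set: coalesced, PIR untouched */
}
```

The four assertions below reproduce the four rows of the summary table.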
On Thu, Feb 07, 2013 at 03:52:24PM +0200, Gleb Natapov wrote:
> > Its not a bad idea to have a new KVM_REQ_ bit for PIR processing (just
> > as the current patches do).
> Without the numbers I do not see why.

KVM_REQ_EVENT already means... counting... many things. Its a well
defined request, to sync PIR->VIRR, don't see your point about
performance.
On Fri, Feb 08, 2013 at 12:07:36AM -0200, Marcelo Tosatti wrote:
> On Thu, Feb 07, 2013 at 03:52:24PM +0200, Gleb Natapov wrote:
> > > Its not a bad idea to have a new KVM_REQ_ bit for PIR processing (just
> > > as the current patches do).
> > Without the numbers I do not see why.
>
> KVM_REQ_EVENT already means... counting... many things. Its a well

Exactly my point. KVM_REQ_EVENT means many things, all of them event
injection related. It could be split into two or three smaller requests,
but that would complicate the code, and why would we do that without
actually being able to show the performance improvement the split
provides?

> defined request, to sync PIR->VIRR, don't see your point about
> performance.

And it is just one more thing that needs to be done during event
injection. Without a good reason there is no need to handle it
specially. Performance is a convincing enough reason, but I want to see
numbers.

--
Gleb.
On Thu, Feb 07, 2013 at 07:49:47PM -0200, Marcelo Tosatti wrote: > On Thu, Feb 07, 2013 at 04:01:11PM +0200, Gleb Natapov wrote: > > On Wed, Feb 06, 2013 at 08:49:23PM -0200, Marcelo Tosatti wrote: > > > > Second is that interrupt may be > > > > reported as delivered, but it will be coalesced (possible only with the self > > > > IPI with the same vector): > > > > > > > > Starting condition: PIR=0, IRR=0 vcpu is in a guest mode > > > > > > > > io thread | vcpu > > > > accept_apic_interrupt() | > > > > PIR and IRR is zero | > > > > set PIR | > > > > return delivered | > > > > | self IPI > > > > | set IRR > > > > | merge PIR to IRR (*) > > > > > > > > At (*) interrupt that was reported as delivered is coalesced. > > > > > > Only vcpu itself should send self-IPI, so its fine. > > > > > It is fine only because this is not happening in practice (I hope) for single interrupt > > we care about. Otherwise this is serious problem. > > Coalesced information is only interesting for non IPI cases, that > is, device emulation (at the moment, at least). > And incorrect result will be returned for an interrupt injected by an emulated device in the scenario above. > The above cause can happen when loading APIC registers, but delivered > is not interesting in that case. Good to document, however. > > > > > > Or: > > > > > > > > > > apic_accept_interrupt() { > > > > > > > > > > 1. Read ORIG_PIR=PIR, ORIG_IRR=IRR. > > > > > Never set IRR when HWAPIC enabled, even if outside of guest mode. > > > > > 2. Set PIR and let HW or SW VM-entry transfer it to IRR. > > > > > 3. set_irq return value: (ORIG_PIR or ORIG_IRR set). 
> > > > > } > > > > > > > > > This can report interrupt as coalesced, but it will be eventually delivered > > > > as separate interrupt: > > > > > > > > Starting condition: PIR=0, IRR=1 vcpu is in a guest mode > > > > > > > > io thread | vcpu > > > > | > > > > accept_apic_interrupt() | > > > > ORIG_PIR=0, ORIG_IRR=1 | > > > > | EOI > > > > | clear IRR, set ISR > > > > set PIR | > > > > return coalesced | > > > > | clear PIR, set IRR > > > > | EOI > > > > | clear IRR, set ISR (*) > > > > > > > > At (*) interrupt that was reported as coalesced is delivered. > > > > > > > > > > > > So still no perfect solution. But first one has much less serious > > > > problems for our practical needs. > > > > > > > > > Two or more concurrent set_irq can race with each other, though. Can > > > > > either document the race or add a lock. > > > > > > > > > > > > > -- > > > > Gleb. > > > > > > Ok, then: > > > > > > accept_apic_irq: > > > 1. coalesced = test_and_set_bit(PIR) > > > 2. set KVM_REQ_EVENT bit (*) > > > 3. if (vcpu->in_guest_mode) > > > 4. if (test_and_set_bit(pir notification bit)) > > > 5. send PIR IPI > > > 6. return coalesced > > Do not see how it will help. > > > > Starting condition: PIR=0, IRR=1 vcpu is in a guest mode > > > > io thread | vcpu > > accept_apic_interrupt() | > > coalesced = 0, PIR=1 | > > vcpu in a guest mode: | > > send PIR IPI | > > | receive PIR IPI > > | merge PIR to IRR (*) > > return not coalesced | > > > > At (*) interrupt that was reported as delivered is coalesced. > > Of course! > > > The point is that we need to check PIR and IRR atomically and this is > > impossible. > > Ok, next try: > > 1. orig_irr = read irr from vapic page > 2. if (orig_irr == 0) > 3. return test_and_test_bit(pir) > 4. return 0 > I think this is exactly same solution we are discussing above: apic_accept_interrupt() { if (PIR || IRR) return coalesced; else set PIR; } with the same self-IPI problem. 
IMO this is the best we can do and will
work correctly for RTC interrupt re-injection case.

--
Gleb.
On Fri, Feb 08, 2013 at 02:28:44PM +0200, Gleb Natapov wrote: > On Thu, Feb 07, 2013 at 07:49:47PM -0200, Marcelo Tosatti wrote: > > On Thu, Feb 07, 2013 at 04:01:11PM +0200, Gleb Natapov wrote: > > > On Wed, Feb 06, 2013 at 08:49:23PM -0200, Marcelo Tosatti wrote: > > > > > Second is that interrupt may be > > > > > reported as delivered, but it will be coalesced (possible only with the self > > > > > IPI with the same vector): > > > > > > > > > > Starting condition: PIR=0, IRR=0 vcpu is in a guest mode > > > > > > > > > > io thread | vcpu > > > > > accept_apic_interrupt() | > > > > > PIR and IRR is zero | > > > > > set PIR | > > > > > return delivered | > > > > > | self IPI > > > > > | set IRR > > > > > | merge PIR to IRR (*) > > > > > > > > > > At (*) interrupt that was reported as delivered is coalesced. > > > > > > > > Only vcpu itself should send self-IPI, so its fine. > > > > > > > It is fine only because this is not happening in practice (I hope) for single interrupt > > > we care about. Otherwise this is serious problem. > > > > Coalesced information is only interesting for non IPI cases, that > > is, device emulation (at the moment, at least). > > > And incorrect result will be returned for an interrupt injected by an emulated device > in the scenario above. > > > The above cause can happen when loading APIC registers, but delivered > > is not interesting in that case. Good to document, however. > > > > > > > > Or: > > > > > > > > > > > > apic_accept_interrupt() { > > > > > > > > > > > > 1. Read ORIG_PIR=PIR, ORIG_IRR=IRR. > > > > > > Never set IRR when HWAPIC enabled, even if outside of guest mode. > > > > > > 2. Set PIR and let HW or SW VM-entry transfer it to IRR. > > > > > > 3. set_irq return value: (ORIG_PIR or ORIG_IRR set). 
> > > > > > } > > > > > > > > > > > This can report interrupt as coalesced, but it will be eventually delivered > > > > > as separate interrupt: > > > > > > > > > > Starting condition: PIR=0, IRR=1 vcpu is in a guest mode > > > > > > > > > > io thread | vcpu > > > > > | > > > > > accept_apic_interrupt() | > > > > > ORIG_PIR=0, ORIG_IRR=1 | > > > > > | EOI > > > > > | clear IRR, set ISR > > > > > set PIR | > > > > > return coalesced | > > > > > | clear PIR, set IRR > > > > > | EOI > > > > > | clear IRR, set ISR (*) > > > > > > > > > > At (*) interrupt that was reported as coalesced is delivered. > > > > > > > > > > > > > > > So still no perfect solution. But first one has much less serious > > > > > problems for our practical needs. > > > > > > > > > > > Two or more concurrent set_irq can race with each other, though. Can > > > > > > either document the race or add a lock. > > > > > > > > > > > > > > > > -- > > > > > Gleb. > > > > > > > > Ok, then: > > > > > > > > accept_apic_irq: > > > > 1. coalesced = test_and_set_bit(PIR) > > > > 2. set KVM_REQ_EVENT bit (*) > > > > 3. if (vcpu->in_guest_mode) > > > > 4. if (test_and_set_bit(pir notification bit)) > > > > 5. send PIR IPI > > > > 6. return coalesced > > > Do not see how it will help. > > > > > > Starting condition: PIR=0, IRR=1 vcpu is in a guest mode > > > > > > io thread | vcpu > > > accept_apic_interrupt() | > > > coalesced = 0, PIR=1 | > > > vcpu in a guest mode: | > > > send PIR IPI | > > > | receive PIR IPI > > > | merge PIR to IRR (*) > > > return not coalesced | > > > > > > At (*) interrupt that was reported as delivered is coalesced. > > > > Of course! > > > > > The point is that we need to check PIR and IRR atomically and this is > > > impossible. > > > > Ok, next try: > > > > 1. orig_irr = read irr from vapic page > > 2. if (orig_irr == 0) > > 3. return test_and_test_bit(pir) > > 4. 
return 0

> I think this is exactly same solution we are discussing above:
>
> apic_accept_interrupt() {
>     if (PIR || IRR)
>         return coalesced;
>     else
>         set PIR;
> }

Its not exactly the same. Without test_and_set_bit(PIR) two vcpus can
race.

> with the same self-IPI problem. IMO this is the best we can do and will
> work correctly for RTC interrupt re-injection case.

Yes.
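The consumer side of this scheme, the PIR->IRR merge that vmx_update_irr() in the patch below performs with a cmpxchg loop, can be sketched with C11 atomics. This is a standalone model under invented names, not the kernel code itself:

```c
/*
 * Model of the PIR -> IRR merge: for each 32-bit PIR chunk, OR it
 * into the corresponding IRR word with a compare-exchange loop (the
 * IRR word may be concurrently updated), then clear the PIR chunk.
 * NOT KVM code; names are illustrative.
 */
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define PIR_WORDS 8

struct apic_model {
    _Atomic uint32_t irr[PIR_WORDS];  /* stands in for the vAPIC IRR */
    uint32_t pir[PIR_WORDS];          /* posted-interrupt requests */
};

static void merge_pir_to_irr(struct apic_model *a)
{
    for (int i = 0; i < PIR_WORDS; i++) {
        uint32_t chunk = a->pir[i];
        if (!chunk)
            continue;
        uint32_t old = atomic_load(&a->irr[i]);
        /* retry until no concurrent writer raced with us */
        while (!atomic_compare_exchange_weak(&a->irr[i], &old,
                                             old | chunk))
            ;
        a->pir[i] = 0;
    }
}
```

The compare-exchange reloads `old` on failure, so the loop converges even if another thread sets IRR bits between the load and the exchange.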
diff --git a/arch/x86/include/asm/entry_arch.h b/arch/x86/include/asm/entry_arch.h index 40afa00..7b0a29e 100644 --- a/arch/x86/include/asm/entry_arch.h +++ b/arch/x86/include/asm/entry_arch.h @@ -18,6 +18,7 @@ BUILD_INTERRUPT(reboot_interrupt,REBOOT_VECTOR) #endif BUILD_INTERRUPT(x86_platform_ipi, X86_PLATFORM_IPI_VECTOR) +BUILD_INTERRUPT(posted_intr_ipi, POSTED_INTR_VECTOR) /* * every pentium local APIC has two 'local interrupts', with a diff --git a/arch/x86/include/asm/hw_irq.h b/arch/x86/include/asm/hw_irq.h index eb92a6e..ee61af3 100644 --- a/arch/x86/include/asm/hw_irq.h +++ b/arch/x86/include/asm/hw_irq.h @@ -28,6 +28,7 @@ /* Interrupt handlers registered during init_IRQ */ extern void apic_timer_interrupt(void); extern void x86_platform_ipi(void); +extern void posted_intr_ipi(void); extern void error_interrupt(void); extern void irq_work_interrupt(void); diff --git a/arch/x86/include/asm/irq.h b/arch/x86/include/asm/irq.h index ba870bb..cff9933 100644 --- a/arch/x86/include/asm/irq.h +++ b/arch/x86/include/asm/irq.h @@ -30,6 +30,7 @@ extern void irq_force_complete_move(int); #endif extern void (*x86_platform_ipi_callback)(void); +extern void (*posted_intr_callback)(void); extern void native_init_IRQ(void); extern bool handle_irq(unsigned irq, struct pt_regs *regs); diff --git a/arch/x86/include/asm/irq_vectors.h b/arch/x86/include/asm/irq_vectors.h index 1508e51..8f2e383 100644 --- a/arch/x86/include/asm/irq_vectors.h +++ b/arch/x86/include/asm/irq_vectors.h @@ -102,6 +102,10 @@ */ #define X86_PLATFORM_IPI_VECTOR 0xf7 +#ifdef CONFIG_HAVE_KVM +#define POSTED_INTR_VECTOR 0xf2 +#endif + /* * IRQ work vector: */ diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 7e26d1a..82423a8 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -700,6 +700,9 @@ struct kvm_x86_ops { int (*has_virtual_interrupt_delivery)(struct kvm_vcpu *vcpu); void (*update_irq)(struct kvm_vcpu *vcpu); void 
(*update_eoi_exitmap)(struct kvm_vcpu *vcpu, int vector, bool set); + int (*has_posted_interrupt)(struct kvm_vcpu *vcpu); + int (*send_nv)(struct kvm_vcpu *vcpu, int vector); + void (*update_irr)(struct kvm_vcpu *vcpu); int (*set_tss_addr)(struct kvm *kvm, unsigned int addr); int (*get_tdp_level)(void); u64 (*get_mt_mask)(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio); diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h index 1003341..7b9e1d0 100644 --- a/arch/x86/include/asm/vmx.h +++ b/arch/x86/include/asm/vmx.h @@ -152,6 +152,7 @@ #define PIN_BASED_EXT_INTR_MASK 0x00000001 #define PIN_BASED_NMI_EXITING 0x00000008 #define PIN_BASED_VIRTUAL_NMIS 0x00000020 +#define PIN_BASED_POSTED_INTR 0x00000080 #define VM_EXIT_SAVE_DEBUG_CONTROLS 0x00000002 #define VM_EXIT_HOST_ADDR_SPACE_SIZE 0x00000200 @@ -174,6 +175,7 @@ /* VMCS Encodings */ enum vmcs_field { VIRTUAL_PROCESSOR_ID = 0x00000000, + POSTED_INTR_NV = 0x00000002, GUEST_ES_SELECTOR = 0x00000800, GUEST_CS_SELECTOR = 0x00000802, GUEST_SS_SELECTOR = 0x00000804, @@ -208,6 +210,8 @@ enum vmcs_field { VIRTUAL_APIC_PAGE_ADDR_HIGH = 0x00002013, APIC_ACCESS_ADDR = 0x00002014, APIC_ACCESS_ADDR_HIGH = 0x00002015, + POSTED_INTR_DESC_ADDR = 0x00002016, + POSTED_INTR_DESC_ADDR_HIGH = 0x00002017, EPT_POINTER = 0x0000201a, EPT_POINTER_HIGH = 0x0000201b, EOI_EXIT_BITMAP0 = 0x0000201c, diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S index b51b2c7..d06eea1 100644 --- a/arch/x86/kernel/entry_64.S +++ b/arch/x86/kernel/entry_64.S @@ -1160,6 +1160,8 @@ apicinterrupt LOCAL_TIMER_VECTOR \ apic_timer_interrupt smp_apic_timer_interrupt apicinterrupt X86_PLATFORM_IPI_VECTOR \ x86_platform_ipi smp_x86_platform_ipi +apicinterrupt POSTED_INTR_VECTOR \ + posted_intr_ipi smp_posted_intr_ipi apicinterrupt THRESHOLD_APIC_VECTOR \ threshold_interrupt smp_threshold_interrupt diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c index e4595f1..781d324 100644 --- a/arch/x86/kernel/irq.c +++ 
b/arch/x86/kernel/irq.c @@ -22,6 +22,9 @@ atomic_t irq_err_count; /* Function pointer for generic interrupt vector handling */ void (*x86_platform_ipi_callback)(void) = NULL; +/* Function pointer for posted interrupt vector handling */ +void (*posted_intr_callback)(void) = NULL; +EXPORT_SYMBOL_GPL(posted_intr_callback); /* * 'what should we do if we get a hw irq event on an illegal vector'. @@ -228,6 +231,28 @@ void smp_x86_platform_ipi(struct pt_regs *regs) set_irq_regs(old_regs); } +/* + * Handler for POSTED_INTERRUPT_VECTOR. + */ +void smp_posted_intr_ipi(struct pt_regs *regs) +{ + struct pt_regs *old_regs = set_irq_regs(regs); + + ack_APIC_irq(); + + irq_enter(); + + exit_idle(); + + if (posted_intr_callback) + posted_intr_callback(); + + irq_exit(); + + set_irq_regs(old_regs); +} + + EXPORT_SYMBOL_GPL(vector_used_by_percpu_irq); #ifdef CONFIG_HOTPLUG_CPU diff --git a/arch/x86/kernel/irqinit.c b/arch/x86/kernel/irqinit.c index 6e03b0d..d15ca4f 100644 --- a/arch/x86/kernel/irqinit.c +++ b/arch/x86/kernel/irqinit.c @@ -205,6 +205,8 @@ static void __init apic_intr_init(void) /* IPI for X86 platform specific use */ alloc_intr_gate(X86_PLATFORM_IPI_VECTOR, x86_platform_ipi); + /* IPI for posted interrupt use */ + alloc_intr_gate(POSTED_INTR_VECTOR, posted_intr_ipi); /* IPI vectors for APIC spurious and error interrupts */ alloc_intr_gate(SPURIOUS_APIC_VECTOR, spurious_interrupt); diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c index 2109a6a..d660b9d 100644 --- a/arch/x86/kvm/lapic.c +++ b/arch/x86/kvm/lapic.c @@ -350,6 +350,7 @@ static inline int apic_find_highest_irr(struct kvm_lapic *apic) if (!apic->irr_pending) return -1; + kvm_x86_ops->update_irr(apic->vcpu); result = apic_search_irr(apic); ASSERT(result == -1 || result >= 16); @@ -725,18 +726,25 @@ static int __apic_accept_irq(struct kvm_lapic *apic, int delivery_mode, if (trig_mode) { apic_debug("level trig mode for vector %d", vector); apic_set_vector(vector, apic->regs + APIC_TMR); - } else + } 
else { apic_clear_vector(vector, apic->regs + APIC_TMR); - + if (kvm_x86_ops->has_posted_interrupt(vcpu)) { + result = 1; + apic->irr_pending = true; + kvm_x86_ops->send_nv(vcpu, vector); + goto out; + } + } result = !apic_test_and_set_irr(vector, apic); - trace_kvm_apic_accept_irq(vcpu->vcpu_id, delivery_mode, - trig_mode, vector, !result); if (!result) { if (trig_mode) apic_debug("level trig mode repeatedly for " "vector %d", vector); break; } +out: + trace_kvm_apic_accept_irq(vcpu->vcpu_id, delivery_mode, + trig_mode, vector, !result); kvm_make_request(KVM_REQ_EVENT, vcpu); kvm_vcpu_kick(vcpu); diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h index 10e3f66..0f8361e 100644 --- a/arch/x86/kvm/lapic.h +++ b/arch/x86/kvm/lapic.h @@ -42,6 +42,7 @@ int kvm_get_apic_interrupt(struct kvm_vcpu *vcpu); int kvm_cpu_has_extint(struct kvm_vcpu *v); int kvm_cpu_get_extint(struct kvm_vcpu *v); int kvm_apic_get_highest_irr(struct kvm_vcpu *vcpu); +void kvm_apic_update_irr(struct kvm_vcpu *vcpu, unsigned int *pir); void kvm_lapic_reset(struct kvm_vcpu *vcpu); u64 kvm_lapic_get_cr8(struct kvm_vcpu *vcpu); void kvm_lapic_set_tpr(struct kvm_vcpu *vcpu, unsigned long cr8); diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 6b6bd03..07dbde6 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -31,6 +31,7 @@ #include <linux/ftrace_event.h> #include <linux/slab.h> #include <linux/tboot.h> +#include <linux/interrupt.h> #include "kvm_cache_regs.h" #include "x86.h" @@ -86,6 +87,8 @@ module_param(fasteoi, bool, S_IRUGO); static bool __read_mostly enable_apicv_reg_vid = 1; module_param(enable_apicv_reg_vid, bool, S_IRUGO); +static bool __read_mostly enable_apicv_pi = 1; +module_param(enable_apicv_pi, bool, S_IRUGO); /* * If nested=1, nested virtualization is supported, i.e., guests may use * VMX and be a hypervisor for its own guests. 
If nested=0, guests may not @@ -369,6 +372,35 @@ struct nested_vmx { struct page *apic_access_page; }; +#define POSTED_INTR_ON 0 +/* Posted-Interrupt Descriptor */ +struct pi_desc { + u32 pir[8]; /* Posted interrupt requested */ + union { + struct { + u8 on:1, + rsvd:7; + } control; + u32 rsvd[8]; + } u; +} __aligned(64); + +static void pi_clear_on(struct pi_desc *pi_desc) +{ + clear_bit(POSTED_INTR_ON, (unsigned long *)&pi_desc->u.control); +} + +static u8 pi_test_and_set_on(struct pi_desc *pi_desc) +{ + return test_and_set_bit(POSTED_INTR_ON, + (unsigned long *)&pi_desc->u.control); +} + +static void pi_set_pir(int vector, struct pi_desc *pi_desc) +{ + set_bit(vector, (unsigned long *)pi_desc->pir); +} + struct vcpu_vmx { struct kvm_vcpu vcpu; unsigned long host_rsp; @@ -435,6 +467,9 @@ struct vcpu_vmx { u8 eoi_exitmap_changed; u32 eoi_exit_bitmap[8]; + /* Posted interrupt descriptor */ + struct pi_desc *pi; + /* Support for a guest hypervisor (nested VMX) */ struct nested_vmx nested; }; @@ -779,6 +814,11 @@ static inline bool cpu_has_vmx_virtual_intr_delivery(void) SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY; } +static inline bool cpu_has_vmx_posted_intr(void) +{ + return vmcs_config.pin_based_exec_ctrl & PIN_BASED_POSTED_INTR; +} + static inline bool cpu_has_vmx_flexpriority(void) { return cpu_has_vmx_tpr_shadow() && @@ -2475,12 +2515,6 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf) u32 _vmexit_control = 0; u32 _vmentry_control = 0; - min = PIN_BASED_EXT_INTR_MASK | PIN_BASED_NMI_EXITING; - opt = PIN_BASED_VIRTUAL_NMIS; - if (adjust_vmx_controls(min, opt, MSR_IA32_VMX_PINBASED_CTLS, - &_pin_based_exec_control) < 0) - return -EIO; - min = CPU_BASED_HLT_EXITING | #ifdef CONFIG_X86_64 CPU_BASED_CR8_LOAD_EXITING | @@ -2554,6 +2588,17 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf) &_vmexit_control) < 0) return -EIO; + min = PIN_BASED_EXT_INTR_MASK | PIN_BASED_NMI_EXITING; + opt = PIN_BASED_VIRTUAL_NMIS | PIN_BASED_POSTED_INTR; 
+ if (adjust_vmx_controls(min, opt, MSR_IA32_VMX_PINBASED_CTLS, + &_pin_based_exec_control) < 0) + return -EIO; + + if (!(_cpu_based_2nd_exec_control & + SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY) || + !(_vmexit_control & VM_EXIT_ACK_INTR_ON_EXIT)) + _pin_based_exec_control &= ~PIN_BASED_POSTED_INTR; + min = 0; opt = VM_ENTRY_LOAD_IA32_PAT; if (adjust_vmx_controls(min, opt, MSR_IA32_VMX_ENTRY_CTLS, @@ -2739,6 +2784,9 @@ static __init int hardware_setup(void) if (enable_apicv_reg_vid) kvm_x86_ops->update_cr8_intercept = NULL; + if (!cpu_has_vmx_posted_intr() || !enable_apicv_reg_vid) + enable_apicv_pi = 0; + if (nested) nested_vmx_setup_ctls_msrs(); @@ -3904,6 +3952,57 @@ static void ept_set_mmio_spte_mask(void) kvm_mmu_set_mmio_spte_mask(0xffull << 49 | 0x6ull); } +static void pi_handler(void) +{ + ; +} + +static int vmx_has_posted_interrupt(struct kvm_vcpu *vcpu) +{ + return irqchip_in_kernel(vcpu->kvm) && enable_apicv_pi; +} + +static int vmx_send_nv(struct kvm_vcpu *vcpu, + int vector) +{ + struct vcpu_vmx *vmx = to_vmx(vcpu); + + pi_set_pir(vector, vmx->pi); + if (!pi_test_and_set_on(vmx->pi) && (vcpu->mode == IN_GUEST_MODE)) { + apic->send_IPI_mask(get_cpu_mask(vcpu->cpu), POSTED_INTR_VECTOR); + return 1; + } + return 0; +} + +static void vmx_update_irr(struct kvm_vcpu *vcpu) +{ + struct vcpu_vmx *vmx = to_vmx(vcpu); + struct kvm_lapic *apic = vcpu->arch.apic; + unsigned int i, old, new, val, irr_off; + + if (!enable_apicv_pi) + return; + + for (i = 0; i <= 7; i++) { + if (vmx->pi->pir[i]) { + irr_off = APIC_IRR + i * 0x10; + do { + old = kvm_apic_get_reg(apic, irr_off); + new = old | vmx->pi->pir[i]; + val = cmpxchg((u32 *)(apic->regs + irr_off), old, new); + } while (unlikely (val != old)); + vmx->pi->pir[i] = 0; + } + } +} + +static void free_pi(struct vcpu_vmx *vmx) +{ + if (enable_apicv_pi) + kfree(vmx->pi); +} + /* * Sets up the vmcs for emulated real mode. 
*/ @@ -3913,6 +4012,7 @@ static int vmx_vcpu_setup(struct vcpu_vmx *vmx) unsigned long a; #endif int i; + u32 pin_based_exec_ctrl = vmcs_config.pin_based_exec_ctrl; u32 vmexit_ctrl = vmcs_config.vmexit_ctrl; /* I/O */ @@ -3925,8 +4025,10 @@ static int vmx_vcpu_setup(struct vcpu_vmx *vmx) vmcs_write64(VMCS_LINK_POINTER, -1ull); /* 22.3.1.5 */ /* Control */ - vmcs_write32(PIN_BASED_VM_EXEC_CONTROL, - vmcs_config.pin_based_exec_ctrl); + if (!enable_apicv_pi) + pin_based_exec_ctrl &= ~PIN_BASED_POSTED_INTR; + + vmcs_write32(PIN_BASED_VM_EXEC_CONTROL, pin_based_exec_ctrl); vmcs_write32(CPU_BASED_VM_EXEC_CONTROL, vmx_exec_control(vmx)); @@ -3944,6 +4046,13 @@ static int vmx_vcpu_setup(struct vcpu_vmx *vmx) vmcs_write16(GUEST_INTR_STATUS, 0); } + if (enable_apicv_pi) { + vmx->pi = kmalloc(sizeof(struct pi_desc), + GFP_KERNEL | __GFP_ZERO); + vmcs_write64(POSTED_INTR_NV, POSTED_INTR_VECTOR); + vmcs_write64(POSTED_INTR_DESC_ADDR, __pa((vmx->pi))); + } + if (ple_gap) { vmcs_write32(PLE_GAP, ple_gap); vmcs_write32(PLE_WINDOW, ple_window); @@ -6220,6 +6329,8 @@ static void vmx_update_irq(struct kvm_vcpu *vcpu) vmx->eoi_exit_bitmap[index]); vmx->eoi_exitmap_changed = 0; } + if (enable_apicv_pi) + pi_clear_on(vmx->pi); } static void vmx_update_eoi_exitmap(struct kvm_vcpu *vcpu, @@ -6626,6 +6737,7 @@ static void vmx_free_vcpu(struct kvm_vcpu *vcpu) free_vpid(vmx); free_nested(vmx); + free_pi(vmx); free_loaded_vmcs(vmx->loaded_vmcs); kfree(vmx->guest_msrs); kvm_vcpu_uninit(vcpu); @@ -7520,8 +7632,11 @@ static struct kvm_x86_ops vmx_x86_ops = { .enable_irq_window = enable_irq_window, .update_cr8_intercept = update_cr8_intercept, .has_virtual_interrupt_delivery = vmx_has_virtual_interrupt_delivery, + .has_posted_interrupt = vmx_has_posted_interrupt, .update_irq = vmx_update_irq, .update_eoi_exitmap = vmx_update_eoi_exitmap, + .send_nv = vmx_send_nv, + .update_irr = vmx_update_irr, .set_tss_addr = vmx_set_tss_addr, .get_tdp_level = get_ept_level, @@ -7618,7 +7733,7 @@ static int 
__init vmx_init(void) /* SELF-IPI */ vmx_disable_intercept_for_msr_write(0x83f, false); } - + posted_intr_callback = pi_handler; if (enable_ept) { kvm_mmu_set_mask_ptes(0ull, (enable_ept_ad_bits) ? VMX_EPT_ACCESS_BIT : 0ull,