Message ID: 20230326062039.341479-2-sathyanarayanan.kuppuswamy@linux.intel.com
State: New
Series: TDX Guest Quote generation support
On Sat, 2023-03-25 at 23:20 -0700, Kuppuswamy Sathyanarayanan wrote:
> Host-guest event notification via configured interrupt vector is useful
> in cases where a guest makes an asynchronous request and needs a
> callback from the host to indicate the completion or to let the host
> notify the guest about events like device removal. One usage example is
> the callback requirement of the GetQuote asynchronous hypercall.
>
> In TDX guest, SetupEventNotifyInterrupt hypercall can be used by the
> guest to specify which interrupt vector to use as an event-notify
> vector to the VMM.

"to the VMM" -> "from the VMM"?

> Details about the SetupEventNotifyInterrupt
> hypercall can be found in TDX Guest-Host Communication Interface
> (GHCI) Specification, sec 3.5 "VP.VMCALL<SetupEventNotifyInterrupt>".

It seems we shouldn't mention the exact section number.

> As per design, VMM will post the event completion IRQ using the same
> CPU in which SetupEventNotifyInterrupt hypercall request is received.

"in which" -> "on which"

> So allocate an IRQ vector from "x86_vector_domain", and set the CPU
> affinity of the IRQ vector to the current CPU.

IMHO "current CPU" is a little bit vague. Perhaps just "to the CPU on
which the SetupEventNotifyInterrupt hypercall is made".

Also, perhaps it's better to mention using IRQF_NOBALANCING to prevent
the IRQ from being migrated to another CPU.

> Add tdx_register_event_irq_cb()/tdx_unregister_event_irq_cb()
> interfaces to allow drivers to register/unregister event notification
> handlers.
>
> Reviewed-by: Tony Luck <tony.luck@intel.com>
> Reviewed-by: Andi Kleen <ak@linux.intel.com>
> Reviewed-by: Mika Westerberg <mika.westerberg@linux.intel.com>
> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Acked-by: Wander Lairson Costa <wander@redhat.com>
> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> ---
>  arch/x86/coco/tdx/tdx.c    | 163 +++++++++++++++++++++++++++++++++++++
>  arch/x86/include/asm/tdx.h |   6 ++
>  2 files changed, 169 insertions(+)
>
> diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
> index 055300e08fb3..d03985952d45 100644
> --- a/arch/x86/coco/tdx/tdx.c
> +++ b/arch/x86/coco/tdx/tdx.c
> @@ -7,12 +7,18 @@
>  #include <linux/cpufeature.h>
>  #include <linux/export.h>
>  #include <linux/io.h>
> +#include <linux/string.h>
> +#include <linux/uaccess.h>

Do you need above two headers? Also, perhaps you should explicitly
include <.../list.h> and <.../spinlock.h>.

> +#include <linux/interrupt.h>
> +#include <linux/irq.h>
> +#include <linux/numa.h>
>  #include <asm/coco.h>
>  #include <asm/tdx.h>
>  #include <asm/vmx.h>
>  #include <asm/insn.h>
>  #include <asm/insn-eval.h>
>  #include <asm/pgtable.h>
> +#include <asm/irqdomain.h>
>
>  /* TDX module Call Leaf IDs */
>  #define TDX_GET_INFO			1
> @@ -27,6 +33,7 @@
>  /* TDX hypercall Leaf IDs */
>  #define TDVMCALL_MAP_GPA		0x10001
>  #define TDVMCALL_REPORT_FATAL_ERROR	0x10003
> +#define TDVMCALL_SETUP_NOTIFY_INTR	0x10004
>
>  /* MMIO direction */
>  #define EPT_READ	0
> @@ -51,6 +58,16 @@
>
>  #define TDREPORT_SUBTYPE_0	0
>
> +struct event_irq_entry {
> +	tdx_event_irq_cb_t handler;
> +	void *data;
> +	struct list_head head;
> +};
> +
> +static int tdx_event_irq;

__ro_after_init?

> +static LIST_HEAD(event_irq_cb_list);
> +static DEFINE_SPINLOCK(event_irq_cb_lock);
> +
>  /*
>   * Wrapper for standard use of __tdx_hypercall with no output aside from
>   * return code.
> @@ -873,3 +890,149 @@ void __init tdx_early_init(void)
>
>  	pr_info("Guest detected\n");
>  }
> +
> +static irqreturn_t tdx_event_irq_handler(int irq, void *dev_id)
> +{
> +	struct event_irq_entry *entry;
> +
> +	spin_lock(&event_irq_cb_lock);
> +	list_for_each_entry(entry, &event_irq_cb_list, head) {
> +		if (entry->handler)
> +			entry->handler(entry->data);
> +	}
> +	spin_unlock(&event_irq_cb_lock);
> +
> +	return IRQ_HANDLED;
> +}
> +
> +/* Reserve an IRQ from x86_vector_domain for TD event notification */
> +static int __init tdx_event_irq_init(void)
> +{
> +	struct irq_alloc_info info;
> +	cpumask_t saved_cpumask;
> +	struct irq_cfg *cfg;
> +	int cpu, irq;
> +
> +	if (!cpu_feature_enabled(X86_FEATURE_TDX_GUEST))
> +		return 0;
> +
> +	init_irq_alloc_info(&info, NULL);
> +
> +	/*
> +	 * Event notification vector will be delivered to the CPU
> +	 * in which TDVMCALL_SETUP_NOTIFY_INTR hypercall is requested.
> +	 * So set the IRQ affinity to the current CPU.
> +	 */
> +	cpu = get_cpu();
> +	cpumask_copy(&saved_cpumask, current->cpus_ptr);
> +	info.mask = cpumask_of(cpu);
> +	put_cpu();

The 'saved_cpumask' related code is ugly. If you move put_cpu() to the
end of this function, I think you can remove all related code:

	cpu = get_cpu();

	/*
	 * Set @info->mask to local cpu to make sure a valid vector is
	 * pre-allocated when TDX event notification IRQ is allocated
	 * from x86_vector_domain.
	 */
	init_irq_alloc_info(&info, cpumask_of(cpu));

	// rest stuff: request_irq(), hypercall ...

	put_cpu();

> +
> +	irq = irq_domain_alloc_irqs(x86_vector_domain, 1, NUMA_NO_NODE, &info);

Should you use cpu_to_node(cpu) instead of NUMA_NO_NODE?
> +	if (irq <= 0) {
> +		pr_err("Event notification IRQ allocation failed %d\n", irq);
> +		return -EIO;
> +	}
> +
> +	irq_set_handler(irq, handle_edge_irq);
> +
> +	cfg = irq_cfg(irq);
> +	if (!cfg) {
> +		pr_err("Event notification IRQ config not found\n");
> +		goto err_free_irqs;
> +	}

You are depending on irq_domain_alloc_irqs() to have already allocated
a vector on the local cpu. Then if !cfg, your code of calling
irq_domain_alloc_irqs() to allocate the vector is broken. So, perhaps
you should just WARN() if a vector hasn't been allocated, to catch bugs:

	WARN(!(irq_cfg(irq)->vector));

> +
> +	if (request_irq(irq, tdx_event_irq_handler, IRQF_NOBALANCING,

It's better to add a comment to explain why IRQF_NOBALANCING is used.

> +			"tdx_event_irq", NULL)) {
> +		pr_err("Event notification IRQ request failed\n");
> +		goto err_free_irqs;
> +	}
> +
> +	set_cpus_allowed_ptr(current, cpumask_of(cpu));
> +
> +	/*
> +	 * Register callback vector address with VMM. More details
> +	 * about the ABI can be found in TDX Guest-Host-Communication
> +	 * Interface (GHCI), sec titled
> +	 * "TDG.VP.VMCALL<SetupEventNotifyInterrupt>".
> +	 */
> +	if (_tdx_hypercall(TDVMCALL_SETUP_NOTIFY_INTR, cfg->vector, 0, 0, 0)) {
> +		pr_err("Event notification hypercall failed\n");
> +		goto err_restore_cpus;
> +	}
> +
> +	set_cpus_allowed_ptr(current, &saved_cpumask);
> +
> +	tdx_event_irq = irq;
> +
> +	return 0;
> +
> +err_restore_cpus:
> +	set_cpus_allowed_ptr(current, &saved_cpumask);
> +	free_irq(irq, NULL);
> +err_free_irqs:
> +	irq_domain_free_irqs(irq, 1);
> +
> +	return -EIO;
> +}
> +arch_initcall(tdx_event_irq_init)
Hi Kai,

On 3/27/23 7:38 PM, Huang, Kai wrote:
>> +/* Reserve an IRQ from x86_vector_domain for TD event notification */
>> +static int __init tdx_event_irq_init(void)
>> +{
[snip]
>> +	cpu = get_cpu();
>> +	cpumask_copy(&saved_cpumask, current->cpus_ptr);
>> +	info.mask = cpumask_of(cpu);
>> +	put_cpu();
> The 'saved_cpumask' related code is ugly. If you move put_cpu() to the end of
> this function, I think you can remove all related code:
>
> 	cpu = get_cpu();
>
> 	/*
> 	 * Set @info->mask to local cpu to make sure a valid vector is
> 	 * pre-allocated when TDX event notification IRQ is allocated
> 	 * from x86_vector_domain.
> 	 */
> 	init_irq_alloc_info(&info, cpumask_of(cpu));
>
> 	// rest stuff: request_irq(), hypercall ...
>
> 	put_cpu();
>

init_irq_alloc_info() is a sleeping function. Since get_cpu() disables
preemption, we cannot call a sleeping function after it. Initially, I
implemented it as you mentioned, but I discovered the following error.

[ 2.400755] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:580
[ 2.404664] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 1, name: swapper/0
[ 2.408671] preempt_count: 1, expected: 0
[ 2.412650] RCU nest depth: 0, expected: 0
[ 2.412666] no locks held by swapper/0/1.
[ 2.416650] Preemption disabled at:
[ 2.416650] [<ffffffff83b8089f>] tdx_arch_init+0x38/0x117
[ 2.420670] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 6.1.0-rc4-00117-g672ca073d9f9-dirty #2527
[ 2.424650] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
[ 2.424650] Call Trace:
[ 2.424650]  <TASK>
[ 2.424650]  dump_stack_lvl+0x6a/0x86
[ 2.424650]  __might_resched.cold+0xf4/0x12f
[ 2.424650]  __mutex_lock+0x50/0x810
[ 2.424650]  ? lock_is_held_type+0xd8/0x130
[ 2.424650]  ? __irq_alloc_descs+0xcf/0x310
[ 2.424650]  ? find_held_lock+0x2b/0x80
[ 2.424650]  ? __irq_alloc_descs+0xcf/0x310
[ 2.424650]  __irq_alloc_descs+0xcf/0x310
[ 2.424650]  irq_domain_alloc_descs.part.0+0x49/0xa0
[ 2.424650]  __irq_domain_alloc_irqs+0x2a0/0x4f0
[ 2.424650]  ? next_arg+0x129/0x1f0
[ 2.424650]  ? tdx_guest_init+0x5b/0x5b
[ 2.424650]  tdx_arch_init+0x8e/0x117
[ 2.424650]  do_one_initcall+0x137/0x2ec
[ 2.424650]  ? rcu_read_lock_sched_held+0x36/0x60
[ 2.424650]  kernel_init_freeable+0x1e3/0x241
[ 2.424650]  ? rest_init+0x1a0/0x1a0
[ 2.424650]  kernel_init+0x17/0x170
[ 2.424650]  ? rest_init+0x1a0/0x1a0
[ 2.424650]  ret_from_fork+0x1f/0x30
[ 2.424650]  </TASK>
On Mon, 2023-03-27 at 19:50 -0700, Sathyanarayanan Kuppuswamy wrote:
> Hi Kai,
>
> On 3/27/23 7:38 PM, Huang, Kai wrote:
> > The 'saved_cpumask' related code is ugly. If you move put_cpu() to the end of
> > this function, I think you can remove all related code:
[snip]
> init_irq_alloc_info() is a sleeping function. Since get_cpu() disables
> preemption, we cannot call sleeping function after it. Initially, I
> have implemented it like you have mentioned. However, I discovered the
> following error.

Oh sorry I forgot this. So I think we should use migrate_disable() instead:

	migrate_disable();

	init_irq_alloc_info(&info, cpumask_of(smp_processor_id()));

	...

	migrate_enable();

Or, should we just use early_initcall() so that only BSP is running? IMHO
it's OK to always allocate the vector from BSP.

Anyway, either way is fine to me.
On 3/27/23 9:02 PM, Huang, Kai wrote:
> On Mon, 2023-03-27 at 19:50 -0700, Sathyanarayanan Kuppuswamy wrote:
[snip]
>> init_irq_alloc_info() is a sleeping function. Since get_cpu() disables
>> preemption, we cannot call sleeping function after it. Initially, I
>> have implemented it like you have mentioned. However, I discovered the
>> following error.
>
> Oh sorry I forgot this. So I think we should use migrate_disable() instead:
>
> 	migrate_disable();
>
> 	init_irq_alloc_info(&info, cpumask_of(smp_processor_id()));
>
> 	...
>
> 	migrate_enable();
>
> Or, should we just use early_initcall() so that only BSP is running? IMHO it's
> OK to always allocate the vector from BSP.
>
> Anyway, either way is fine to me.

Final version looks like below.
static int __init tdx_event_irq_init(void)
{
	struct irq_alloc_info info;
	struct irq_cfg *cfg;
	int irq;

	if (!cpu_feature_enabled(X86_FEATURE_TDX_GUEST))
		return 0;

	init_irq_alloc_info(&info, NULL);

	/*
	 * Event notification vector will be delivered to the CPU
	 * on which TDVMCALL_SETUP_NOTIFY_INTR hypercall is requested.
	 * So set the IRQ affinity to the current CPU.
	 */
	info.mask = cpumask_of(0);

	irq = irq_domain_alloc_irqs(x86_vector_domain, 1, cpu_to_node(0), &info);
	if (irq <= 0) {
		pr_err("Event notification IRQ allocation failed %d\n", irq);
		return -EIO;
	}

	irq_set_handler(irq, handle_edge_irq);

	/* Since the IRQ affinity is set, it cannot be balanced */
	if (request_irq(irq, tdx_event_irq_handler, IRQF_NOBALANCING,
			"tdx_event_irq", NULL)) {
		pr_err("Event notification IRQ request failed\n");
		goto err_free_domain_irqs;
	}

	cfg = irq_cfg(irq);

	/*
	 * Since tdx_event_irq_init() is triggered via early_initcall(),
	 * it will be called before secondary CPU bringup. Since there is
	 * only one CPU, it complies with the requirement of executing
	 * the TDVMCALL_SETUP_NOTIFY_INTR hypercall on the same CPU where
	 * the IRQ vector is allocated.
	 *
	 * Register callback vector address with VMM. More details
	 * about the ABI can be found in TDX Guest-Host-Communication
	 * Interface (GHCI), sec titled
	 * "TDG.VP.VMCALL<SetupEventNotifyInterrupt>".
	 */
	if (_tdx_hypercall(TDVMCALL_SETUP_NOTIFY_INTR, cfg->vector, 0, 0, 0)) {
		pr_err("Event notification hypercall failed\n");
		goto err_free_irqs;
	}

	tdx_event_irq = irq;

	return 0;

err_free_irqs:
	free_irq(irq, NULL);
err_free_domain_irqs:
	irq_domain_free_irqs(irq, 1);

	return -EIO;
}
early_initcall(tdx_event_irq_init)
On Tue, 2023-04-04 at 18:02 -0700, Sathyanarayanan Kuppuswamy wrote:
> On 3/27/23 9:02 PM, Huang, Kai wrote:
> > Or, should we just use early_initcall() so that only BSP is running? IMHO it's
> > OK to always allocate the vector from BSP.
> >
> > Anyway, either way is fine to me.
>
> Final version looks like below.
[snip]
> early_initcall(tdx_event_irq_init)

I found there's another series also doing a similar thing, and it seems
Thomas wasn't happy about using x86_vector_domain directly:

https://lore.kernel.org/lkml/877cv99k0y.ffs@tglx/

An alternative was also posted (creating an IRQ domain on top of
x86_vector_domain):

https://lore.kernel.org/lkml/20230328182933.GA1403032@vm02.guest.corp.microsoft.com/

I think we should monitor that and hear from others more.
diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index 055300e08fb3..d03985952d45 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -7,12 +7,18 @@
 #include <linux/cpufeature.h>
 #include <linux/export.h>
 #include <linux/io.h>
+#include <linux/string.h>
+#include <linux/uaccess.h>
+#include <linux/interrupt.h>
+#include <linux/irq.h>
+#include <linux/numa.h>
 #include <asm/coco.h>
 #include <asm/tdx.h>
 #include <asm/vmx.h>
 #include <asm/insn.h>
 #include <asm/insn-eval.h>
 #include <asm/pgtable.h>
+#include <asm/irqdomain.h>
 
 /* TDX module Call Leaf IDs */
 #define TDX_GET_INFO			1
@@ -27,6 +33,7 @@
 /* TDX hypercall Leaf IDs */
 #define TDVMCALL_MAP_GPA		0x10001
 #define TDVMCALL_REPORT_FATAL_ERROR	0x10003
+#define TDVMCALL_SETUP_NOTIFY_INTR	0x10004
 
 /* MMIO direction */
 #define EPT_READ	0
@@ -51,6 +58,16 @@
 
 #define TDREPORT_SUBTYPE_0	0
 
+struct event_irq_entry {
+	tdx_event_irq_cb_t handler;
+	void *data;
+	struct list_head head;
+};
+
+static int tdx_event_irq;
+static LIST_HEAD(event_irq_cb_list);
+static DEFINE_SPINLOCK(event_irq_cb_lock);
+
 /*
  * Wrapper for standard use of __tdx_hypercall with no output aside from
  * return code.
@@ -873,3 +890,149 @@ void __init tdx_early_init(void)
 
 	pr_info("Guest detected\n");
 }
+
+static irqreturn_t tdx_event_irq_handler(int irq, void *dev_id)
+{
+	struct event_irq_entry *entry;
+
+	spin_lock(&event_irq_cb_lock);
+	list_for_each_entry(entry, &event_irq_cb_list, head) {
+		if (entry->handler)
+			entry->handler(entry->data);
+	}
+	spin_unlock(&event_irq_cb_lock);
+
+	return IRQ_HANDLED;
+}
+
+/* Reserve an IRQ from x86_vector_domain for TD event notification */
+static int __init tdx_event_irq_init(void)
+{
+	struct irq_alloc_info info;
+	cpumask_t saved_cpumask;
+	struct irq_cfg *cfg;
+	int cpu, irq;
+
+	if (!cpu_feature_enabled(X86_FEATURE_TDX_GUEST))
+		return 0;
+
+	init_irq_alloc_info(&info, NULL);
+
+	/*
+	 * Event notification vector will be delivered to the CPU
+	 * in which TDVMCALL_SETUP_NOTIFY_INTR hypercall is requested.
+	 * So set the IRQ affinity to the current CPU.
+	 */
+	cpu = get_cpu();
+	cpumask_copy(&saved_cpumask, current->cpus_ptr);
+	info.mask = cpumask_of(cpu);
+	put_cpu();
+
+	irq = irq_domain_alloc_irqs(x86_vector_domain, 1, NUMA_NO_NODE, &info);
+	if (irq <= 0) {
+		pr_err("Event notification IRQ allocation failed %d\n", irq);
+		return -EIO;
+	}
+
+	irq_set_handler(irq, handle_edge_irq);
+
+	cfg = irq_cfg(irq);
+	if (!cfg) {
+		pr_err("Event notification IRQ config not found\n");
+		goto err_free_irqs;
+	}
+
+	if (request_irq(irq, tdx_event_irq_handler, IRQF_NOBALANCING,
+			"tdx_event_irq", NULL)) {
+		pr_err("Event notification IRQ request failed\n");
+		goto err_free_irqs;
+	}
+
+	set_cpus_allowed_ptr(current, cpumask_of(cpu));
+
+	/*
+	 * Register callback vector address with VMM. More details
+	 * about the ABI can be found in TDX Guest-Host-Communication
+	 * Interface (GHCI), sec titled
+	 * "TDG.VP.VMCALL<SetupEventNotifyInterrupt>".
+	 */
+	if (_tdx_hypercall(TDVMCALL_SETUP_NOTIFY_INTR, cfg->vector, 0, 0, 0)) {
+		pr_err("Event notification hypercall failed\n");
+		goto err_restore_cpus;
+	}
+
+	set_cpus_allowed_ptr(current, &saved_cpumask);
+
+	tdx_event_irq = irq;
+
+	return 0;
+
+err_restore_cpus:
+	set_cpus_allowed_ptr(current, &saved_cpumask);
+	free_irq(irq, NULL);
+err_free_irqs:
+	irq_domain_free_irqs(irq, 1);
+
+	return -EIO;
+}
+arch_initcall(tdx_event_irq_init)
+
+/**
+ * tdx_register_event_irq_cb() - Register TDX event IRQ callback handler.
+ * @handler: Address of driver specific event IRQ callback handler. Handler
+ *           will be called in IRQ context and hence cannot sleep.
+ * @data: Context data to be passed to the callback handler.
+ *
+ * Return: 0 on success or standard error code on other failures.
+ */
+int tdx_register_event_irq_cb(tdx_event_irq_cb_t handler, void *data)
+{
+	struct event_irq_entry *entry;
+	unsigned long flags;
+
+	if (tdx_event_irq <= 0)
+		return -EIO;
+
+	entry = kzalloc(sizeof(*entry), GFP_KERNEL);
+	if (!entry)
+		return -ENOMEM;
+
+	entry->data = data;
+	entry->handler = handler;
+
+	spin_lock_irqsave(&event_irq_cb_lock, flags);
+	list_add_tail(&entry->head, &event_irq_cb_list);
+	spin_unlock_irqrestore(&event_irq_cb_lock, flags);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(tdx_register_event_irq_cb);
+
+/**
+ * tdx_unregister_event_irq_cb() - Unregister TDX event IRQ callback handler.
+ * @handler: Address of driver specific event IRQ callback handler.
+ * @data: Context data to be passed to the callback handler.
+ *
+ * Return: 0 on success or -EIO if event IRQ is not allocated.
+ */
+int tdx_unregister_event_irq_cb(tdx_event_irq_cb_t handler, void *data)
+{
+	struct event_irq_entry *entry;
+	unsigned long flags;
+
+	if (tdx_event_irq <= 0)
+		return -EIO;
+
+	spin_lock_irqsave(&event_irq_cb_lock, flags);
+	list_for_each_entry(entry, &event_irq_cb_list, head) {
+		if (entry->handler == handler && entry->data == data) {
+			list_del(&entry->head);
+			kfree(entry);
+			break;
+		}
+	}
+	spin_unlock_irqrestore(&event_irq_cb_lock, flags);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(tdx_unregister_event_irq_cb);
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 28d889c9aa16..8807fe1b1f3f 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -53,6 +53,8 @@ struct ve_info {
 
 #ifdef CONFIG_INTEL_TDX_GUEST
 
+typedef int (*tdx_event_irq_cb_t)(void *);
+
 void __init tdx_early_init(void);
 
 /* Used to communicate with the TDX module */
@@ -69,6 +71,10 @@ bool tdx_early_handle_ve(struct pt_regs *regs);
 
 int tdx_mcall_get_report0(u8 *reportdata, u8 *tdreport);
 
+int tdx_register_event_irq_cb(tdx_event_irq_cb_t handler, void *data);
+
+int tdx_unregister_event_irq_cb(tdx_event_irq_cb_t handler, void *data);
+
 #else
 
 static inline void tdx_early_init(void) { };