mbox series

[v3,00/15] Introduce /dev/mshv drivers

Message ID 1695407915-12216-1-git-send-email-nunodasneves@linux.microsoft.com (mailing list archive)
Headers show
Series Introduce /dev/mshv drivers | expand

Message

Nuno Das Neves Sept. 22, 2023, 6:38 p.m. UTC
This series introduces support for creating and running guest machines
while running on the Microsoft Hypervisor. [0]
This is done via an IOCTL interface accessed through /dev/mshv, similar to
/dev/kvm. Another series introducing this support was previously posted.
[1]

These interfaces support VMMs running in:
1. The root patition - provided in the mshv_root module, and
2. VTL 2 - provided in the mshv_vtl module [2]

Patches breakdown
-----------------
The first 7 patches are refactoring and adding some helper functions.
They provide some benefit on their own and could be applied independently
as cleanup patches.

Patches 8-12 just set things up for the driver code to come. These are very
small. They come first so that the remaining patches are more self-contained.

The final 3 patches are the meat of the series:
- Patch 13 contains new header files used by the driver.
  These are designed to mirror the ABI headers exported by Hyper-V. This is
  done to avoid polluting hyperv-tlfs.h and help track changes to the ABIs
  that are still unstable. (See FAQ below).
- Patch 14 conditionally includes these new header files into mshyperv.h
  and linux/hyperv.h, in order to be able to use these files in the new
  drivers while remaining independent from hyperv-tlfs.h.
- Patch 15 contains the new driver code located in drivers/hv. This is a
  large amount of code and new files, but it is mostly self-contained and
  all within drivers/hv - apart from the IOCTL interface itself in uapi.

Patch 15 is rather big and has bounced back from some mailing lists. If you
did not get a copy in your inbox, you can view it here instead:
https://github.com/NunoDasNeves/linux/commit/2ae1cabd82257cb303296ab6af707d1bac867e42

FAQ on include/uapi/hyperv/*.h
------------------------------
Q:
Why not just add these definitions to hyperv-tlfs.h?
A:
The intention of hyperv-tlfs.h is to contain stable definitions documented
in the public TLFS document. These new definitions don't fit that criteria,
so they should be separate.

Q:
The new headers redefine many things that are already in hyperv-tlfs.h - why?
A:
Some definitions are extended compared to what is documented in the TLFS.
In order to avoid adding undocumented or unstable definitions to hyperv-tlfs.h,
the new headers must compile independently.
Therefore, the new headers must redefine many things in hyperv-tlfs.h in order
to compile.

Q:
Why are these files named hvgdk.h, hvgdk_mini.h, hvhdk.h and hvhdk_mini.h?
A:
The precise meaning of the names reflects conventions used internally at
Microsoft.
Naming them this way makes it easy to find where particular Hyper-V
definitions come from, and check their correctness.
It also facilitates the future work of automatically generating these files.

Q:
Why are they in uapi?
A:
In short, to keep things simple. There are many definitions needed in both
the kernel and the VMM in userspace. Separating them doesn't serve much
purpose, and makes it more laborious to import definitions from Hyper-V
code.

--------------------------
[0] "Hyper-V" is more well-known, but it really refers to the whole stack
    including the hypervisor and other components that run in Windows
    kernel and userspace.
[1] Previous /dev/mshv patch series and discussion:
    https://lore.kernel.org/linux-hyperv/1632853875-20261-1-git-send-email-nunodasneves@linux.microsoft.com/
[2] Virtual Secure Mode (VSM) and Virtual Trust Levels (VTL):
    https://learn.microsoft.com/en-us/virtualization/hyper-v-on-windows/tlfs/vsm
--------------------------

Changes since v2:
    * Fix some commit message wrapping
    * Fix many checkpatch.pl --strict style issues
    * Replace uapi ints with __s32
    * Replace uapi enums with __u32
    * Replace uapi pointers with __u64
    * Add explicit padding to uapi structures
    * Initialize status in get/set registers hypercall helpers
    * Add missing return on error in get_vp_signaled_count
    * Remove select TRANSPARENT_HUGEPAGES for mshv_vtl
    * Use __func__ prefix consistently in printks
    * Use single generic cpuid() to get all 4 registers instead of 4 calls
    * Change hv_proximity_domain_info from union to struct for clarity
Changes since v1:
    * Clean up formatting, capitalization in commit messages
    * Add detail to commit message for patch 15
    * Remove errant lines in Makefile and Kconfig in patch 15
    * Move a reference to CONFIG_MSHV_VTL from patch 9 to 15

Nuno Das Neves (15):
  hyperv-tlfs: Change shared HV_REGISTER_* defines to HV_MSR_*
  mshyperv: Introduce hv_get_hypervisor_version function
  mshyperv: Introduce numa_node_to_proximity_domain_info
  asm-generic/mshyperv: Introduce hv_recommend_using_aeoi()
  hyperv: Move hv_connection_id to hyperv-tlfs
  hyperv-tlfs: Introduce hv_status_to_string and hv_status_to_errno
  Drivers: hv: Move hv_call_deposit_pages and hv_call_create_vp to
    common code
  Drivers: hv: Introduce per-cpu event ring tail
  Drivers: hv: Introduce hv_output_arg_exists in hv_common.c
  x86: hyperv: Add mshv_handler irq handler and setup function
  Drivers: hv: export vmbus_isr, hv_context and hv_post_message
  Documentation: Reserve ioctl number for mshv driver
  uapi: hyperv: Add mshv driver headers defining hypervisor ABIs
  asm-generic: hyperv: Use new Hyper-V headers conditionally.
  Drivers: hv: Add modules to expose /dev/mshv to VMMs running on
    Hyper-V

 .../userspace-api/ioctl/ioctl-number.rst      |    2 +
 arch/arm64/hyperv/mshyperv.c                  |   23 +-
 arch/arm64/include/asm/hyperv-tlfs.h          |   25 +
 arch/arm64/include/asm/mshyperv.h             |    2 +-
 arch/x86/hyperv/hv_init.c                     |    2 +-
 arch/x86/hyperv/hv_proc.c                     |  166 +-
 arch/x86/include/asm/hyperv-tlfs.h            |  137 +-
 arch/x86/include/asm/mshyperv.h               |   12 +-
 arch/x86/kernel/cpu/mshyperv.c                |   67 +-
 drivers/acpi/numa/srat.c                      |    1 +
 drivers/clocksource/hyperv_timer.c            |   24 +-
 drivers/hv/Kconfig                            |   49 +
 drivers/hv/Makefile                           |   20 +
 drivers/hv/hv.c                               |   50 +-
 drivers/hv/hv_call.c                          |  113 +
 drivers/hv/hv_common.c                        |  223 +-
 drivers/hv/hyperv_vmbus.h                     |    2 +-
 drivers/hv/mshv.h                             |  123 ++
 drivers/hv/mshv_eventfd.c                     |  761 +++++++
 drivers/hv/mshv_eventfd.h                     |   80 +
 drivers/hv/mshv_main.c                        |  208 ++
 drivers/hv/mshv_msi.c                         |  129 ++
 drivers/hv/mshv_portid_table.c                |   84 +
 drivers/hv/mshv_root.h                        |  193 ++
 drivers/hv/mshv_root_hv_call.c                | 1015 +++++++++
 drivers/hv/mshv_root_main.c                   | 1920 +++++++++++++++++
 drivers/hv/mshv_synic.c                       |  688 ++++++
 drivers/hv/mshv_vtl.h                         |   52 +
 drivers/hv/mshv_vtl_main.c                    | 1517 +++++++++++++
 drivers/hv/vmbus_drv.c                        |    3 +-
 drivers/hv/xfer_to_guest.c                    |   28 +
 include/asm-generic/hyperv-defs.h             |   26 +
 include/asm-generic/hyperv-tlfs.h             |   93 +-
 include/asm-generic/mshyperv.h                |   73 +-
 include/linux/hyperv.h                        |   11 +-
 include/uapi/hyperv/hvgdk.h                   |   41 +
 include/uapi/hyperv/hvgdk_mini.h              | 1076 +++++++++
 include/uapi/hyperv/hvhdk.h                   | 1342 ++++++++++++
 include/uapi/hyperv/hvhdk_mini.h              |  160 ++
 include/uapi/linux/mshv.h                     |  306 +++
 40 files changed, 10483 insertions(+), 364 deletions(-)
 create mode 100644 drivers/hv/hv_call.c
 create mode 100644 drivers/hv/mshv.h
 create mode 100644 drivers/hv/mshv_eventfd.c
 create mode 100644 drivers/hv/mshv_eventfd.h
 create mode 100644 drivers/hv/mshv_main.c
 create mode 100644 drivers/hv/mshv_msi.c
 create mode 100644 drivers/hv/mshv_portid_table.c
 create mode 100644 drivers/hv/mshv_root.h
 create mode 100644 drivers/hv/mshv_root_hv_call.c
 create mode 100644 drivers/hv/mshv_root_main.c
 create mode 100644 drivers/hv/mshv_synic.c
 create mode 100644 drivers/hv/mshv_vtl.h
 create mode 100644 drivers/hv/mshv_vtl_main.c
 create mode 100644 drivers/hv/xfer_to_guest.c
 create mode 100644 include/asm-generic/hyperv-defs.h
 create mode 100644 include/uapi/hyperv/hvgdk.h
 create mode 100644 include/uapi/hyperv/hvgdk_mini.h
 create mode 100644 include/uapi/hyperv/hvhdk.h
 create mode 100644 include/uapi/hyperv/hvhdk_mini.h
 create mode 100644 include/uapi/linux/mshv.h

Comments

Wei Liu Sept. 22, 2023, 8:02 p.m. UTC | #1
On Fri, Sep 22, 2023 at 11:38:35AM -0700, Nuno Das Neves wrote:
> Add mshv, mshv_root, and mshv_vtl modules:
> 
> Module mshv is the parent module to the other two. It provides /dev/mshv,
> plus some common hypercall helper code. When one of the child modules is
> loaded, it is registered with the mshv module, which then provides entry
> point(s) to the child module via the IOCTLs defined in uapi/linux/mshv.h.
> 
> E.g. When the mshv_root module is loaded, it registers itself, and the
> MSHV_CREATE_PARTITION IOCTL becomes available in /dev/mshv. That is used to
> get a partition fd managed by mshv_root.
> 
> Similarly for mshv_vtl module, there is MSHV_CREATE_VTL, which creates
> an fd representing the lower vtl, managed by mshv_vtl.
> 
> Module mshv_root provides APIs for creating and managing child partitions.
> It defines abstractions for partitions (vms), vps (vcpus), and other things
> related to running a guest. It exposes the userspace interfaces for a VMM
> to manage the guest.
> 
> Module mshv_vtl provides VTL (Virtual Trust Level) support for VMMs. In
> this scenario, the host kernel and VMM run in a higher trust level than the
> guest, but within the same partition. This provides better isolation and
> performance.
> 
> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>

As far as I can tell, all my comments from the previous version are
addressed. I believe Saurabh and Boqun's comments are addressed, too.

The code looks good to me, so:

Acked-by: Wei Liu <wei.liu@kernel.org>

I will wait for some time for others to chime in, just in case the
community has more comments.
Greg Kroah-Hartman Sept. 23, 2023, 7:56 a.m. UTC | #2
On Fri, Sep 22, 2023 at 11:38:35AM -0700, Nuno Das Neves wrote:
> +static int __init mshv_vtl_init(void)
> +{
> +	int ret;
> +
> +	tasklet_init(&msg_dpc, mshv_vtl_sint_on_msg_dpc, 0);
> +	init_waitqueue_head(&fd_wait_queue);
> +
> +	if (mshv_vtl_get_vsm_regs()) {
> +		pr_emerg("%s: Unable to get VSM capabilities !!\n", __func__);
> +		BUG();
> +	}


So you crash the whole kernel if someone loads this module on a non-mshv
system?

That seems quite excessive and hostile :(

Or am I somehow reading this incorrectly?

thanks,

greg k-h
Greg Kroah-Hartman Sept. 23, 2023, 7:58 a.m. UTC | #3
On Fri, Sep 22, 2023 at 11:38:35AM -0700, Nuno Das Neves wrote:
> +static int mshv_vtl_get_vsm_regs(void)
> +{
> +	struct hv_register_assoc registers[2];
> +	union hv_input_vtl input_vtl;
> +	int ret, count = 2;
> +
> +	input_vtl.as_uint8 = 0;
> +	registers[0].name = HV_REGISTER_VSM_CODE_PAGE_OFFSETS;
> +	registers[1].name = HV_REGISTER_VSM_CAPABILITIES;
> +
> +	ret = hv_call_get_vp_registers(HV_VP_INDEX_SELF, HV_PARTITION_ID_SELF,
> +				       count, input_vtl, registers);
> +	if (ret)
> +		return ret;
> +
> +	mshv_vsm_page_offsets.as_uint64 = registers[0].value.reg64;
> +	mshv_vsm_capabilities.as_uint64 = registers[1].value.reg64;
> +
> +	pr_debug("%s: VSM code page offsets: %#016llx\n", __func__,
> +		 mshv_vsm_page_offsets.as_uint64);
> +	pr_info("%s: VSM capabilities: %#016llx\n", __func__,
> +		mshv_vsm_capabilities.as_uint64);

When drivers are working properly, they are quiet.  This is very noisy
and probably is leaking memory addresses to userspace?

Also, there is NEVER a need for __func__ in a pr_debug() line, it has
that for you automatically.

Also, drivers should never call pr_*() calls, always use the proper
dev_*() calls instead.



> +
> +	return ret;
> +}
> +
> +static int mshv_vtl_configure_vsm_partition(void)
> +{
> +	union hv_register_vsm_partition_config config;
> +	struct hv_register_assoc reg_assoc;
> +	union hv_input_vtl input_vtl;
> +
> +	config.as_u64 = 0;
> +	config.default_vtl_protection_mask = HV_MAP_GPA_PERMISSIONS_MASK;
> +	config.enable_vtl_protection = 1;
> +	config.zero_memory_on_reset = 1;
> +	config.intercept_vp_startup = 1;
> +	config.intercept_cpuid_unimplemented = 1;
> +
> +	if (mshv_vsm_capabilities.intercept_page_available) {
> +		pr_debug("%s: using intercept page", __func__);

Again, __func__ is not needed, you are providing it twice here for no
real reason except to waste storage space :)

> +		config.intercept_page = 1;
> +	}
> +
> +	reg_assoc.name = HV_REGISTER_VSM_PARTITION_CONFIG;
> +	reg_assoc.value.reg64 = config.as_u64;
> +	input_vtl.as_uint8 = 0;
> +
> +	return hv_call_set_vp_registers(HV_VP_INDEX_SELF, HV_PARTITION_ID_SELF,
> +				       1, input_vtl, &reg_assoc);


None of this needs to be unwound if initialization fails later on?

thanks,

greg k-h
Wei Liu Sept. 23, 2023, 8:58 p.m. UTC | #4
Greg, thank you for looking at the code.

On Sat, Sep 23, 2023 at 09:56:13AM +0200, Greg KH wrote:
> On Fri, Sep 22, 2023 at 11:38:35AM -0700, Nuno Das Neves wrote:
> > +static int __init mshv_vtl_init(void)
> > +{
> > +	int ret;
> > +
> > +	tasklet_init(&msg_dpc, mshv_vtl_sint_on_msg_dpc, 0);
> > +	init_waitqueue_head(&fd_wait_queue);
> > +
> > +	if (mshv_vtl_get_vsm_regs()) {
> > +		pr_emerg("%s: Unable to get VSM capabilities !!\n", __func__);
> > +		BUG();
> > +	}
> 
> 
> So you crash the whole kernel if someone loads this module on a non-mshv
> system?
> 
> That seems quite excessive and hostile :(
> 

I agree this can be more lenient. It should be safe to just return an
error code (ENODEV) and let user space decide the next action.

Saurabh, let me know what you think.

Thanks,
Wei.

> Or am I somehow reading this incorrectly?
> 
> thanks,
> 
> greg k-h
>
Saurabh Singh Sengar Sept. 24, 2023, 4:48 a.m. UTC | #5
On Sat, Sep 23, 2023 at 08:58:00PM +0000, Wei Liu wrote:
> Greg, thank you for looking at the code.
> 
> On Sat, Sep 23, 2023 at 09:56:13AM +0200, Greg KH wrote:
> > On Fri, Sep 22, 2023 at 11:38:35AM -0700, Nuno Das Neves wrote:
> > > +static int __init mshv_vtl_init(void)
> > > +{
> > > +	int ret;
> > > +
> > > +	tasklet_init(&msg_dpc, mshv_vtl_sint_on_msg_dpc, 0);
> > > +	init_waitqueue_head(&fd_wait_queue);
> > > +
> > > +	if (mshv_vtl_get_vsm_regs()) {
> > > +		pr_emerg("%s: Unable to get VSM capabilities !!\n", __func__);
> > > +		BUG();
> > > +	}
> > 
> > 
> > So you crash the whole kernel if someone loads this module on a non-mshv
> > system?
> > 
> > That seems quite excessive and hostile :(
> > 
> 
> I agree this can be more lenient. It should be safe to just return an
> error code (ENODEV) and let user space decide the next action.
> 
> Saurabh, let me know what you think.

Thanks for reporting this. I agree, returning an error makes more sense here.
We should do this for both BUG() in this init.

- Saurabh

> 
> Thanks,
> Wei.
> 
> > Or am I somehow reading this incorrectly?
> > 
> > thanks,
> > 
> > greg k-h
> >
Nuno Das Neves Sept. 26, 2023, 12:07 a.m. UTC | #6
Resend in plain text instead of HTML - oops!

On 9/23/2023 12:58 AM, Greg KH wrote:
> On Fri, Sep 22, 2023 at 11:38:35AM -0700, Nuno Das Neves wrote:
>> +static int mshv_vtl_get_vsm_regs(void)
>> +{
>> +	struct hv_register_assoc registers[2];
>> +	union hv_input_vtl input_vtl;
>> +	int ret, count = 2;
>> +
>> +	input_vtl.as_uint8 = 0;
>> +	registers[0].name = HV_REGISTER_VSM_CODE_PAGE_OFFSETS;
>> +	registers[1].name = HV_REGISTER_VSM_CAPABILITIES;
>> +
>> +	ret = hv_call_get_vp_registers(HV_VP_INDEX_SELF, HV_PARTITION_ID_SELF,
>> +				       count, input_vtl, registers);
>> +	if (ret)
>> +		return ret;
>> +
>> +	mshv_vsm_page_offsets.as_uint64 = registers[0].value.reg64;
>> +	mshv_vsm_capabilities.as_uint64 = registers[1].value.reg64;
>> +
>> +	pr_debug("%s: VSM code page offsets: %#016llx\n", __func__,
>> +		 mshv_vsm_page_offsets.as_uint64);
>> +	pr_info("%s: VSM capabilities: %#016llx\n", __func__,
>> +		mshv_vsm_capabilities.as_uint64);
> 
> When drivers are working properly, they are quiet.  This is very noisy
> and probably is leaking memory addresses to userspace?
> 

I will remove these, thanks.

> Also, there is NEVER a need for __func__ in a pr_debug() line, it has
> that for you automatically.
> 

Thank you, I didn't know this.

> Also, drivers should never call pr_*() calls, always use the proper
> dev_*() calls instead.
> 

We only use struct device in one place in this driver, I think that is 
the only place it makes sense to use dev_*() over pr_*() calls.
> 
> 
>> +
>> +	return ret;
>> +}
>> +
>> +static int mshv_vtl_configure_vsm_partition(void)
>> +{
>> +	union hv_register_vsm_partition_config config;
>> +	struct hv_register_assoc reg_assoc;
>> +	union hv_input_vtl input_vtl;
>> +
>> +	config.as_u64 = 0;
>> +	config.default_vtl_protection_mask = HV_MAP_GPA_PERMISSIONS_MASK;
>> +	config.enable_vtl_protection = 1;
>> +	config.zero_memory_on_reset = 1;
>> +	config.intercept_vp_startup = 1;
>> +	config.intercept_cpuid_unimplemented = 1;
>> +
>> +	if (mshv_vsm_capabilities.intercept_page_available) {
>> +		pr_debug("%s: using intercept page", __func__);
> 
> Again, __func__ is not needed, you are providing it twice here for no
> real reason except to waste storage space :)
> 

Thanks, I will review all the uses of pr_debug().

>> +		config.intercept_page = 1;
>> +	}
>> +
>> +	reg_assoc.name = HV_REGISTER_VSM_PARTITION_CONFIG;
>> +	reg_assoc.value.reg64 = config.as_u64;
>> +	input_vtl.as_uint8 = 0;
>> +
>> +	return hv_call_set_vp_registers(HV_VP_INDEX_SELF, HV_PARTITION_ID_SELF,
>> +				       1, input_vtl, &reg_assoc);
> 
> 
> None of this needs to be unwound if initialization fails later on?
> 

I think unwinding this is not needed, not 100% sure.
Saurabh, can you comment?

Thanks,
Nuno

> thanks,
> 
> greg k-h
Greg Kroah-Hartman Sept. 26, 2023, 4:52 a.m. UTC | #7
On Mon, Sep 25, 2023 at 05:07:24PM -0700, Nuno Das Neves wrote:
> On 9/23/2023 12:58 AM, Greg KH wrote:
> > Also, drivers should never call pr_*() calls, always use the proper
> > dev_*() calls instead.
> > 
> 
> We only use struct device in one place in this driver, I think that is the
> only place it makes sense to use dev_*() over pr_*() calls.

Then the driver needs to be fixed to use struct device properly so that
you do have access to it when you want to print messages.  That's a
valid reason to pass around your device structure when needed.

thanks,

greg k-h
Wei Liu Sept. 26, 2023, 5:54 a.m. UTC | #8
On Tue, Sep 26, 2023 at 06:52:46AM +0200, Greg KH wrote:
> On Mon, Sep 25, 2023 at 05:07:24PM -0700, Nuno Das Neves wrote:
> > On 9/23/2023 12:58 AM, Greg KH wrote:
> > > Also, drivers should never call pr_*() calls, always use the proper
> > > dev_*() calls instead.
> > > 
> > 
> > We only use struct device in one place in this driver, I think that is the
> > only place it makes sense to use dev_*() over pr_*() calls.
> 
> Then the driver needs to be fixed to use struct device properly so that
> you do have access to it when you want to print messages.  That's a
> valid reason to pass around your device structure when needed.

Greg, ACRN and Nitro drivers do not pass around the device structure.
Instead, they rely on a global struct device. We can follow the same.

Nuno, I checked our code. We already have a misc device. We can use that
for dev_* calls.

Something like this:

   dev_warn(mshv_dev.this_device, "this is a warning message");

This should resolve Greg's concern.

Thanks,
Wei.

> 
> thanks,
> 
> greg k-h
>
Greg Kroah-Hartman Sept. 26, 2023, 6:31 a.m. UTC | #9
On Tue, Sep 26, 2023 at 05:54:34AM +0000, Wei Liu wrote:
> On Tue, Sep 26, 2023 at 06:52:46AM +0200, Greg KH wrote:
> > On Mon, Sep 25, 2023 at 05:07:24PM -0700, Nuno Das Neves wrote:
> > > On 9/23/2023 12:58 AM, Greg KH wrote:
> > > > Also, drivers should never call pr_*() calls, always use the proper
> > > > dev_*() calls instead.
> > > > 
> > > 
> > > We only use struct device in one place in this driver, I think that is the
> > > only place it makes sense to use dev_*() over pr_*() calls.
> > 
> > Then the driver needs to be fixed to use struct device properly so that
> > you do have access to it when you want to print messages.  That's a
> > valid reason to pass around your device structure when needed.
> 
> Greg, ACRN and Nitro drivers do not pass around the device structure.
> Instead, they rely on a global struct device. We can follow the same.

A single global struct device is wrong, please don't do that.

Don't copy bad design patterns from other drivers, be better :)

thanks,

greg k-h
Wei Liu Sept. 26, 2023, 7 a.m. UTC | #10
On Tue, Sep 26, 2023 at 08:31:03AM +0200, Greg KH wrote:
> On Tue, Sep 26, 2023 at 05:54:34AM +0000, Wei Liu wrote:
> > On Tue, Sep 26, 2023 at 06:52:46AM +0200, Greg KH wrote:
> > > On Mon, Sep 25, 2023 at 05:07:24PM -0700, Nuno Das Neves wrote:
> > > > On 9/23/2023 12:58 AM, Greg KH wrote:
> > > > > Also, drivers should never call pr_*() calls, always use the proper
> > > > > dev_*() calls instead.
> > > > > 
> > > > 
> > > > We only use struct device in one place in this driver, I think that is the
> > > > only place it makes sense to use dev_*() over pr_*() calls.
> > > 
> > > Then the driver needs to be fixed to use struct device properly so that
> > > you do have access to it when you want to print messages.  That's a
> > > valid reason to pass around your device structure when needed.
> > 
> > Greg, ACRN and Nitro drivers do not pass around the device structure.
> > Instead, they rely on a global struct device. We can follow the same.
> 
> A single global struct device is wrong, please don't do that.
> 
> Don't copy bad design patterns from other drivers, be better :)
> 

If we're working with real devices like network cards or graphics cards
I would agree -- it is easy to imagine that we have several cards of the
same model in the system -- but in real world there won't be two
hypervisor instances running on the same hardware.

We can stash the struct device inside some private data fields, but that
doesn't change the fact that we're still having one instance of the
structure. Is this what you want? Or do you have something else in mind?

Thanks,
Wei.

> thanks,
> 
> greg k-h
>
Greg Kroah-Hartman Sept. 26, 2023, 8:03 a.m. UTC | #11
On Tue, Sep 26, 2023 at 07:00:51AM +0000, Wei Liu wrote:
> On Tue, Sep 26, 2023 at 08:31:03AM +0200, Greg KH wrote:
> > On Tue, Sep 26, 2023 at 05:54:34AM +0000, Wei Liu wrote:
> > > On Tue, Sep 26, 2023 at 06:52:46AM +0200, Greg KH wrote:
> > > > On Mon, Sep 25, 2023 at 05:07:24PM -0700, Nuno Das Neves wrote:
> > > > > On 9/23/2023 12:58 AM, Greg KH wrote:
> > > > > > Also, drivers should never call pr_*() calls, always use the proper
> > > > > > dev_*() calls instead.
> > > > > > 
> > > > > 
> > > > > We only use struct device in one place in this driver, I think that is the
> > > > > only place it makes sense to use dev_*() over pr_*() calls.
> > > > 
> > > > Then the driver needs to be fixed to use struct device properly so that
> > > > you do have access to it when you want to print messages.  That's a
> > > > valid reason to pass around your device structure when needed.
> > > 
> > > Greg, ACRN and Nitro drivers do not pass around the device structure.
> > > Instead, they rely on a global struct device. We can follow the same.
> > 
> > A single global struct device is wrong, please don't do that.
> > 
> > Don't copy bad design patterns from other drivers, be better :)
> > 
> 
> If we're working with real devices like network cards or graphics cards
> I would agree -- it is easy to imagine that we have several cards of the
> same model in the system -- but in real world there won't be two
> hypervisor instances running on the same hardware.
> 
> We can stash the struct device inside some private data fields, but that
> doesn't change the fact that we're still having one instance of the
> structure. Is this what you want? Or do you have something else in mind?

You have a real device, it's how userspace interacts with your
subsystem.  Please use that, it is dynamically created and handled and
is the correct representation here.

thanks,

greg k-h
Saurabh Singh Sengar Sept. 26, 2023, 12:33 p.m. UTC | #12
On Mon, Sep 25, 2023 at 05:07:24PM -0700, Nuno Das Neves wrote:
> Resend in plain text instead of HTML - oops!
> 
> On 9/23/2023 12:58 AM, Greg KH wrote:
> >On Fri, Sep 22, 2023 at 11:38:35AM -0700, Nuno Das Neves wrote:
> >>+static int mshv_vtl_get_vsm_regs(void)
> >>+{
> >>+	struct hv_register_assoc registers[2];
> >>+	union hv_input_vtl input_vtl;
> >>+	int ret, count = 2;
> >>+
> >>+	input_vtl.as_uint8 = 0;
> >>+	registers[0].name = HV_REGISTER_VSM_CODE_PAGE_OFFSETS;
> >>+	registers[1].name = HV_REGISTER_VSM_CAPABILITIES;
> >>+
> >>+	ret = hv_call_get_vp_registers(HV_VP_INDEX_SELF, HV_PARTITION_ID_SELF,
> >>+				       count, input_vtl, registers);
> >>+	if (ret)
> >>+		return ret;
> >>+
> >>+	mshv_vsm_page_offsets.as_uint64 = registers[0].value.reg64;
> >>+	mshv_vsm_capabilities.as_uint64 = registers[1].value.reg64;
> >>+
> >>+	pr_debug("%s: VSM code page offsets: %#016llx\n", __func__,
> >>+		 mshv_vsm_page_offsets.as_uint64);
> >>+	pr_info("%s: VSM capabilities: %#016llx\n", __func__,
> >>+		mshv_vsm_capabilities.as_uint64);
> >
> >When drivers are working properly, they are quiet.  This is very noisy
> >and probably is leaking memory addresses to userspace?
> >
> 
> I will remove these, thanks.
> 
> >Also, there is NEVER a need for __func__ in a pr_debug() line, it has
> >that for you automatically.
> >
> 
> Thank you, I didn't know this.
> 
> >Also, drivers should never call pr_*() calls, always use the proper
> >dev_*() calls instead.
> >
> 
> We only use struct device in one place in this driver, I think that
> is the only place it makes sense to use dev_*() over pr_*() calls.
> >
> >
> >>+
> >>+	return ret;
> >>+}
> >>+
> >>+static int mshv_vtl_configure_vsm_partition(void)
> >>+{
> >>+	union hv_register_vsm_partition_config config;
> >>+	struct hv_register_assoc reg_assoc;
> >>+	union hv_input_vtl input_vtl;
> >>+
> >>+	config.as_u64 = 0;
> >>+	config.default_vtl_protection_mask = HV_MAP_GPA_PERMISSIONS_MASK;
> >>+	config.enable_vtl_protection = 1;
> >>+	config.zero_memory_on_reset = 1;
> >>+	config.intercept_vp_startup = 1;
> >>+	config.intercept_cpuid_unimplemented = 1;
> >>+
> >>+	if (mshv_vsm_capabilities.intercept_page_available) {
> >>+		pr_debug("%s: using intercept page", __func__);
> >
> >Again, __func__ is not needed, you are providing it twice here for no
> >real reason except to waste storage space :)
> >
> 
> Thanks, I will review all the uses of pr_debug().
> 
> >>+		config.intercept_page = 1;
> >>+	}
> >>+
> >>+	reg_assoc.name = HV_REGISTER_VSM_PARTITION_CONFIG;
> >>+	reg_assoc.value.reg64 = config.as_u64;
> >>+	input_vtl.as_uint8 = 0;
> >>+
> >>+	return hv_call_set_vp_registers(HV_VP_INDEX_SELF, HV_PARTITION_ID_SELF,
> >>+				       1, input_vtl, &reg_assoc);
> >
> >
> >None of this needs to be unwound if initialization fails later on?
> >
> 
> I think unwinding this is not needed, not 100% sure.
> Saurabh, can you comment?

Yes unwinding is not useful here, as these are synthetic register
and there is no other use case of VSM supporting platforms other
than VSM configuration.

In a non VSM supported platform hv_call_set_vp_registers itself will
fail for HV_REGISTER_VSM_PARTITION_CONFIG.

- Saurabh

> 
> Thanks,
> Nuno
> 
> >thanks,
> >
> >greg k-h
>
Nuno Das Neves Sept. 26, 2023, 9:52 p.m. UTC | #13
On 9/26/2023 1:03 AM, Greg KH wrote:
> On Tue, Sep 26, 2023 at 07:00:51AM +0000, Wei Liu wrote:
>> On Tue, Sep 26, 2023 at 08:31:03AM +0200, Greg KH wrote:
>>> On Tue, Sep 26, 2023 at 05:54:34AM +0000, Wei Liu wrote:
>>>> On Tue, Sep 26, 2023 at 06:52:46AM +0200, Greg KH wrote:
>>>>> On Mon, Sep 25, 2023 at 05:07:24PM -0700, Nuno Das Neves wrote:
>>>>>> On 9/23/2023 12:58 AM, Greg KH wrote:
>>>>>>> Also, drivers should never call pr_*() calls, always use the proper
>>>>>>> dev_*() calls instead.
>>>>>>>
>>>>>>
>>>>>> We only use struct device in one place in this driver, I think that is the
>>>>>> only place it makes sense to use dev_*() over pr_*() calls.
>>>>>
>>>>> Then the driver needs to be fixed to use struct device properly so that
>>>>> you do have access to it when you want to print messages.  That's a
>>>>> valid reason to pass around your device structure when needed.
>>>>

What is the tangible benefit of using dev_*() over pr_*()? As I said,
our use of struct device is very limited compared to all the places we
may need to log errors.

pr_*() is used by many, many drivers; it seems to be the norm. We can
certainly add a pr_fmt to improve the logging.

>>>> Greg, ACRN and Nitro drivers do not pass around the device structure.
>>>> Instead, they rely on a global struct device. We can follow the same.
>>>
>>> A single global struct device is wrong, please don't do that.
>>>
>>> Don't copy bad design patterns from other drivers, be better :)
>>>

What makes it a bad pattern? It seems to be well-established, and is
also used by KVM which this driver is loosely modeled after:
https://elixir.bootlin.com/linux/v6.5.5/source/virt/kvm/kvm_main.c#L5128

>>
>> If we're working with real devices like network cards or graphics cards
>> I would agree -- it is easy to imagine that we have several cards of the
>> same model in the system -- but in real world there won't be two
>> hypervisor instances running on the same hardware.
>>
>> We can stash the struct device inside some private data fields, but that
>> doesn't change the fact that we're still having one instance of the
>> structure. Is this what you want? Or do you have something else in mind?
> 
> You have a real device, it's how userspace interacts with your
> subsystem.  Please use that, it is dynamically created and handled and
> is the correct representation here.
> 

Are you referring to the struct device we get from calling
misc_register? If not, please be more specific.

How would you suggest we get a reference to that device via e.g. open()
or ioctl() without keeping a global reference to it?


Thanks,
Nuno
Greg Kroah-Hartman Sept. 27, 2023, 6:01 a.m. UTC | #14
On Tue, Sep 26, 2023 at 02:52:36PM -0700, Nuno Das Neves wrote:
> On 9/26/2023 1:03 AM, Greg KH wrote:
> > On Tue, Sep 26, 2023 at 07:00:51AM +0000, Wei Liu wrote:
> > > On Tue, Sep 26, 2023 at 08:31:03AM +0200, Greg KH wrote:
> > > > On Tue, Sep 26, 2023 at 05:54:34AM +0000, Wei Liu wrote:
> > > > > On Tue, Sep 26, 2023 at 06:52:46AM +0200, Greg KH wrote:
> > > > > > On Mon, Sep 25, 2023 at 05:07:24PM -0700, Nuno Das Neves wrote:
> > > > > > > On 9/23/2023 12:58 AM, Greg KH wrote:
> > > > > > > > Also, drivers should never call pr_*() calls, always use the proper
> > > > > > > > dev_*() calls instead.
> > > > > > > > 
> > > > > > > 
> > > > > > > We only use struct device in one place in this driver, I think that is the
> > > > > > > only place it makes sense to use dev_*() over pr_*() calls.
> > > > > > 
> > > > > > Then the driver needs to be fixed to use struct device properly so that
> > > > > > you do have access to it when you want to print messages.  That's a
> > > > > > valid reason to pass around your device structure when needed.
> > > > > 
> 
> What is the tangible benefit of using dev_*() over pr_*()?

Unified reporting and handling of userspace of kernel log messages so
they can be classified properly as well as dealing correctly with the
dynamic debugging kernel infrastructure.

Why wouldn't you want to use it?

> As I said,
> our use of struct device is very limited compared to all the places we
> may need to log errors.

Then please fix that.

> pr_*() is used by many, many drivers; it seems to be the norm.

Not at all, it is not.

> We can certainly add a pr_fmt to improve the logging.

Please do it correctly so you don't have to go and add support for it
later when your tools people ask you why they can't properly parse your
driver's kernel log messages.

> > > If we're working with real devices like network cards or graphics cards
> > > I would agree -- it is easy to imagine that we have several cards of the
> > > same model in the system -- but in real world there won't be two
> > > hypervisor instances running on the same hardware.
> > > 
> > > We can stash the struct device inside some private data fields, but that
> > > doesn't change the fact that we're still having one instance of the
> > > structure. Is this what you want? Or do you have something else in mind?
> > 
> > You have a real device, it's how userspace interacts with your
> > subsystem.  Please use that, it is dynamically created and handled and
> > is the correct representation here.
> > 
> 
> Are you referring to the struct device we get from calling
> misc_register?

Yes.

> How would you suggest we get a reference to that device via e.g. open()
> or ioctl() without keeping a global reference to it?

You explicitly have it in your open() and ioctl() call, you never need a
global reference for it the kernel gives it to you!

thanks,

greg k-h
Wei Liu Sept. 27, 2023, 8:04 a.m. UTC | #15
Hi Greg

It is past midnight here, so I may not be able to articulate my thoughts
very well. I want to avoid waiting for another day for another round
trip of emails though. We can look at your reply in the morning and
reply again.

On Wed, Sep 27, 2023 at 08:01:01AM +0200, Greg KH wrote:
[...]
> > > > If we're working with real devices like network cards or graphics cards
> > > > I would agree -- it is easy to imagine that we have several cards of the
> > > > same model in the system -- but in real world there won't be two
> > > > hypervisor instances running on the same hardware.
> > > > 
> > > > We can stash the struct device inside some private data fields, but that
> > > > doesn't change the fact that we're still having one instance of the
> > > > structure. Is this what you want? Or do you have something else in mind?
> > > 
> > > You have a real device, it's how userspace interacts with your
> > > subsystem.  Please use that, it is dynamically created and handled and
> > > is the correct representation here.
> > > 
> > 
> > Are you referring to the struct device we get from calling
> > misc_register?
> 
> Yes.
> 

We know about this, please see below. And we plan to use this.

> > How would you suggest we get a reference to that device via e.g. open()
> > or ioctl() without keeping a global reference to it?
> 
> You explicitly have it in your open() and ioctl() call, you never need a
> global reference for it the kernel gives it to you!
> 

This is what I don't follow.

Nuno and I discussed this today offline. We looked at the code before
and looked again today (well, yesterday now).

Here are the two functions:

    int vfs_open(const struct path *path, struct file *file)
    long vfs_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)

Or, if we provide an open function in our file_operations struct, we get
an additional struct inode pointer.

    int (*open) (struct inode *, struct file *);

Neither struct file nor struct inode contains a reference to struct device.

Then in vfs.rst, there is a section about open:

``open``
        called by the VFS when an inode should be opened.  When the VFS
        opens a file, it creates a new "struct file".  It then calls the
        open method for the newly allocated file structure.  You might
        think that the open method really belongs in "struct
        inode_operations", and you may be right.  I think it's done the
        way it is because it makes filesystems simpler to implement.
        The open() method is a good place to initialize the
        "private_data" member in the file structure if you want to point
        to a device structure

So, the driver is supposed to stash a pointer to struct device in
private_data. That's what I alluded to in my previous reply. The core
driver framework or the VFS doesn't give us a reference to struct
device. We have to do it ourselves.

We can do that for sure, but the struct device we stash into
private_data is going to be the one that is returned from misc_register,
which at the same time is already stashed inside a static variable in
our driver by our own code (Note that this is a pervasive pattern in the
kernel).

I hope this is clear. If we're missing something extremely obvious, that
somehow we can get a reference to struct device from the VFS while
opening the file or doing ioctl without stashing it ourselves in our own
code, please let us know.

At this point I feel like either I'm extremely stupid or we're just
talking past each other. If you tell me it is the former and help me
understand how we can achieve what you described, I am more than happy
to learn new things I don't know or understand. :-)

If we have to propagate that reference ourselves, that then leads to
next question whether it will just be more convenient to use the stashed
value in the static variable directly like other drivers do, instead of
stashing and propagating it around, knowing 100% it is the same object.
I don't feel too strongly about this. As long as we are on the same page
about getting a reference to struct device, we can do it either way.

Thanks,
Wei.


> thanks,
> 
> greg k-h
>
Greg Kroah-Hartman Sept. 27, 2023, 8:33 a.m. UTC | #16
On Wed, Sep 27, 2023 at 08:04:42AM +0000, Wei Liu wrote:
> On Wed, Sep 27, 2023 at 08:01:01AM +0200, Greg KH wrote:
> [...]
> > > > > If we're working with real devices like network cards or graphics cards
> > > > > I would agree -- it is easy to imagine that we have several cards of the
> > > > > same model in the system -- but in real world there won't be two
> > > > > hypervisor instances running on the same hardware.
> > > > > 
> > > > > We can stash the struct device inside some private data fields, but that
> > > > > doesn't change the fact that we're still having one instance of the
> > > > > structure. Is this what you want? Or do you have something else in mind?
> > > > 
> > > > You have a real device, it's how userspace interacts with your
> > > > subsystem.  Please use that, it is dynamically created and handled and
> > > > is the correct representation here.
> > > > 
> > > 
> > > Are you referring to the struct device we get from calling
> > > misc_register?
> > 
> > Yes.
> > 
> 
> We know about this, please see below. And we plan to use this.
> 
> > > How would you suggest we get a reference to that device via e.g. open()
> > > or ioctl() without keeping a global reference to it?
> > 
> > You explicitly have it in your open() and ioctl() call, you never need a
> > global reference for it the kernel gives it to you!
> > 
> 
> This is what I don't follow.
> 
> Nuno and I discussed this today offline. We looked at the code before
> and looked again today (well, yesterday now).
> 
> Here are the two functions:
> 
>     int vfs_open(const struct path *path, struct file *file)
>     long vfs_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
> 
> Or, if we provide an open function in our file_operations struct, we get
> an additional struct inode pointer.
> 
>     int (*open) (struct inode *, struct file *);
> 
> Neither struct file nor struct inode contains a reference to struct device.
> 
> Then in vfs.rst, there is a section about open:
> 
> ``open``
>         called by the VFS when an inode should be opened.  When the VFS
>         opens a file, it creates a new "struct file".  It then calls the
>         open method for the newly allocated file structure.  You might
>         think that the open method really belongs in "struct
>         inode_operations", and you may be right.  I think it's done the
>         way it is because it makes filesystems simpler to implement.
>         The open() method is a good place to initialize the
>         "private_data" member in the file structure if you want to point
>         to a device structure
> 
> So, the driver is supposed to stash a pointer to struct device in
> private_data. That's what I alluded to in my previous reply. The core
> driver framework or the VFS doesn't give us a reference to struct
> device. We have to do it ourselves.

Please read Linux Device Drivers, 3rd edition, chapter 3, for how to do
this properly.  The book is free online.

Also look at the zillion in-kernel example drivers that use the misc
device api, container_of() is your friend...

> We can do that for sure, but the struct device we stash into
> private_data is going to be the one that is returned from misc_register,
> which at the same time is already stashed inside a static variable in
> our driver by our own code (Note that this is a pervasive pattern in the
> kernel).

Again, don't make this static, there's no requirement to do so.

But even if you do, sure, use it this way, you have a device.  But I
would strongly discourage you from having a static variable, there is no
need to do this at all, and no one else should do so either.

thanks,

greg k-h
Nuno Das Neves Sept. 28, 2023, 12:17 a.m. UTC | #17
On 9/27/2023 1:33 AM, Greg KH wrote:
> On Wed, Sep 27, 2023 at 08:04:42AM +0000, Wei Liu wrote:
>> So, the driver is supposed to stash a pointer to struct device in
>> private_data. That's what I alluded to in my previous reply. The core
>> driver framework or the VFS doesn't give us a reference to struct
>> device. We have to do it ourselves.
> 
> Please read Linux Device Drivers, 3rd edition, chapter 3, for how to do
> this properly.  The book is free online.
> 

Thanks, the issue that confused us was how to get the miscdevice.
I eventually found the answer in the misc_register() documentation:

"By default, an open() syscall to the device sets file->private_data to
point to the structure."

That's good - when we create a guest, we will have the miscdevice
in private_data already. Then we can just put it in our per-guest data
structure. That will let us retrieve the device in the other ioctls so 
we can call dev_*().

Thanks,
Nuno