
[v15,0/6] drm/xe/xe_vm: Implement xe_vm_get_property_ioctl

Message ID 20250328204045.157914-1-jonathan.cavitt@intel.com (mailing list archive)

Message

Jonathan Cavitt March 28, 2025, 8:40 p.m. UTC
Add information to each VM so that it can report up to the first 50
faults seen.  Only pagefaults are saved this way currently, though
eventually all faults should be tracked by the VM for later reporting.

Additionally, of the pagefaults reported, only failed pagefaults are
saved this way, as successful pagefaults should recover silently and not
need to be reported to userspace.

To allow userspace to access these faults, a new ioctl -
xe_vm_get_property_ioctl - was created.
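
The storage policy described above (keep only failed pagefaults, and only
the first 50 of them) can be sketched in plain userspace C.  This is a
minimal illustration, not the code from the series: struct fault_entry,
struct fault_list and fault_list_add are hypothetical names, and only
MAX_FAULTS_SAVED_PER_VM and the limit of 50 come from the text above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Cap taken from the cover letter ("first 50 faults"). */
#define MAX_FAULTS_SAVED_PER_VM 50

/* Hypothetical per-fault record; the real layout is defined by the patches. */
struct fault_entry {
	uint64_t address;
	bool failed;	/* true if the pagefault could not be serviced */
};

/* Hypothetical fixed-size store standing in for the per-VM fault list. */
struct fault_list {
	struct fault_entry entries[MAX_FAULTS_SAVED_PER_VM];
	size_t len;
};

/*
 * Try to record a fault.  Returns true if the fault was saved, false if it
 * was dropped because it succeeded or because the list is already full.
 */
static bool fault_list_add(struct fault_list *list, const struct fault_entry *f)
{
	assert(list && f);

	/* Successful pagefaults recover silently and are never stored. */
	if (!f->failed)
		return false;

	/* Keep the first N faults: once full, later faults are discarded. */
	if (list->len >= MAX_FAULTS_SAVED_PER_VM)
		return false;

	list->entries[list->len++] = *f;
	return true;
}
```

Checking the cap before copying means a full list costs nothing beyond the
comparison, and the oldest faults, which are usually closest to the root
cause, are the ones that are kept.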

v2: (Matt Brost)
- Break full ban list request into a separate property.
- Reformat drm_xe_vm_get_property struct.
- Remove need for drm_xe_faults helper struct.
- Separate data pointer and scalar return value in ioctl.
- Get address type on pagefault report and save it to the pagefault.
- Correctly reject writes to read-only VMAs.
- Miscellaneous formatting fixes.

v3: (Matt Brost)
- Only allow querying of failed pagefaults

v4:
- Remove unnecessary size parameter from helper function, as it
  is a property of the arguments. (jcavitt)
- Remove unnecessary copy_from_user (Jianxun)
- Set address_precision to 1 (Jianxun)
- Report max size instead of dynamic size for memory allocation
  purposes.  Total memory usage is reported separately.

v5:
- Return int from xe_vm_get_property_size (Shuicheng)
- Fix memory leak (Shuicheng)
- Remove unnecessary size variable (jcavitt)

v6:
- Free vm after use (Shuicheng)
- Compress pf copy logic (Shuicheng)
- Update fault_unsuccessful before storing (Shuicheng)
- Fix old struct name in comments (Shuicheng)
- Keep first 50 pagefaults instead of last 50 (Jianxun)
- Rename ioctl to xe_vm_get_faults_ioctl (jcavitt)

v7:
- Avoid unnecessary execution by checking MAX_PFS earlier (jcavitt)
- Fix double-locking error (jcavitt)
- Assert kmemdup is successful (Shuicheng)
- Repair and move fill_faults break condition (Dan Carpenter)
- Free vm after use (jcavitt)
- Combine assertions (jcavitt)
- Expand size check in xe_vm_get_faults_ioctl (jcavitt)
- Remove return mask from fill_faults, as return is already -EFAULT or 0
  (jcavitt)

v8:
- Revert back to using drm_xe_vm_get_property_ioctl
- s/Migrate/Move (Michal)
- s/xe_pagefault/xe_gt_pagefault (Michal)
- Create new header file, xe_gt_pagefault_types.h (Michal)
- Add and fix kernel docs (Michal)
- Rename xe_vm.pfs to xe_vm.faults (jcavitt)
- Store fault data and not pagefault in xe_vm faults list (jcavitt)
- Store address, address type, and address precision per fault (jcavitt)
- Store engine class and instance data per fault (Jianxun)
- Properly handle kzalloc error (Michal W)
- s/MAX_PFS/MAX_FAULTS_SAVED_PER_VM (Michal W)
- Store fault level per fault (Michal M)
- Apply better copy_to_user logic (jcavitt)

v9:
- More kernel doc fixes (Michal W, Jianxun)
- Better error handling (jcavitt)

v10:
- Convert enums to defines in regs folder (Michal W)
- Move xe_guc_pagefault_desc to regs folder (Michal W)
- Future-proof size logic for zero-size properties (jcavitt)
- Replace address type extern with access type (Jianxun)
- Add fault type to xe_drm_fault (Jianxun)

v11:
- Remove unnecessary switch case logic (Raag)
- Compress size get, size validation, and property fill functions into a
  single helper function (jcavitt)
- Assert valid size (jcavitt)
- Store pagefaults in non-fault-mode VMs as well (Jianxun)

v12:
- Remove unnecessary else condition
- Correct backwards helper function size logic (jcavitt)
- Fix kernel docs and comments (Michal W)

v13:
- Move xe and user engine class mapping arrays to header (John H)

v14:
- Fix double locking issue (Jianxun)
- Use size_t instead of int (Raag)
- Remove unnecessary includes (jcavitt)

v15:
- Do not report faults from reserved engines (Jianxun)

Signed-off-by: Jonathan Cavitt <jonathan.cavitt@intel.com>
Suggested-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Suggested-by: Matthew Brost <matthew.brost@intel.com>
Cc: Zhang Jianxun <jianxun.zhang@intel.com>
Cc: Shuicheng Lin <shuicheng.lin@intel.com>
Cc: Michal Wajdeczko <Michal.Wajdeczko@intel.com>
Cc: Michal Mrozek <michal.mrozek@intel.com>
Cc: Raag Jadav <raag.jadav@intel.com>
Cc: John Harrison <john.c.harrison@intel.com>

Jonathan Cavitt (6):
  drm/xe/xe_hw_engine: Map xe and user engine class in header
  drm/xe/xe_gt_pagefault: Disallow writes to read-only VMAs
  drm/xe/xe_gt_pagefault: Move pagefault struct to header
  drm/xe/uapi: Define drm_xe_vm_get_property
  drm/xe/xe_vm: Add per VM fault info
  drm/xe/xe_vm: Implement xe_vm_get_property_ioctl

 drivers/gpu/drm/xe/regs/xe_pagefault_desc.h |  50 ++++++
 drivers/gpu/drm/xe/xe_device.c              |   3 +
 drivers/gpu/drm/xe/xe_gt_pagefault.c        |  72 ++++----
 drivers/gpu/drm/xe/xe_gt_pagefault_types.h  |  42 +++++
 drivers/gpu/drm/xe/xe_guc_fwif.h            |  28 ----
 drivers/gpu/drm/xe/xe_hw_engine.c           |  24 ++-
 drivers/gpu/drm/xe/xe_hw_engine_types.h     |   3 +
 drivers/gpu/drm/xe/xe_query.c               |  18 +-
 drivers/gpu/drm/xe/xe_vm.c                  | 177 ++++++++++++++++++++
 drivers/gpu/drm/xe/xe_vm.h                  |  11 ++
 drivers/gpu/drm/xe/xe_vm_types.h            |  32 ++++
 include/uapi/drm/xe_drm.h                   |  79 +++++++++
 12 files changed, 453 insertions(+), 86 deletions(-)
 create mode 100644 drivers/gpu/drm/xe/regs/xe_pagefault_desc.h
 create mode 100644 drivers/gpu/drm/xe/xe_gt_pagefault_types.h

Comments

Lionel Landwerlin March 31, 2025, 8:18 a.m. UTC | #1
Hi Jonathan,

Are the pagefaults reported for any unit in the GPU (including the command
streamer?) or is it limited to execution units?

Thanks,

-Lionel

Jonathan Cavitt March 31, 2025, 2:11 p.m. UTC | #2
-----Original Message-----
From: Landwerlin, Lionel G <lionel.g.landwerlin@intel.com> 
Sent: Monday, March 31, 2025 1:18 AM
To: Cavitt, Jonathan <jonathan.cavitt@intel.com>; intel-xe@lists.freedesktop.org
Cc: Gupta, saurabhg <saurabhg.gupta@intel.com>; Zuo, Alex <alex.zuo@intel.com>; joonas.lahtinen@linux.intel.com; Brost, Matthew <matthew.brost@intel.com>; Zhang, Jianxun <jianxun.zhang@intel.com>; Lin, Shuicheng <shuicheng.lin@intel.com>; dri-devel@lists.freedesktop.org; Wajdeczko, Michal <Michal.Wajdeczko@intel.com>; Mrozek, Michal <michal.mrozek@intel.com>; Jadav, Raag <raag.jadav@intel.com>; Harrison, John C <john.c.harrison@intel.com>
Subject: Re: [PATCH v15 0/6] drm/xe/xe_vm: Implement xe_vm_get_property_ioctl
> 
> Hi Jonathan,
> 
> Are the pagefault reported for any unit in the GPU (including command 
> streamer?) or is it limited to execution units?

Currently, the only faults reported are pagefaults that are handled by the
XE pagefault handler (pf_queue_work_func) and that occur on a
userspace-visible engine class (i.e. not "reserved").  So, I think that means
only execution unit pagefaults are visible?
-Jonathan Cavitt
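
To make the engine-class condition concrete, here is a minimal userspace
sketch.  Everything in it is hypothetical (the enum values, the
hw_to_user_class table and fault_is_reportable are illustrative names, not
the driver's); it only assumes that some internal engine classes have no
userspace-visible counterpart and that faults on those are skipped.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical internal engine classes; not the driver's real enum. */
enum hw_engine_class {
	HW_ENGINE_RENDER,
	HW_ENGINE_COPY,
	HW_ENGINE_COMPUTE,
	HW_ENGINE_OTHER,	/* internal-only, treated as "reserved" */
	HW_ENGINE_CLASS_COUNT,
};

/* Marker for classes that have no userspace-visible ID. */
#define USER_ENGINE_RESERVED (-1)

/* Hypothetical internal-to-user class map, indexed by hw_engine_class. */
static const int hw_to_user_class[] = {
	[HW_ENGINE_RENDER]  = 0,
	[HW_ENGINE_COPY]    = 1,
	[HW_ENGINE_COMPUTE] = 2,
	[HW_ENGINE_OTHER]   = USER_ENGINE_RESERVED,
};

static_assert(sizeof(hw_to_user_class) / sizeof(hw_to_user_class[0]) ==
	      HW_ENGINE_CLASS_COUNT, "map must cover every engine class");

/* A fault is reportable only if its engine class is visible to userspace. */
static bool fault_is_reportable(enum hw_engine_class cls)
{
	/* Out-of-range values are rejected rather than indexed. */
	if ((unsigned int)cls >= HW_ENGINE_CLASS_COUNT)
		return false;

	return hw_to_user_class[cls] != USER_ENGINE_RESERVED;
}
```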
