mbox series

[git,pull] habanalabs for drm-next-6.4

Message ID 20230320154026.GA766126@ogabbay-vm-u20.habana-labs.com (mailing list archive)
State New, archived
Headers show
Series [git,pull] habanalabs for drm-next-6.4 | expand

Pull-request

https://git.kernel.org/pub/scm/linux/kernel/git/ogabbay/linux.git tags/drm-habanalabs-next-2023-03-20

Message

Oded Gabbay March 20, 2023, 3:40 p.m. UTC
Hi Dave, Daniel.

First pull request for 6.4.

Changes are all over the place - new uAPI, new features, optimizations, bug
fixes, cleanups, etc.

Full details are in the signed tag.

Thanks,
Oded

The following changes since commit 8bf6e20253b2d2b614f2c0b491f840e956fa6b05:

  Merge tag 'drm-intel-next-2023-03-07' of git://anongit.freedesktop.org/drm/drm-intel into drm-next (2023-03-15 14:59:31 +1000)

are available in the Git repository at:

  https://git.kernel.org/pub/scm/linux/kernel/git/ogabbay/linux.git tags/drm-habanalabs-next-2023-03-20

for you to fetch changes up to 75b445753047872a69709cfba7e3939660f0ecc1:

  accel/habanalabs: remove redundant TODOs (2023-03-20 17:35:34 +0200)

----------------------------------------------------------------
This tag contains habanalabs driver and accel changes for v6.4:

- uAPI changes:

  - Add opcodes to the CS ioctl to allow user to stall/resume specific engines
    inside Gaudi2. This is to allow the user to perform power
    testing/measurements when training different topologies.

  - Expose in the INFO ioctl the amount of device memory that the driver
    and f/w reserve for themselves.

  - Expose in the INFO ioctl a bit-mask of the available rotator engines
    in Gaudi2. This is to align with other engines that are already exposed.

  - Expose in the INFO ioctl the register's address of the f/w that should
    be used to trigger interrupts from within the user's code running in the
    compute engines.

  - Add a critical-event bit in the eventfd bitmask so the user will know the
    event that was received was critical, and a reset will now occur

  - Expose in the INFO ioctl two new opcodes to fetch information on h/w and
    f/w events. The events recorded are the events that were reported in the
    eventfd.

- New features and improvements:

  - Add a dedicated interrupt ID in MSI-X in the device to the notification of
    an unexpected user-related event in Gaudi2. Handle it in the driver by
    reporting this event.

  - Allow the user to fetch the device memory current usage even when the
    device is undergoing compute-reset (a reset type that only clears the
    compute engines).

  - Enable graceful reset mechanism for compute-reset. This will give the
    user a few seconds before the device is reset. For example, the user can,
    during that time, perform certain device operations (dump data for debug)
    or close the device in an orderly fashion.

  - Align the decoder with the rest of the engines in regard to notification
    to the user about interrupts and in regard to performing graceful reset
    when needed (instead of immediate reset).

  - Add support for assert interrupt from the TPC engine.

  - Get the reset type that is necessary to perform per event from the
    auto-generated irq_map array.

  - Print the specific reason why a device is still in use when notifying to
    the user about it (after the user closed the device's FD).

  - Move to threaded IRQ when handling interrupts of workload completions.

- Firmware related fixes:

  - Fix RAZWI event handler to match newest f/w version.

  - Read error cause register in dma core events because the f/w doesn't
    do that.

  - Increase maximum time to wait for completion of Gaudi2 reset due to f/w
    bug.

  - Align to the latest firmware specs.

- Enforce the release order of the compute device and dma-buf.
  i.e increment the device file refcount for any dma-buf that was exported
  for that device. This will make sure the compute device release function
  won't be called until the user closes all the FDs of the relevant
  dma-bufs. Without this change, closing the device's FD before/without
  closing the dma-buf's FD would always lead to hard-reset of the device.

- Fix a link in the drm documentation to correctly point to the accel section.

- Compilation warnings cleanups

- Misc bug fixes and code cleanups

----------------------------------------------------------------
Bagas Sanjaya (1):
      accel: Link to compute accelerator subsystem intro

Bjorn Helgaas (1):
      accel/habanalabs: Drop redundant pci_enable_pcie_error_reporting()

Colin Ian King (1):
      accel/habanalabs: Fix spelling mistake "maped" -> "mapped"

Dafna Hirschfeld (12):
      accel/habanalabs: tiny refactor of hl_device_reset for readability
      accel/habanalabs: in hl_device_reset remove 'hard_instead_of_soft'
      accel/habanalabs: in hl_device_reset small refactor for readabilty
      accel/habanalabs: don't trace cpu accessible dma alloc/free
      accel/habanalabs: change hw_fini to return int to indicate error
      accel/habanalabs: assert return value of hw_fini
      accel/habanalabs: allow getting HL_INFO_DRAM_USAGE during soft-reset
      accel/habanalabs: unify err log of hw-fini failure in dirty state
      accel/habanalabs: move soft-reset wait to soft-reset execute
      accel/habanalabs: in hw_fini return error code if polling timed-out
      accel/habanalabs: fix use of var reset_sleep_ms
      accel/habanalabs: in {e/p}dma_core events read the err cause reg

Dani Liberman (3):
      accel/habanalabs: fix address decode RAZWI handling
      accel/habanalabs: fix page fault event clear
      accel/habanalabs: change razwi handle after fw fix

Koby Elbaz (12):
      accel/habanalabs: capture RAZWI info only if HW indication detected
      accel/habanalabs: unsecure CFG_TPC_ID register
      accel/habanalabs: disable PCI when escalating compute to hard-reset
      accel/habanalabs: rename security function parameters
      accel/habanalabs: break is_idle function into per-engine sub-routines
      accel/habanalabs: verify return code after scrubbing ARCs DCCMs
      accel/habanalabs: remove a useless is_idle TPC flag
      accel/habanalabs: fix register address on PDMA/EDMA idle check
      accel/habanalabs: use a mutex rather than a spinlock
      accel/habanalabs: add uapi to stall/resume engine
      accel/habanalabs: do not verify engine modes after being changed
      accel/habanalabs: return tlb inv error code upon failure

Moti Haimovski (2):
      accel/habanalabs: add critical-event bit in notifier
      accel/habanalabs: minimize error prints when mem map fails

Oded Gabbay (6):
      accel/habanalabs: split cdev creation to separate function
      accel/habanalabs: save class in hdev
      accel/habanalabs: refactor debugfs init
      accel/habanalabs: make gaudi2_is_device_idle() static
      accel/habanalabs: align to latest firmware specs
      accel/habanalabs: fix field names in hl_info_hw_ip_info

Ofir Bitton (9):
      accel/habanalabs: increase user interrupt grace time
      accel/habanalabs: expose engine core int reg address
      accel/habanalabs: capture interrupt timestamp in handler
      accel/habanalabs: add support for TPC assert
      accel/habanalabs: increase reset poll timeout
      accel/habanalabs: expose dram reserved size by kmd
      accel/habanalabs: expose rotator mask to userspace
      accel/habanalabs: add handling for unexpected user event
      accel/habanalabs: remove redundant TODOs

Ohad Sharabi (3):
      accel/habanalabs: get reset type indication from irq_map
      accel/habanalabs: modify events reset policy
      accel/habanalabs: regenerate gaudi2 ids_map_extended

Sagiv Ozeri (2):
      accel/habanalabs: organize hl_device structure comment
      accel/habanalabs: add device id to all threads names

Tal Cohen (1):
      accel/habanalabs: change user interrupt to threaded IRQ

Tom Rix (2):
      accel/habanalabs: change unused extern decl of hdev to forward decl of hl_device
      accel/habanalabs: set hl_capture_*_err storage-class-specifier to static

Tomer Tayar (15):
      accel/habanalabs: use memhash_node_export_put() in hl_release_dmabuf()
      accel/habanalabs: add info when FD released while device still in use
      accel/habanalabs: enforce release order of compute device and dma-buf
      accel/habanalabs: enable graceful reset mechanism for compute-reset
      accel/habanalabs: fix print in hl_irq_handler_eq()
      accel/habanalabs: remove hl_irq_handler_default()
      accel/habanalabs: improve readability of engines idle mask print
      accel/habanalabs: remove unneeded irq_handler variable
      accel/habanalabs: add helper function to get vm hash node
      accel/habanalabs: use notifications and graceful reset for decoder
      accel/habanalabs: use scnprintf() in print_device_in_use_info()
      accel/habanalabs: postpone mem_mgr IDR destruction to hpriv_release()
      accel/habanalabs: remove '\n' when passing strings to gaudi2_print_event()
      accel/habanalabs: fix a maybe-uninitialized compilation warnings
      accel/habanalabs: fix a missing-braces compilation warning

farah kassabri (1):
      accel/habanalabs: fix few misspelled words in the code

 .../accel/habanalabs/common/command_submission.c   |  130 +-
 drivers/accel/habanalabs/common/debugfs.c          |  142 +-
 drivers/accel/habanalabs/common/decoder.c          |   22 +-
 drivers/accel/habanalabs/common/device.c           |  315 +-
 drivers/accel/habanalabs/common/firmware_if.c      |    2 +-
 drivers/accel/habanalabs/common/habanalabs.h       |  129 +-
 drivers/accel/habanalabs/common/habanalabs_drv.c   |   14 +-
 drivers/accel/habanalabs/common/habanalabs_ioctl.c |   60 +-
 drivers/accel/habanalabs/common/irq.c              |   73 +-
 drivers/accel/habanalabs/common/memory.c           |  133 +-
 drivers/accel/habanalabs/common/memory_mgr.c       |   15 +-
 drivers/accel/habanalabs/common/mmu/mmu.c          |    6 +-
 drivers/accel/habanalabs/common/security.c         |    6 +-
 drivers/accel/habanalabs/common/security.h         |    2 +-
 drivers/accel/habanalabs/gaudi/gaudi.c             |   65 +-
 drivers/accel/habanalabs/gaudi2/gaudi2.c           | 1543 ++++--
 drivers/accel/habanalabs/gaudi2/gaudi2P.h          |    9 +-
 drivers/accel/habanalabs/gaudi2/gaudi2_coresight.c |    2 +-
 drivers/accel/habanalabs/gaudi2/gaudi2_masks.h     |    3 +-
 drivers/accel/habanalabs/gaudi2/gaudi2_security.c  |    1 +
 drivers/accel/habanalabs/goya/goya.c               |   21 +-
 drivers/accel/habanalabs/include/common/cpucp_if.h |    9 +-
 .../accel/habanalabs/include/common/hl_boot_if.h   |   47 +-
 .../include/gaudi2/asic_reg/gaudi2_regs.h          |    5 +
 drivers/accel/habanalabs/include/gaudi2/gaudi2.h   |    2 +
 .../include/gaudi2/gaudi2_async_events.h           |    4 +-
 .../include/gaudi2/gaudi2_async_ids_map_extended.h | 5294 ++++++++++----------
 .../accel/habanalabs/include/gaudi2/gaudi2_fw_if.h |    5 +-
 include/drm/drm_file.h                             |    3 +-
 include/uapi/drm/habanalabs_accel.h                |  102 +-
 30 files changed, 4541 insertions(+), 3623 deletions(-)