mbox series

[0/6] Fix crash after reloading a driver using ttm

Message ID 20190416003523.5069-1-kherbst@redhat.com (mailing list archive)
Headers show
Series Fix crash after reloading a driver using ttm | expand

Message

Karol Herbst April 16, 2019, 12:35 a.m. UTC
Kobjects are supposed to be dynamically allocated, but with recent changes
this rule was violated. Reverting those commits fixes crashes when a drm
driver using TTM gets loaded again.

The object in question is "ttm_mem_glob" declared inside
"include/drm/ttm/ttm_memory.h" and instatiated inside
"drivers/gpu/drm/ttm/ttm_memory.c".

from "Documentation/kobject.txt":
"Because kobjects are dynamic, they must not be declared statically or on
the stack, but instead, always allocated dynamically.  Future versions of
the kernel will contain a run-time check for kobjects that are created
statically and will warn the developer of this improper usage."

Unloading ttm before reloading the driver workarounds that crash, because
the memory backing the kobject member "kobj" is cleaned up. The kobject_del
and kobject_put function never free or clean up the kobject object leaving
it in an undefined state.

I reverted a few more commits to make it less painful for me to rever this
rather big change.

dmesg output:
[54758.418036] kobject (00000000687a067d): tried to init an initialized object, something is seriously wrong.
[54758.418040] CPU: 6 PID: 26746 Comm: insmod Tainted: G     U     OE     5.0.6-200.fc29.x86_64 #1
[54758.418041] Hardware name: Dell Inc. XPS 15 9560/05FFDN, BIOS 1.12.1 10/02/2018
[54758.418041] Call Trace:
[54758.418049]  dump_stack+0x5c/0x80
[54758.418054]  kobject_init.cold.9+0x31/0x3f
[54758.418057]  kobject_init_and_add+0x35/0xa0
[54758.418063]  ttm_mem_global_init+0x8f/0x2b0 [ttm]
[54758.418067]  ? __debugfs_create_file+0xe1/0x110
[54758.418071]  ttm_bo_device_init+0x198/0x2a0 [ttm]
[54758.418144]  nouveau_ttm_init+0xbf/0x340 [nouveau]
[54758.418206]  nouveau_drm_device_init+0x125/0x7d0 [nouveau]
[54758.418210]  ? pci_bus_read_config_word+0x49/0x70
[54758.418266]  nouveau_drm_probe+0x26f/0x2c0 [nouveau]
[54758.418270]  local_pci_probe+0x41/0x90
[54758.418272]  pci_device_probe+0x118/0x1a0
[54758.418275]  really_probe+0xf8/0x3b0
[54758.418277]  driver_probe_device+0xb3/0xf0
[54758.418278]  __driver_attach+0xdd/0x110
[54758.418280]  ? driver_probe_device+0xf0/0xf0
[54758.418282]  bus_for_each_dev+0x77/0xc0
[54758.418285]  ? klist_add_tail+0x3b/0x60
[54758.418287]  bus_add_driver+0x152/0x230
[54758.418288]  ? 0xffffffffc1027000
[54758.418290]  driver_register+0x6b/0xb0
[54758.418291]  ? 0xffffffffc1027000
[54758.418294]  do_one_initcall+0x46/0x1c3
[54758.418296]  ? _cond_resched+0x15/0x30
[54758.418299]  ? kmem_cache_alloc_trace+0x154/0x1d0
[54758.418302]  do_init_module+0x5a/0x210
[54758.418304]  load_module+0x2096/0x22d0
[54758.418308]  ? ima_post_read_file+0xf4/0x100
[54758.418310]  ? __do_sys_finit_module+0xa8/0x110
[54758.418312]  __do_sys_finit_module+0xa8/0x110
[54758.418315]  do_syscall_64+0x5b/0x160
[54758.418317]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[54758.418319] RIP: 0033:0x7fc0b38b6edd
[54758.418321] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 7b 7f 0c 00 f7 d8 64 89 01 48
[54758.418322] RSP: 002b:00007ffc7f1620d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
[54758.418323] RAX: ffffffffffffffda RBX: 00005629a3996ac0 RCX: 00007fc0b38b6edd
[54758.418324] RDX: 0000000000000000 RSI: 00005629a3996260 RDI: 0000000000000003
[54758.418325] RBP: 00005629a3996260 R08: 0000000000000000 R09: 0000000000000000
[54758.418326] R10: 0000000000000003 R11: 0000000000000246 R12: 0000000000000000
[54758.418326] R13: 00005629a3996a80 R14: 0000000000000000 R15: 00005629a3996260
[54758.418346] BUG: unable to handle kernel paging request at 00000004ee194700
[54758.418348] #PF error: [WRITE]
[54758.418349] PGD 0 P4D 0 
[54758.418352] Oops: 0002 [#1] SMP PTI
[54758.418354] CPU: 6 PID: 26746 Comm: insmod Tainted: G     U     OE     5.0.6-200.fc29.x86_64 #1
[54758.418355] Hardware name: Dell Inc. XPS 15 9560/05FFDN, BIOS 1.12.1 10/02/2018
[54758.418360] RIP: 0010:ttm_mem_global_init+0x1fe/0x2b0 [ttm]
[54758.418361] Code: 00 00 00 48 89 5d 40 48 89 ab a0 00 00 00 e8 79 ba 95 f8 85 c0 0f 85 a2 00 00 00 8b 83 90 00 00 00 8d 50 01 89 93 90 00 00 00 <48> 89 ac c3 80 00 00 00 85 d2 0f 85 c6 99 00 00 48 8b 83 98 00 00
[54758.418363] RSP: 0018:ffffc24e53c1f988 EFLAGS: 00010246
[54758.418364] RAX: 00000000a5a2f300 RBX: ffffffffc101ae80 RCX: 0000000000000000
[54758.418366] RDX: 00000000a5a2f301 RSI: 0000000000000000 RDI: ffff9ebe78b67330
[54758.418367] RBP: ffff9ebe42c79d00 R08: 0000000000000044 R09: 0000000000000228
[54758.418368] R10: 0000000000000000 R11: ffff9ebb62720bb8 R12: 0000000000000000
[54758.418370] R13: ffff9ebe27d71640 R14: ffff9eba9f258600 R15: ffffffffc1453880
[54758.418371] FS:  00007fc0b379c280(0000) GS:ffff9ebeae380000(0000) knlGS:0000000000000000
[54758.418373] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[54758.418374] CR2: 00000004ee194700 CR3: 000000045069e005 CR4: 00000000003606e0
[54758.418375] Call Trace:
[54758.418381]  ttm_bo_device_init+0x198/0x2a0 [ttm]
[54758.418439]  nouveau_ttm_init+0xbf/0x340 [nouveau]
[54758.418495]  nouveau_drm_device_init+0x125/0x7d0 [nouveau]
[54758.418499]  ? pci_bus_read_config_word+0x49/0x70
[54758.418552]  nouveau_drm_probe+0x26f/0x2c0 [nouveau]
[54758.418556]  local_pci_probe+0x41/0x90
[54758.418558]  pci_device_probe+0x118/0x1a0
[54758.418561]  really_probe+0xf8/0x3b0
[54758.418563]  driver_probe_device+0xb3/0xf0
[54758.418565]  __driver_attach+0xdd/0x110
[54758.418567]  ? driver_probe_device+0xf0/0xf0
[54758.418570]  bus_for_each_dev+0x77/0xc0
[54758.418573]  ? klist_add_tail+0x3b/0x60
[54758.418574]  bus_add_driver+0x152/0x230
[54758.418576]  ? 0xffffffffc1027000
[54758.418578]  driver_register+0x6b/0xb0
[54758.418580]  ? 0xffffffffc1027000
[54758.418583]  do_one_initcall+0x46/0x1c3
[54758.418585]  ? _cond_resched+0x15/0x30
[54758.418587]  ? kmem_cache_alloc_trace+0x154/0x1d0
[54758.418591]  do_init_module+0x5a/0x210
[54758.418593]  load_module+0x2096/0x22d0
[54758.418596]  ? ima_post_read_file+0xf4/0x100
[54758.418599]  ? __do_sys_finit_module+0xa8/0x110
[54758.418601]  __do_sys_finit_module+0xa8/0x110
[54758.418604]  do_syscall_64+0x5b/0x160
[54758.418607]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[54758.418609] RIP: 0033:0x7fc0b38b6edd
[54758.418610] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 7b 7f 0c 00 f7 d8 64 89 01 48
[54758.418612] RSP: 002b:00007ffc7f1620d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
[54758.418614] RAX: ffffffffffffffda RBX: 00005629a3996ac0 RCX: 00007fc0b38b6edd
[54758.418615] RDX: 0000000000000000 RSI: 00005629a3996260 RDI: 0000000000000003
[54758.418616] RBP: 00005629a3996260 R08: 0000000000000000 R09: 0000000000000000
[54758.418618] R10: 0000000000000003 R11: 0000000000000246 R12: 0000000000000000
[54758.418619] R13: 00005629a3996a80 R14: 0000000000000000 R15: 00005629a3996260
[54758.418620] Modules linked in: nouveau(OE+) acpi_call(OE) ttm rfcomm ccm xt_CHECKSUM ipt_MASQUERADE tun bridge stp llc devlink ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ebtable_nat ip6table_nat nf_nat_ipv6 ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat_ipv4 nf_nat iptable_mangle iptable_raw iptable_security nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c ip_set nfnetlink ebtable_filter ebtables ip6table_filter ip6_tables cmac bnep sunrpc vfat fat btusb btrtl btbcm btintel uvcvideo videobuf2_vmalloc bluetooth videobuf2_memops videobuf2_v4l2 videobuf2_common videodev media ecdh_generic arc4 snd_hda_codec_hdmi ath10k_pci ath10k_core snd_hda_codec_realtek iTCO_wdt mei_wdt iTCO_vendor_support snd_hda_codec_generic mac80211 intel_rapl snd_hda_intel x86_pkg_temp_thermal intel_powerclamp coretemp snd_hda_codec dell_laptop ledtrig_audio dell_smm_hwmon snd_hda_core kvm_intel snd_hwdep ath snd_seq snd_seq_device intel_cstate cfg80211 snd_pcm dell_wmi intel_uncore
[54758.418653]  dell_smbios dcdbas intel_rapl_perf snd_timer joydev dell_wmi_descriptor intel_wmi_thunderbolt wmi_bmof snd idma64 soundcore i2c_i801 rfkill rtsx_pci_ms memstick mei_me mei processor_thermal_device intel_lpss_pci intel_soc_dts_iosf intel_pch_thermal intel_lpss int3400_thermal acpi_pad acpi_thermal_rel int3403_thermal intel_hid pcc_cpufreq int340x_thermal_zone sparse_keymap binfmt_misc zram dm_crypt hid_multitouch i915 kvmgt mdev vfio kvm irqbypass i2c_algo_bit crct10dif_pclmul drm_kms_helper rtsx_pci_sdmmc crc32_pclmul mmc_core crc32c_intel mxm_wmi drm nvme ghash_clmulni_intel nvme_core serio_raw rtsx_pci i2c_hid video wmi i2c_dev lz4 lz4_compress [last unloaded: nouveau]
[54758.418679] CR2: 00000004ee194700
[54758.418681] ---[ end trace c5175234e6efc034 ]---
[54758.418686] RIP: 0010:ttm_mem_global_init+0x1fe/0x2b0 [ttm]
[54758.418688] Code: 00 00 00 48 89 5d 40 48 89 ab a0 00 00 00 e8 79 ba 95 f8 85 c0 0f 85 a2 00 00 00 8b 83 90 00 00 00 8d 50 01 89 93 90 00 00 00 <48> 89 ac c3 80 00 00 00 85 d2 0f 85 c6 99 00 00 48 8b 83 98 00 00
[54758.418689] RSP: 0018:ffffc24e53c1f988 EFLAGS: 00010246
[54758.418690] RAX: 00000000a5a2f300 RBX: ffffffffc101ae80 RCX: 0000000000000000
[54758.418691] RDX: 00000000a5a2f301 RSI: 0000000000000000 RDI: ffff9ebe78b67330
[54758.418692] RBP: ffff9ebe42c79d00 R08: 0000000000000044 R09: 0000000000000228
[54758.418694] R10: 0000000000000000 R11: ffff9ebb62720bb8 R12: 0000000000000000
[54758.418695] R13: ffff9ebe27d71640 R14: ffff9eba9f258600 R15: ffffffffc1453880
[54758.418696] FS:  00007fc0b379c280(0000) GS:ffff9ebeae380000(0000) knlGS:0000000000000000
[54758.418698] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[54758.418699] CR2: 00000004ee194700 CR3: 000000045069e005 CR4: 00000000003606e0

Karol Herbst (6):
  Revert "drm: Remove drm_global.{c,h} v2"
  Revert "drm/ttm: initialize globals during device init (v2)"
  Revert "drm/ttm: Fix bo_global and mem_global kfree error"
  Revert "drm/ttm: use a static ttm_bo_global instance"
  Revert "drm/ttm: make the device list mutex static"
  Revert "drm/ttm: use a static ttm_mem_global instance"

 drivers/gpu/drm/Makefile                      |   2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c       | 103 ++++++++++++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h       |   2 +
 drivers/gpu/drm/ast/ast_drv.h                 |   2 +
 drivers/gpu/drm/ast/ast_ttm.c                 |  64 ++++++++
 drivers/gpu/drm/bochs/bochs.h                 |   2 +
 drivers/gpu/drm/bochs/bochs_mm.c              |  61 ++++++++
 drivers/gpu/drm/cirrus/cirrus_drv.h           |   2 +
 drivers/gpu/drm/cirrus/cirrus_ttm.c           |  64 ++++++++
 drivers/gpu/drm/drm_drv.c                     |   2 +
 drivers/gpu/drm/drm_global.c                  | 137 ++++++++++++++++++
 .../gpu/drm/hisilicon/hibmc/hibmc_drm_drv.h   |   2 +
 drivers/gpu/drm/hisilicon/hibmc/hibmc_ttm.c   |  57 ++++++++
 drivers/gpu/drm/mgag200/mgag200_drv.h         |   2 +
 drivers/gpu/drm/mgag200/mgag200_ttm.c         |  64 ++++++++
 drivers/gpu/drm/nouveau/nouveau_drv.h         |   2 +
 drivers/gpu/drm/nouveau/nouveau_ttm.c         |  67 +++++++++
 drivers/gpu/drm/qxl/qxl_drv.h                 |   3 +
 drivers/gpu/drm/qxl/qxl_ttm.c                 |  61 ++++++++
 drivers/gpu/drm/radeon/radeon.h               |   3 +
 drivers/gpu/drm/radeon/radeon_ttm.c           |  65 +++++++++
 drivers/gpu/drm/ttm/ttm_bo.c                  |  67 +++------
 drivers/gpu/drm/ttm/ttm_memory.c              |  14 +-
 drivers/gpu/drm/virtio/virtgpu_drv.h          |   3 +
 drivers/gpu/drm/virtio/virtgpu_ttm.c          |  62 ++++++++
 drivers/gpu/drm/vmwgfx/vmwgfx_drv.c           |  15 +-
 drivers/gpu/drm/vmwgfx/vmwgfx_drv.h           |   6 +-
 drivers/gpu/drm/vmwgfx/vmwgfx_ttm_glue.c      |  54 +++++++
 drivers/staging/vboxvideo/vbox_drv.h          |   2 +
 drivers/staging/vboxvideo/vbox_ttm.c          |  65 ++++++++-
 include/drm/drmP.h                            |   1 +
 include/drm/drm_global.h                      |  53 +++++++
 include/drm/ttm/ttm_bo_driver.h               |  54 ++++++-
 include/drm/ttm/ttm_memory.h                  |   4 +-
 34 files changed, 1102 insertions(+), 65 deletions(-)
 create mode 100644 drivers/gpu/drm/drm_global.c
 create mode 100644 include/drm/drm_global.h

Comments

Christian König April 16, 2019, 6:38 a.m. UTC | #1
Am 16.04.19 um 02:35 schrieb Karol Herbst:
> Kobjects are supposed to be dynamically allocated, but with recent changes
> this rule was violated. Reverting those commits fixes crashes when a drm
> driver using TTM gets loaded again.
>
> The object in question is "ttm_mem_glob" declared inside
> "include/drm/ttm/ttm_memory.h" and instatiated inside
> "drivers/gpu/drm/ttm/ttm_memory.c".
>
> from "Documentation/kobject.txt":
> "Because kobjects are dynamic, they must not be declared statically or on
> the stack, but instead, always allocated dynamically.  Future versions of
> the kernel will contain a run-time check for kobjects that are created
> statically and will warn the developer of this improper usage."
>
> Unloading ttm before reloading the driver workarounds that crash, because
> the memory backing the kobject member "kobj" is cleaned up. The kobject_del
> and kobject_put function never free or clean up the kobject object leaving
> it in an undefined state.
>
> I reverted a few more commits to make it less painful for me to rever this
> rather big change.

Well, NAK. By reverting those change you also re-introduced the problems 
we originally fixed with those patches.

Please work on a proper fix instead,
Christian.

>
> dmesg output:
> [54758.418036] kobject (00000000687a067d): tried to init an initialized object, something is seriously wrong.
> [54758.418040] CPU: 6 PID: 26746 Comm: insmod Tainted: G     U     OE     5.0.6-200.fc29.x86_64 #1
> [54758.418041] Hardware name: Dell Inc. XPS 15 9560/05FFDN, BIOS 1.12.1 10/02/2018
> [54758.418041] Call Trace:
> [54758.418049]  dump_stack+0x5c/0x80
> [54758.418054]  kobject_init.cold.9+0x31/0x3f
> [54758.418057]  kobject_init_and_add+0x35/0xa0
> [54758.418063]  ttm_mem_global_init+0x8f/0x2b0 [ttm]
> [54758.418067]  ? __debugfs_create_file+0xe1/0x110
> [54758.418071]  ttm_bo_device_init+0x198/0x2a0 [ttm]
> [54758.418144]  nouveau_ttm_init+0xbf/0x340 [nouveau]
> [54758.418206]  nouveau_drm_device_init+0x125/0x7d0 [nouveau]
> [54758.418210]  ? pci_bus_read_config_word+0x49/0x70
> [54758.418266]  nouveau_drm_probe+0x26f/0x2c0 [nouveau]
> [54758.418270]  local_pci_probe+0x41/0x90
> [54758.418272]  pci_device_probe+0x118/0x1a0
> [54758.418275]  really_probe+0xf8/0x3b0
> [54758.418277]  driver_probe_device+0xb3/0xf0
> [54758.418278]  __driver_attach+0xdd/0x110
> [54758.418280]  ? driver_probe_device+0xf0/0xf0
> [54758.418282]  bus_for_each_dev+0x77/0xc0
> [54758.418285]  ? klist_add_tail+0x3b/0x60
> [54758.418287]  bus_add_driver+0x152/0x230
> [54758.418288]  ? 0xffffffffc1027000
> [54758.418290]  driver_register+0x6b/0xb0
> [54758.418291]  ? 0xffffffffc1027000
> [54758.418294]  do_one_initcall+0x46/0x1c3
> [54758.418296]  ? _cond_resched+0x15/0x30
> [54758.418299]  ? kmem_cache_alloc_trace+0x154/0x1d0
> [54758.418302]  do_init_module+0x5a/0x210
> [54758.418304]  load_module+0x2096/0x22d0
> [54758.418308]  ? ima_post_read_file+0xf4/0x100
> [54758.418310]  ? __do_sys_finit_module+0xa8/0x110
> [54758.418312]  __do_sys_finit_module+0xa8/0x110
> [54758.418315]  do_syscall_64+0x5b/0x160
> [54758.418317]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [54758.418319] RIP: 0033:0x7fc0b38b6edd
> [54758.418321] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 7b 7f 0c 00 f7 d8 64 89 01 48
> [54758.418322] RSP: 002b:00007ffc7f1620d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
> [54758.418323] RAX: ffffffffffffffda RBX: 00005629a3996ac0 RCX: 00007fc0b38b6edd
> [54758.418324] RDX: 0000000000000000 RSI: 00005629a3996260 RDI: 0000000000000003
> [54758.418325] RBP: 00005629a3996260 R08: 0000000000000000 R09: 0000000000000000
> [54758.418326] R10: 0000000000000003 R11: 0000000000000246 R12: 0000000000000000
> [54758.418326] R13: 00005629a3996a80 R14: 0000000000000000 R15: 00005629a3996260
> [54758.418346] BUG: unable to handle kernel paging request at 00000004ee194700
> [54758.418348] #PF error: [WRITE]
> [54758.418349] PGD 0 P4D 0
> [54758.418352] Oops: 0002 [#1] SMP PTI
> [54758.418354] CPU: 6 PID: 26746 Comm: insmod Tainted: G     U     OE     5.0.6-200.fc29.x86_64 #1
> [54758.418355] Hardware name: Dell Inc. XPS 15 9560/05FFDN, BIOS 1.12.1 10/02/2018
> [54758.418360] RIP: 0010:ttm_mem_global_init+0x1fe/0x2b0 [ttm]
> [54758.418361] Code: 00 00 00 48 89 5d 40 48 89 ab a0 00 00 00 e8 79 ba 95 f8 85 c0 0f 85 a2 00 00 00 8b 83 90 00 00 00 8d 50 01 89 93 90 00 00 00 <48> 89 ac c3 80 00 00 00 85 d2 0f 85 c6 99 00 00 48 8b 83 98 00 00
> [54758.418363] RSP: 0018:ffffc24e53c1f988 EFLAGS: 00010246
> [54758.418364] RAX: 00000000a5a2f300 RBX: ffffffffc101ae80 RCX: 0000000000000000
> [54758.418366] RDX: 00000000a5a2f301 RSI: 0000000000000000 RDI: ffff9ebe78b67330
> [54758.418367] RBP: ffff9ebe42c79d00 R08: 0000000000000044 R09: 0000000000000228
> [54758.418368] R10: 0000000000000000 R11: ffff9ebb62720bb8 R12: 0000000000000000
> [54758.418370] R13: ffff9ebe27d71640 R14: ffff9eba9f258600 R15: ffffffffc1453880
> [54758.418371] FS:  00007fc0b379c280(0000) GS:ffff9ebeae380000(0000) knlGS:0000000000000000
> [54758.418373] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [54758.418374] CR2: 00000004ee194700 CR3: 000000045069e005 CR4: 00000000003606e0
> [54758.418375] Call Trace:
> [54758.418381]  ttm_bo_device_init+0x198/0x2a0 [ttm]
> [54758.418439]  nouveau_ttm_init+0xbf/0x340 [nouveau]
> [54758.418495]  nouveau_drm_device_init+0x125/0x7d0 [nouveau]
> [54758.418499]  ? pci_bus_read_config_word+0x49/0x70
> [54758.418552]  nouveau_drm_probe+0x26f/0x2c0 [nouveau]
> [54758.418556]  local_pci_probe+0x41/0x90
> [54758.418558]  pci_device_probe+0x118/0x1a0
> [54758.418561]  really_probe+0xf8/0x3b0
> [54758.418563]  driver_probe_device+0xb3/0xf0
> [54758.418565]  __driver_attach+0xdd/0x110
> [54758.418567]  ? driver_probe_device+0xf0/0xf0
> [54758.418570]  bus_for_each_dev+0x77/0xc0
> [54758.418573]  ? klist_add_tail+0x3b/0x60
> [54758.418574]  bus_add_driver+0x152/0x230
> [54758.418576]  ? 0xffffffffc1027000
> [54758.418578]  driver_register+0x6b/0xb0
> [54758.418580]  ? 0xffffffffc1027000
> [54758.418583]  do_one_initcall+0x46/0x1c3
> [54758.418585]  ? _cond_resched+0x15/0x30
> [54758.418587]  ? kmem_cache_alloc_trace+0x154/0x1d0
> [54758.418591]  do_init_module+0x5a/0x210
> [54758.418593]  load_module+0x2096/0x22d0
> [54758.418596]  ? ima_post_read_file+0xf4/0x100
> [54758.418599]  ? __do_sys_finit_module+0xa8/0x110
> [54758.418601]  __do_sys_finit_module+0xa8/0x110
> [54758.418604]  do_syscall_64+0x5b/0x160
> [54758.418607]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [54758.418609] RIP: 0033:0x7fc0b38b6edd
> [54758.418610] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 7b 7f 0c 00 f7 d8 64 89 01 48
> [54758.418612] RSP: 002b:00007ffc7f1620d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
> [54758.418614] RAX: ffffffffffffffda RBX: 00005629a3996ac0 RCX: 00007fc0b38b6edd
> [54758.418615] RDX: 0000000000000000 RSI: 00005629a3996260 RDI: 0000000000000003
> [54758.418616] RBP: 00005629a3996260 R08: 0000000000000000 R09: 0000000000000000
> [54758.418618] R10: 0000000000000003 R11: 0000000000000246 R12: 0000000000000000
> [54758.418619] R13: 00005629a3996a80 R14: 0000000000000000 R15: 00005629a3996260
> [54758.418620] Modules linked in: nouveau(OE+) acpi_call(OE) ttm rfcomm ccm xt_CHECKSUM ipt_MASQUERADE tun bridge stp llc devlink ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ebtable_nat ip6table_nat nf_nat_ipv6 ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat_ipv4 nf_nat iptable_mangle iptable_raw iptable_security nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c ip_set nfnetlink ebtable_filter ebtables ip6table_filter ip6_tables cmac bnep sunrpc vfat fat btusb btrtl btbcm btintel uvcvideo videobuf2_vmalloc bluetooth videobuf2_memops videobuf2_v4l2 videobuf2_common videodev media ecdh_generic arc4 snd_hda_codec_hdmi ath10k_pci ath10k_core snd_hda_codec_realtek iTCO_wdt mei_wdt iTCO_vendor_support snd_hda_codec_generic mac80211 intel_rapl snd_hda_intel x86_pkg_temp_thermal intel_powerclamp coretemp snd_hda_codec dell_laptop ledtrig_audio dell_smm_hwmon snd_hda_core kvm_intel snd_hwdep ath snd_seq snd_seq_device intel_cstate cfg80211 snd_pcm dell_wmi intel_uncore
> [54758.418653]  dell_smbios dcdbas intel_rapl_perf snd_timer joydev dell_wmi_descriptor intel_wmi_thunderbolt wmi_bmof snd idma64 soundcore i2c_i801 rfkill rtsx_pci_ms memstick mei_me mei processor_thermal_device intel_lpss_pci intel_soc_dts_iosf intel_pch_thermal intel_lpss int3400_thermal acpi_pad acpi_thermal_rel int3403_thermal intel_hid pcc_cpufreq int340x_thermal_zone sparse_keymap binfmt_misc zram dm_crypt hid_multitouch i915 kvmgt mdev vfio kvm irqbypass i2c_algo_bit crct10dif_pclmul drm_kms_helper rtsx_pci_sdmmc crc32_pclmul mmc_core crc32c_intel mxm_wmi drm nvme ghash_clmulni_intel nvme_core serio_raw rtsx_pci i2c_hid video wmi i2c_dev lz4 lz4_compress [last unloaded: nouveau]
> [54758.418679] CR2: 00000004ee194700
> [54758.418681] ---[ end trace c5175234e6efc034 ]---
> [54758.418686] RIP: 0010:ttm_mem_global_init+0x1fe/0x2b0 [ttm]
> [54758.418688] Code: 00 00 00 48 89 5d 40 48 89 ab a0 00 00 00 e8 79 ba 95 f8 85 c0 0f 85 a2 00 00 00 8b 83 90 00 00 00 8d 50 01 89 93 90 00 00 00 <48> 89 ac c3 80 00 00 00 85 d2 0f 85 c6 99 00 00 48 8b 83 98 00 00
> [54758.418689] RSP: 0018:ffffc24e53c1f988 EFLAGS: 00010246
> [54758.418690] RAX: 00000000a5a2f300 RBX: ffffffffc101ae80 RCX: 0000000000000000
> [54758.418691] RDX: 00000000a5a2f301 RSI: 0000000000000000 RDI: ffff9ebe78b67330
> [54758.418692] RBP: ffff9ebe42c79d00 R08: 0000000000000044 R09: 0000000000000228
> [54758.418694] R10: 0000000000000000 R11: ffff9ebb62720bb8 R12: 0000000000000000
> [54758.418695] R13: ffff9ebe27d71640 R14: ffff9eba9f258600 R15: ffffffffc1453880
> [54758.418696] FS:  00007fc0b379c280(0000) GS:ffff9ebeae380000(0000) knlGS:0000000000000000
> [54758.418698] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [54758.418699] CR2: 00000004ee194700 CR3: 000000045069e005 CR4: 00000000003606e0
>
> Karol Herbst (6):
>    Revert "drm: Remove drm_global.{c,h} v2"
>    Revert "drm/ttm: initialize globals during device init (v2)"
>    Revert "drm/ttm: Fix bo_global and mem_global kfree error"
>    Revert "drm/ttm: use a static ttm_bo_global instance"
>    Revert "drm/ttm: make the device list mutex static"
>    Revert "drm/ttm: use a static ttm_mem_global instance"
>
>   drivers/gpu/drm/Makefile                      |   2 +-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c       | 103 ++++++++++++-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h       |   2 +
>   drivers/gpu/drm/ast/ast_drv.h                 |   2 +
>   drivers/gpu/drm/ast/ast_ttm.c                 |  64 ++++++++
>   drivers/gpu/drm/bochs/bochs.h                 |   2 +
>   drivers/gpu/drm/bochs/bochs_mm.c              |  61 ++++++++
>   drivers/gpu/drm/cirrus/cirrus_drv.h           |   2 +
>   drivers/gpu/drm/cirrus/cirrus_ttm.c           |  64 ++++++++
>   drivers/gpu/drm/drm_drv.c                     |   2 +
>   drivers/gpu/drm/drm_global.c                  | 137 ++++++++++++++++++
>   .../gpu/drm/hisilicon/hibmc/hibmc_drm_drv.h   |   2 +
>   drivers/gpu/drm/hisilicon/hibmc/hibmc_ttm.c   |  57 ++++++++
>   drivers/gpu/drm/mgag200/mgag200_drv.h         |   2 +
>   drivers/gpu/drm/mgag200/mgag200_ttm.c         |  64 ++++++++
>   drivers/gpu/drm/nouveau/nouveau_drv.h         |   2 +
>   drivers/gpu/drm/nouveau/nouveau_ttm.c         |  67 +++++++++
>   drivers/gpu/drm/qxl/qxl_drv.h                 |   3 +
>   drivers/gpu/drm/qxl/qxl_ttm.c                 |  61 ++++++++
>   drivers/gpu/drm/radeon/radeon.h               |   3 +
>   drivers/gpu/drm/radeon/radeon_ttm.c           |  65 +++++++++
>   drivers/gpu/drm/ttm/ttm_bo.c                  |  67 +++------
>   drivers/gpu/drm/ttm/ttm_memory.c              |  14 +-
>   drivers/gpu/drm/virtio/virtgpu_drv.h          |   3 +
>   drivers/gpu/drm/virtio/virtgpu_ttm.c          |  62 ++++++++
>   drivers/gpu/drm/vmwgfx/vmwgfx_drv.c           |  15 +-
>   drivers/gpu/drm/vmwgfx/vmwgfx_drv.h           |   6 +-
>   drivers/gpu/drm/vmwgfx/vmwgfx_ttm_glue.c      |  54 +++++++
>   drivers/staging/vboxvideo/vbox_drv.h          |   2 +
>   drivers/staging/vboxvideo/vbox_ttm.c          |  65 ++++++++-
>   include/drm/drmP.h                            |   1 +
>   include/drm/drm_global.h                      |  53 +++++++
>   include/drm/ttm/ttm_bo_driver.h               |  54 ++++++-
>   include/drm/ttm/ttm_memory.h                  |   4 +-
>   34 files changed, 1102 insertions(+), 65 deletions(-)
>   create mode 100644 drivers/gpu/drm/drm_global.c
>   create mode 100644 include/drm/drm_global.h
>
Karol Herbst April 16, 2019, 9:10 a.m. UTC | #2
On Tue, Apr 16, 2019 at 8:38 AM Christian König
<ckoenig.leichtzumerken@gmail.com> wrote:
>
> Am 16.04.19 um 02:35 schrieb Karol Herbst:
> > Kobjects are supposed to be dynamically allocated, but with recent changes
> > this rule was violated. Reverting those commits fixes crashes when a drm
> > driver using TTM gets loaded again.
> >
> > The object in question is "ttm_mem_glob" declared inside
> > "include/drm/ttm/ttm_memory.h" and instatiated inside
> > "drivers/gpu/drm/ttm/ttm_memory.c".
> >
> > from "Documentation/kobject.txt":
> > "Because kobjects are dynamic, they must not be declared statically or on
> > the stack, but instead, always allocated dynamically.  Future versions of
> > the kernel will contain a run-time check for kobjects that are created
> > statically and will warn the developer of this improper usage."
> >
> > Unloading ttm before reloading the driver workarounds that crash, because
> > the memory backing the kobject member "kobj" is cleaned up. The kobject_del
> > and kobject_put function never free or clean up the kobject object leaving
> > it in an undefined state.
> >
> > I reverted a few more commits to make it less painful for me to rever this
> > rather big change.
>
> Well, NAK. By reverting those change you also re-introduced the problems
> we originally fixed with those patches.
>
> Please work on a proper fix instead,
> Christian.

And which problem was that besides duplicated code (or maybe even a
bit of memory consumption if multiple ttm driver were used)? If I had
to choose between duplicated code and a crash, I choose the former.

Maybe I missed the real reason why those changes are made, but the
commit messages don't really seem to tell me.

>
> >
> > dmesg output:
> > [54758.418036] kobject (00000000687a067d): tried to init an initialized object, something is seriously wrong.
> > [54758.418040] CPU: 6 PID: 26746 Comm: insmod Tainted: G     U     OE     5.0.6-200.fc29.x86_64 #1
> > [54758.418041] Hardware name: Dell Inc. XPS 15 9560/05FFDN, BIOS 1.12.1 10/02/2018
> > [54758.418041] Call Trace:
> > [54758.418049]  dump_stack+0x5c/0x80
> > [54758.418054]  kobject_init.cold.9+0x31/0x3f
> > [54758.418057]  kobject_init_and_add+0x35/0xa0
> > [54758.418063]  ttm_mem_global_init+0x8f/0x2b0 [ttm]
> > [54758.418067]  ? __debugfs_create_file+0xe1/0x110
> > [54758.418071]  ttm_bo_device_init+0x198/0x2a0 [ttm]
> > [54758.418144]  nouveau_ttm_init+0xbf/0x340 [nouveau]
> > [54758.418206]  nouveau_drm_device_init+0x125/0x7d0 [nouveau]
> > [54758.418210]  ? pci_bus_read_config_word+0x49/0x70
> > [54758.418266]  nouveau_drm_probe+0x26f/0x2c0 [nouveau]
> > [54758.418270]  local_pci_probe+0x41/0x90
> > [54758.418272]  pci_device_probe+0x118/0x1a0
> > [54758.418275]  really_probe+0xf8/0x3b0
> > [54758.418277]  driver_probe_device+0xb3/0xf0
> > [54758.418278]  __driver_attach+0xdd/0x110
> > [54758.418280]  ? driver_probe_device+0xf0/0xf0
> > [54758.418282]  bus_for_each_dev+0x77/0xc0
> > [54758.418285]  ? klist_add_tail+0x3b/0x60
> > [54758.418287]  bus_add_driver+0x152/0x230
> > [54758.418288]  ? 0xffffffffc1027000
> > [54758.418290]  driver_register+0x6b/0xb0
> > [54758.418291]  ? 0xffffffffc1027000
> > [54758.418294]  do_one_initcall+0x46/0x1c3
> > [54758.418296]  ? _cond_resched+0x15/0x30
> > [54758.418299]  ? kmem_cache_alloc_trace+0x154/0x1d0
> > [54758.418302]  do_init_module+0x5a/0x210
> > [54758.418304]  load_module+0x2096/0x22d0
> > [54758.418308]  ? ima_post_read_file+0xf4/0x100
> > [54758.418310]  ? __do_sys_finit_module+0xa8/0x110
> > [54758.418312]  __do_sys_finit_module+0xa8/0x110
> > [54758.418315]  do_syscall_64+0x5b/0x160
> > [54758.418317]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > [54758.418319] RIP: 0033:0x7fc0b38b6edd
> > [54758.418321] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 7b 7f 0c 00 f7 d8 64 89 01 48
> > [54758.418322] RSP: 002b:00007ffc7f1620d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
> > [54758.418323] RAX: ffffffffffffffda RBX: 00005629a3996ac0 RCX: 00007fc0b38b6edd
> > [54758.418324] RDX: 0000000000000000 RSI: 00005629a3996260 RDI: 0000000000000003
> > [54758.418325] RBP: 00005629a3996260 R08: 0000000000000000 R09: 0000000000000000
> > [54758.418326] R10: 0000000000000003 R11: 0000000000000246 R12: 0000000000000000
> > [54758.418326] R13: 00005629a3996a80 R14: 0000000000000000 R15: 00005629a3996260
> > [54758.418346] BUG: unable to handle kernel paging request at 00000004ee194700
> > [54758.418348] #PF error: [WRITE]
> > [54758.418349] PGD 0 P4D 0
> > [54758.418352] Oops: 0002 [#1] SMP PTI
> > [54758.418354] CPU: 6 PID: 26746 Comm: insmod Tainted: G     U     OE     5.0.6-200.fc29.x86_64 #1
> > [54758.418355] Hardware name: Dell Inc. XPS 15 9560/05FFDN, BIOS 1.12.1 10/02/2018
> > [54758.418360] RIP: 0010:ttm_mem_global_init+0x1fe/0x2b0 [ttm]
> > [54758.418361] Code: 00 00 00 48 89 5d 40 48 89 ab a0 00 00 00 e8 79 ba 95 f8 85 c0 0f 85 a2 00 00 00 8b 83 90 00 00 00 8d 50 01 89 93 90 00 00 00 <48> 89 ac c3 80 00 00 00 85 d2 0f 85 c6 99 00 00 48 8b 83 98 00 00
> > [54758.418363] RSP: 0018:ffffc24e53c1f988 EFLAGS: 00010246
> > [54758.418364] RAX: 00000000a5a2f300 RBX: ffffffffc101ae80 RCX: 0000000000000000
> > [54758.418366] RDX: 00000000a5a2f301 RSI: 0000000000000000 RDI: ffff9ebe78b67330
> > [54758.418367] RBP: ffff9ebe42c79d00 R08: 0000000000000044 R09: 0000000000000228
> > [54758.418368] R10: 0000000000000000 R11: ffff9ebb62720bb8 R12: 0000000000000000
> > [54758.418370] R13: ffff9ebe27d71640 R14: ffff9eba9f258600 R15: ffffffffc1453880
> > [54758.418371] FS:  00007fc0b379c280(0000) GS:ffff9ebeae380000(0000) knlGS:0000000000000000
> > [54758.418373] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [54758.418374] CR2: 00000004ee194700 CR3: 000000045069e005 CR4: 00000000003606e0
> > [54758.418375] Call Trace:
> > [54758.418381]  ttm_bo_device_init+0x198/0x2a0 [ttm]
> > [54758.418439]  nouveau_ttm_init+0xbf/0x340 [nouveau]
> > [54758.418495]  nouveau_drm_device_init+0x125/0x7d0 [nouveau]
> > [54758.418499]  ? pci_bus_read_config_word+0x49/0x70
> > [54758.418552]  nouveau_drm_probe+0x26f/0x2c0 [nouveau]
> > [54758.418556]  local_pci_probe+0x41/0x90
> > [54758.418558]  pci_device_probe+0x118/0x1a0
> > [54758.418561]  really_probe+0xf8/0x3b0
> > [54758.418563]  driver_probe_device+0xb3/0xf0
> > [54758.418565]  __driver_attach+0xdd/0x110
> > [54758.418567]  ? driver_probe_device+0xf0/0xf0
> > [54758.418570]  bus_for_each_dev+0x77/0xc0
> > [54758.418573]  ? klist_add_tail+0x3b/0x60
> > [54758.418574]  bus_add_driver+0x152/0x230
> > [54758.418576]  ? 0xffffffffc1027000
> > [54758.418578]  driver_register+0x6b/0xb0
> > [54758.418580]  ? 0xffffffffc1027000
> > [54758.418583]  do_one_initcall+0x46/0x1c3
> > [54758.418585]  ? _cond_resched+0x15/0x30
> > [54758.418587]  ? kmem_cache_alloc_trace+0x154/0x1d0
> > [54758.418591]  do_init_module+0x5a/0x210
> > [54758.418593]  load_module+0x2096/0x22d0
> > [54758.418596]  ? ima_post_read_file+0xf4/0x100
> > [54758.418599]  ? __do_sys_finit_module+0xa8/0x110
> > [54758.418601]  __do_sys_finit_module+0xa8/0x110
> > [54758.418604]  do_syscall_64+0x5b/0x160
> > [54758.418607]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > [54758.418609] RIP: 0033:0x7fc0b38b6edd
> > [54758.418610] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 7b 7f 0c 00 f7 d8 64 89 01 48
> > [54758.418612] RSP: 002b:00007ffc7f1620d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
> > [54758.418614] RAX: ffffffffffffffda RBX: 00005629a3996ac0 RCX: 00007fc0b38b6edd
> > [54758.418615] RDX: 0000000000000000 RSI: 00005629a3996260 RDI: 0000000000000003
> > [54758.418616] RBP: 00005629a3996260 R08: 0000000000000000 R09: 0000000000000000
> > [54758.418618] R10: 0000000000000003 R11: 0000000000000246 R12: 0000000000000000
> > [54758.418619] R13: 00005629a3996a80 R14: 0000000000000000 R15: 00005629a3996260
> > [54758.418620] Modules linked in: nouveau(OE+) acpi_call(OE) ttm rfcomm ccm xt_CHECKSUM ipt_MASQUERADE tun bridge stp llc devlink ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ebtable_nat ip6table_nat nf_nat_ipv6 ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat_ipv4 nf_nat iptable_mangle iptable_raw iptable_security nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c ip_set nfnetlink ebtable_filter ebtables ip6table_filter ip6_tables cmac bnep sunrpc vfat fat btusb btrtl btbcm btintel uvcvideo videobuf2_vmalloc bluetooth videobuf2_memops videobuf2_v4l2 videobuf2_common videodev media ecdh_generic arc4 snd_hda_codec_hdmi ath10k_pci ath10k_core snd_hda_codec_realtek iTCO_wdt mei_wdt iTCO_vendor_support snd_hda_codec_generic mac80211 intel_rapl snd_hda_intel x86_pkg_temp_thermal intel_powerclamp coretemp snd_hda_codec dell_laptop ledtrig_audio dell_smm_hwmon snd_hda_core kvm_intel snd_hwdep ath snd_seq snd_seq_device intel_cstate cfg80211 snd_pcm dell_wmi intel_uncore
> > [54758.418653]  dell_smbios dcdbas intel_rapl_perf snd_timer joydev dell_wmi_descriptor intel_wmi_thunderbolt wmi_bmof snd idma64 soundcore i2c_i801 rfkill rtsx_pci_ms memstick mei_me mei processor_thermal_device intel_lpss_pci intel_soc_dts_iosf intel_pch_thermal intel_lpss int3400_thermal acpi_pad acpi_thermal_rel int3403_thermal intel_hid pcc_cpufreq int340x_thermal_zone sparse_keymap binfmt_misc zram dm_crypt hid_multitouch i915 kvmgt mdev vfio kvm irqbypass i2c_algo_bit crct10dif_pclmul drm_kms_helper rtsx_pci_sdmmc crc32_pclmul mmc_core crc32c_intel mxm_wmi drm nvme ghash_clmulni_intel nvme_core serio_raw rtsx_pci i2c_hid video wmi i2c_dev lz4 lz4_compress [last unloaded: nouveau]
> > [54758.418679] CR2: 00000004ee194700
> > [54758.418681] ---[ end trace c5175234e6efc034 ]---
> > [54758.418686] RIP: 0010:ttm_mem_global_init+0x1fe/0x2b0 [ttm]
> > [54758.418688] Code: 00 00 00 48 89 5d 40 48 89 ab a0 00 00 00 e8 79 ba 95 f8 85 c0 0f 85 a2 00 00 00 8b 83 90 00 00 00 8d 50 01 89 93 90 00 00 00 <48> 89 ac c3 80 00 00 00 85 d2 0f 85 c6 99 00 00 48 8b 83 98 00 00
> > [54758.418689] RSP: 0018:ffffc24e53c1f988 EFLAGS: 00010246
> > [54758.418690] RAX: 00000000a5a2f300 RBX: ffffffffc101ae80 RCX: 0000000000000000
> > [54758.418691] RDX: 00000000a5a2f301 RSI: 0000000000000000 RDI: ffff9ebe78b67330
> > [54758.418692] RBP: ffff9ebe42c79d00 R08: 0000000000000044 R09: 0000000000000228
> > [54758.418694] R10: 0000000000000000 R11: ffff9ebb62720bb8 R12: 0000000000000000
> > [54758.418695] R13: ffff9ebe27d71640 R14: ffff9eba9f258600 R15: ffffffffc1453880
> > [54758.418696] FS:  00007fc0b379c280(0000) GS:ffff9ebeae380000(0000) knlGS:0000000000000000
> > [54758.418698] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [54758.418699] CR2: 00000004ee194700 CR3: 000000045069e005 CR4: 00000000003606e0
> >
> > Karol Herbst (6):
> >    Revert "drm: Remove drm_global.{c,h} v2"
> >    Revert "drm/ttm: initialize globals during device init (v2)"
> >    Revert "drm/ttm: Fix bo_global and mem_global kfree error"
> >    Revert "drm/ttm: use a static ttm_bo_global instance"
> >    Revert "drm/ttm: make the device list mutex static"
> >    Revert "drm/ttm: use a static ttm_mem_global instance"
> >
> >   drivers/gpu/drm/Makefile                      |   2 +-
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c       | 103 ++++++++++++-
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h       |   2 +
> >   drivers/gpu/drm/ast/ast_drv.h                 |   2 +
> >   drivers/gpu/drm/ast/ast_ttm.c                 |  64 ++++++++
> >   drivers/gpu/drm/bochs/bochs.h                 |   2 +
> >   drivers/gpu/drm/bochs/bochs_mm.c              |  61 ++++++++
> >   drivers/gpu/drm/cirrus/cirrus_drv.h           |   2 +
> >   drivers/gpu/drm/cirrus/cirrus_ttm.c           |  64 ++++++++
> >   drivers/gpu/drm/drm_drv.c                     |   2 +
> >   drivers/gpu/drm/drm_global.c                  | 137 ++++++++++++++++++
> >   .../gpu/drm/hisilicon/hibmc/hibmc_drm_drv.h   |   2 +
> >   drivers/gpu/drm/hisilicon/hibmc/hibmc_ttm.c   |  57 ++++++++
> >   drivers/gpu/drm/mgag200/mgag200_drv.h         |   2 +
> >   drivers/gpu/drm/mgag200/mgag200_ttm.c         |  64 ++++++++
> >   drivers/gpu/drm/nouveau/nouveau_drv.h         |   2 +
> >   drivers/gpu/drm/nouveau/nouveau_ttm.c         |  67 +++++++++
> >   drivers/gpu/drm/qxl/qxl_drv.h                 |   3 +
> >   drivers/gpu/drm/qxl/qxl_ttm.c                 |  61 ++++++++
> >   drivers/gpu/drm/radeon/radeon.h               |   3 +
> >   drivers/gpu/drm/radeon/radeon_ttm.c           |  65 +++++++++
> >   drivers/gpu/drm/ttm/ttm_bo.c                  |  67 +++------
> >   drivers/gpu/drm/ttm/ttm_memory.c              |  14 +-
> >   drivers/gpu/drm/virtio/virtgpu_drv.h          |   3 +
> >   drivers/gpu/drm/virtio/virtgpu_ttm.c          |  62 ++++++++
> >   drivers/gpu/drm/vmwgfx/vmwgfx_drv.c           |  15 +-
> >   drivers/gpu/drm/vmwgfx/vmwgfx_drv.h           |   6 +-
> >   drivers/gpu/drm/vmwgfx/vmwgfx_ttm_glue.c      |  54 +++++++
> >   drivers/staging/vboxvideo/vbox_drv.h          |   2 +
> >   drivers/staging/vboxvideo/vbox_ttm.c          |  65 ++++++++-
> >   include/drm/drmP.h                            |   1 +
> >   include/drm/drm_global.h                      |  53 +++++++
> >   include/drm/ttm/ttm_bo_driver.h               |  54 ++++++-
> >   include/drm/ttm/ttm_memory.h                  |   4 +-
> >   34 files changed, 1102 insertions(+), 65 deletions(-)
> >   create mode 100644 drivers/gpu/drm/drm_global.c
> >   create mode 100644 include/drm/drm_global.h
> >
>
>
Christian König April 16, 2019, 9:12 a.m. UTC | #3
Am 16.04.19 um 11:10 schrieb Karol Herbst:
> On Tue, Apr 16, 2019 at 8:38 AM Christian König
> <ckoenig.leichtzumerken@gmail.com> wrote:
>> Am 16.04.19 um 02:35 schrieb Karol Herbst:
>>> Kobjects are supposed to be dynamically allocated, but with recent changes
>>> this rule was violated. Reverting those commits fixes crashes when a drm
>>> driver using TTM gets loaded again.
>>>
>>> The object in question is "ttm_mem_glob" declared inside
>>> "include/drm/ttm/ttm_memory.h" and instatiated inside
>>> "drivers/gpu/drm/ttm/ttm_memory.c".
>>>
>>> from "Documentation/kobject.txt":
>>> "Because kobjects are dynamic, they must not be declared statically or on
>>> the stack, but instead, always allocated dynamically.  Future versions of
>>> the kernel will contain a run-time check for kobjects that are created
>>> statically and will warn the developer of this improper usage."
>>>
>>> Unloading ttm before reloading the driver workarounds that crash, because
>>> the memory backing the kobject member "kobj" is cleaned up. The kobject_del
>>> and kobject_put function never free or clean up the kobject object leaving
>>> it in an undefined state.
>>>
>>> I reverted a few more commits to make it less painful for me to rever this
>>> rather big change.
>> Well, NAK. By reverting those change you also re-introduced the problems
>> we originally fixed with those patches.
>>
>> Please work on a proper fix instead,
>> Christian.
> And which problem was that besides duplicated code (or maybe even a
> bit of memory consumption if multiple ttm driver were used)? If I had
> to choose between duplicated code and a crash, I choose the former.
>
> Maybe I missed the real reason why those changes are made, but the
> commit messages don't really seem to tell me.

The old implementation crashed because different drivers tried to 
allocate the same kobj.

Crashing in one way is not better than crashing in another way.

Christian.

>
>>> dmesg output:
>>> [54758.418036] kobject (00000000687a067d): tried to init an initialized object, something is seriously wrong.
>>> [54758.418040] CPU: 6 PID: 26746 Comm: insmod Tainted: G     U     OE     5.0.6-200.fc29.x86_64 #1
>>> [54758.418041] Hardware name: Dell Inc. XPS 15 9560/05FFDN, BIOS 1.12.1 10/02/2018
>>> [54758.418041] Call Trace:
>>> [54758.418049]  dump_stack+0x5c/0x80
>>> [54758.418054]  kobject_init.cold.9+0x31/0x3f
>>> [54758.418057]  kobject_init_and_add+0x35/0xa0
>>> [54758.418063]  ttm_mem_global_init+0x8f/0x2b0 [ttm]
>>> [54758.418067]  ? __debugfs_create_file+0xe1/0x110
>>> [54758.418071]  ttm_bo_device_init+0x198/0x2a0 [ttm]
>>> [54758.418144]  nouveau_ttm_init+0xbf/0x340 [nouveau]
>>> [54758.418206]  nouveau_drm_device_init+0x125/0x7d0 [nouveau]
>>> [54758.418210]  ? pci_bus_read_config_word+0x49/0x70
>>> [54758.418266]  nouveau_drm_probe+0x26f/0x2c0 [nouveau]
>>> [54758.418270]  local_pci_probe+0x41/0x90
>>> [54758.418272]  pci_device_probe+0x118/0x1a0
>>> [54758.418275]  really_probe+0xf8/0x3b0
>>> [54758.418277]  driver_probe_device+0xb3/0xf0
>>> [54758.418278]  __driver_attach+0xdd/0x110
>>> [54758.418280]  ? driver_probe_device+0xf0/0xf0
>>> [54758.418282]  bus_for_each_dev+0x77/0xc0
>>> [54758.418285]  ? klist_add_tail+0x3b/0x60
>>> [54758.418287]  bus_add_driver+0x152/0x230
>>> [54758.418288]  ? 0xffffffffc1027000
>>> [54758.418290]  driver_register+0x6b/0xb0
>>> [54758.418291]  ? 0xffffffffc1027000
>>> [54758.418294]  do_one_initcall+0x46/0x1c3
>>> [54758.418296]  ? _cond_resched+0x15/0x30
>>> [54758.418299]  ? kmem_cache_alloc_trace+0x154/0x1d0
>>> [54758.418302]  do_init_module+0x5a/0x210
>>> [54758.418304]  load_module+0x2096/0x22d0
>>> [54758.418308]  ? ima_post_read_file+0xf4/0x100
>>> [54758.418310]  ? __do_sys_finit_module+0xa8/0x110
>>> [54758.418312]  __do_sys_finit_module+0xa8/0x110
>>> [54758.418315]  do_syscall_64+0x5b/0x160
>>> [54758.418317]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>> [54758.418319] RIP: 0033:0x7fc0b38b6edd
>>> [54758.418321] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 7b 7f 0c 00 f7 d8 64 89 01 48
>>> [54758.418322] RSP: 002b:00007ffc7f1620d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
>>> [54758.418323] RAX: ffffffffffffffda RBX: 00005629a3996ac0 RCX: 00007fc0b38b6edd
>>> [54758.418324] RDX: 0000000000000000 RSI: 00005629a3996260 RDI: 0000000000000003
>>> [54758.418325] RBP: 00005629a3996260 R08: 0000000000000000 R09: 0000000000000000
>>> [54758.418326] R10: 0000000000000003 R11: 0000000000000246 R12: 0000000000000000
>>> [54758.418326] R13: 00005629a3996a80 R14: 0000000000000000 R15: 00005629a3996260
>>> [54758.418346] BUG: unable to handle kernel paging request at 00000004ee194700
>>> [54758.418348] #PF error: [WRITE]
>>> [54758.418349] PGD 0 P4D 0
>>> [54758.418352] Oops: 0002 [#1] SMP PTI
>>> [54758.418354] CPU: 6 PID: 26746 Comm: insmod Tainted: G     U     OE     5.0.6-200.fc29.x86_64 #1
>>> [54758.418355] Hardware name: Dell Inc. XPS 15 9560/05FFDN, BIOS 1.12.1 10/02/2018
>>> [54758.418360] RIP: 0010:ttm_mem_global_init+0x1fe/0x2b0 [ttm]
>>> [54758.418361] Code: 00 00 00 48 89 5d 40 48 89 ab a0 00 00 00 e8 79 ba 95 f8 85 c0 0f 85 a2 00 00 00 8b 83 90 00 00 00 8d 50 01 89 93 90 00 00 00 <48> 89 ac c3 80 00 00 00 85 d2 0f 85 c6 99 00 00 48 8b 83 98 00 00
>>> [54758.418363] RSP: 0018:ffffc24e53c1f988 EFLAGS: 00010246
>>> [54758.418364] RAX: 00000000a5a2f300 RBX: ffffffffc101ae80 RCX: 0000000000000000
>>> [54758.418366] RDX: 00000000a5a2f301 RSI: 0000000000000000 RDI: ffff9ebe78b67330
>>> [54758.418367] RBP: ffff9ebe42c79d00 R08: 0000000000000044 R09: 0000000000000228
>>> [54758.418368] R10: 0000000000000000 R11: ffff9ebb62720bb8 R12: 0000000000000000
>>> [54758.418370] R13: ffff9ebe27d71640 R14: ffff9eba9f258600 R15: ffffffffc1453880
>>> [54758.418371] FS:  00007fc0b379c280(0000) GS:ffff9ebeae380000(0000) knlGS:0000000000000000
>>> [54758.418373] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [54758.418374] CR2: 00000004ee194700 CR3: 000000045069e005 CR4: 00000000003606e0
>>> [54758.418375] Call Trace:
>>> [54758.418381]  ttm_bo_device_init+0x198/0x2a0 [ttm]
>>> [54758.418439]  nouveau_ttm_init+0xbf/0x340 [nouveau]
>>> [54758.418495]  nouveau_drm_device_init+0x125/0x7d0 [nouveau]
>>> [54758.418499]  ? pci_bus_read_config_word+0x49/0x70
>>> [54758.418552]  nouveau_drm_probe+0x26f/0x2c0 [nouveau]
>>> [54758.418556]  local_pci_probe+0x41/0x90
>>> [54758.418558]  pci_device_probe+0x118/0x1a0
>>> [54758.418561]  really_probe+0xf8/0x3b0
>>> [54758.418563]  driver_probe_device+0xb3/0xf0
>>> [54758.418565]  __driver_attach+0xdd/0x110
>>> [54758.418567]  ? driver_probe_device+0xf0/0xf0
>>> [54758.418570]  bus_for_each_dev+0x77/0xc0
>>> [54758.418573]  ? klist_add_tail+0x3b/0x60
>>> [54758.418574]  bus_add_driver+0x152/0x230
>>> [54758.418576]  ? 0xffffffffc1027000
>>> [54758.418578]  driver_register+0x6b/0xb0
>>> [54758.418580]  ? 0xffffffffc1027000
>>> [54758.418583]  do_one_initcall+0x46/0x1c3
>>> [54758.418585]  ? _cond_resched+0x15/0x30
>>> [54758.418587]  ? kmem_cache_alloc_trace+0x154/0x1d0
>>> [54758.418591]  do_init_module+0x5a/0x210
>>> [54758.418593]  load_module+0x2096/0x22d0
>>> [54758.418596]  ? ima_post_read_file+0xf4/0x100
>>> [54758.418599]  ? __do_sys_finit_module+0xa8/0x110
>>> [54758.418601]  __do_sys_finit_module+0xa8/0x110
>>> [54758.418604]  do_syscall_64+0x5b/0x160
>>> [54758.418607]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>> [54758.418609] RIP: 0033:0x7fc0b38b6edd
>>> [54758.418610] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 7b 7f 0c 00 f7 d8 64 89 01 48
>>> [54758.418612] RSP: 002b:00007ffc7f1620d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
>>> [54758.418614] RAX: ffffffffffffffda RBX: 00005629a3996ac0 RCX: 00007fc0b38b6edd
>>> [54758.418615] RDX: 0000000000000000 RSI: 00005629a3996260 RDI: 0000000000000003
>>> [54758.418616] RBP: 00005629a3996260 R08: 0000000000000000 R09: 0000000000000000
>>> [54758.418618] R10: 0000000000000003 R11: 0000000000000246 R12: 0000000000000000
>>> [54758.418619] R13: 00005629a3996a80 R14: 0000000000000000 R15: 00005629a3996260
>>> [54758.418620] Modules linked in: nouveau(OE+) acpi_call(OE) ttm rfcomm ccm xt_CHECKSUM ipt_MASQUERADE tun bridge stp llc devlink ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ebtable_nat ip6table_nat nf_nat_ipv6 ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat_ipv4 nf_nat iptable_mangle iptable_raw iptable_security nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c ip_set nfnetlink ebtable_filter ebtables ip6table_filter ip6_tables cmac bnep sunrpc vfat fat btusb btrtl btbcm btintel uvcvideo videobuf2_vmalloc bluetooth videobuf2_memops videobuf2_v4l2 videobuf2_common videodev media ecdh_generic arc4 snd_hda_codec_hdmi ath10k_pci ath10k_core snd_hda_codec_realtek iTCO_wdt mei_wdt iTCO_vendor_support snd_hda_codec_generic mac80211 intel_rapl snd_hda_intel x86_pkg_temp_thermal intel_powerclamp coretemp snd_hda_codec dell_laptop ledtrig_audio dell_smm_hwmon snd_hda_core kvm_intel snd_hwdep ath snd_seq snd_seq_device intel_cstate cfg80211 snd_pcm dell_wmi intel_uncore
>>> [54758.418653]  dell_smbios dcdbas intel_rapl_perf snd_timer joydev dell_wmi_descriptor intel_wmi_thunderbolt wmi_bmof snd idma64 soundcore i2c_i801 rfkill rtsx_pci_ms memstick mei_me mei processor_thermal_device intel_lpss_pci intel_soc_dts_iosf intel_pch_thermal intel_lpss int3400_thermal acpi_pad acpi_thermal_rel int3403_thermal intel_hid pcc_cpufreq int340x_thermal_zone sparse_keymap binfmt_misc zram dm_crypt hid_multitouch i915 kvmgt mdev vfio kvm irqbypass i2c_algo_bit crct10dif_pclmul drm_kms_helper rtsx_pci_sdmmc crc32_pclmul mmc_core crc32c_intel mxm_wmi drm nvme ghash_clmulni_intel nvme_core serio_raw rtsx_pci i2c_hid video wmi i2c_dev lz4 lz4_compress [last unloaded: nouveau]
>>> [54758.418679] CR2: 00000004ee194700
>>> [54758.418681] ---[ end trace c5175234e6efc034 ]---
>>> [54758.418686] RIP: 0010:ttm_mem_global_init+0x1fe/0x2b0 [ttm]
>>> [54758.418688] Code: 00 00 00 48 89 5d 40 48 89 ab a0 00 00 00 e8 79 ba 95 f8 85 c0 0f 85 a2 00 00 00 8b 83 90 00 00 00 8d 50 01 89 93 90 00 00 00 <48> 89 ac c3 80 00 00 00 85 d2 0f 85 c6 99 00 00 48 8b 83 98 00 00
>>> [54758.418689] RSP: 0018:ffffc24e53c1f988 EFLAGS: 00010246
>>> [54758.418690] RAX: 00000000a5a2f300 RBX: ffffffffc101ae80 RCX: 0000000000000000
>>> [54758.418691] RDX: 00000000a5a2f301 RSI: 0000000000000000 RDI: ffff9ebe78b67330
>>> [54758.418692] RBP: ffff9ebe42c79d00 R08: 0000000000000044 R09: 0000000000000228
>>> [54758.418694] R10: 0000000000000000 R11: ffff9ebb62720bb8 R12: 0000000000000000
>>> [54758.418695] R13: ffff9ebe27d71640 R14: ffff9eba9f258600 R15: ffffffffc1453880
>>> [54758.418696] FS:  00007fc0b379c280(0000) GS:ffff9ebeae380000(0000) knlGS:0000000000000000
>>> [54758.418698] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [54758.418699] CR2: 00000004ee194700 CR3: 000000045069e005 CR4: 00000000003606e0
>>>
>>> Karol Herbst (6):
>>>     Revert "drm: Remove drm_global.{c,h} v2"
>>>     Revert "drm/ttm: initialize globals during device init (v2)"
>>>     Revert "drm/ttm: Fix bo_global and mem_global kfree error"
>>>     Revert "drm/ttm: use a static ttm_bo_global instance"
>>>     Revert "drm/ttm: make the device list mutex static"
>>>     Revert "drm/ttm: use a static ttm_mem_global instance"
>>>
>>>    drivers/gpu/drm/Makefile                      |   2 +-
>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c       | 103 ++++++++++++-
>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h       |   2 +
>>>    drivers/gpu/drm/ast/ast_drv.h                 |   2 +
>>>    drivers/gpu/drm/ast/ast_ttm.c                 |  64 ++++++++
>>>    drivers/gpu/drm/bochs/bochs.h                 |   2 +
>>>    drivers/gpu/drm/bochs/bochs_mm.c              |  61 ++++++++
>>>    drivers/gpu/drm/cirrus/cirrus_drv.h           |   2 +
>>>    drivers/gpu/drm/cirrus/cirrus_ttm.c           |  64 ++++++++
>>>    drivers/gpu/drm/drm_drv.c                     |   2 +
>>>    drivers/gpu/drm/drm_global.c                  | 137 ++++++++++++++++++
>>>    .../gpu/drm/hisilicon/hibmc/hibmc_drm_drv.h   |   2 +
>>>    drivers/gpu/drm/hisilicon/hibmc/hibmc_ttm.c   |  57 ++++++++
>>>    drivers/gpu/drm/mgag200/mgag200_drv.h         |   2 +
>>>    drivers/gpu/drm/mgag200/mgag200_ttm.c         |  64 ++++++++
>>>    drivers/gpu/drm/nouveau/nouveau_drv.h         |   2 +
>>>    drivers/gpu/drm/nouveau/nouveau_ttm.c         |  67 +++++++++
>>>    drivers/gpu/drm/qxl/qxl_drv.h                 |   3 +
>>>    drivers/gpu/drm/qxl/qxl_ttm.c                 |  61 ++++++++
>>>    drivers/gpu/drm/radeon/radeon.h               |   3 +
>>>    drivers/gpu/drm/radeon/radeon_ttm.c           |  65 +++++++++
>>>    drivers/gpu/drm/ttm/ttm_bo.c                  |  67 +++------
>>>    drivers/gpu/drm/ttm/ttm_memory.c              |  14 +-
>>>    drivers/gpu/drm/virtio/virtgpu_drv.h          |   3 +
>>>    drivers/gpu/drm/virtio/virtgpu_ttm.c          |  62 ++++++++
>>>    drivers/gpu/drm/vmwgfx/vmwgfx_drv.c           |  15 +-
>>>    drivers/gpu/drm/vmwgfx/vmwgfx_drv.h           |   6 +-
>>>    drivers/gpu/drm/vmwgfx/vmwgfx_ttm_glue.c      |  54 +++++++
>>>    drivers/staging/vboxvideo/vbox_drv.h          |   2 +
>>>    drivers/staging/vboxvideo/vbox_ttm.c          |  65 ++++++++-
>>>    include/drm/drmP.h                            |   1 +
>>>    include/drm/drm_global.h                      |  53 +++++++
>>>    include/drm/ttm/ttm_bo_driver.h               |  54 ++++++-
>>>    include/drm/ttm/ttm_memory.h                  |   4 +-
>>>    34 files changed, 1102 insertions(+), 65 deletions(-)
>>>    create mode 100644 drivers/gpu/drm/drm_global.c
>>>    create mode 100644 include/drm/drm_global.h
>>>
>>
Karol Herbst April 16, 2019, 10:54 a.m. UTC | #4
On Tue, Apr 16, 2019 at 11:12 AM Koenig, Christian
<Christian.Koenig@amd.com> wrote:
>
> Am 16.04.19 um 11:10 schrieb Karol Herbst:
> > On Tue, Apr 16, 2019 at 8:38 AM Christian König
> > <ckoenig.leichtzumerken@gmail.com> wrote:
> >> Am 16.04.19 um 02:35 schrieb Karol Herbst:
> >>> Kobjects are supposed to be dynamically allocated, but with recent changes
> >>> this rule was violated. Reverting those commits fixes crashes when a drm
> >>> driver using TTM gets loaded again.
> >>>
> >>> The object in question is "ttm_mem_glob" declared inside
> >>> "include/drm/ttm/ttm_memory.h" and instatiated inside
> >>> "drivers/gpu/drm/ttm/ttm_memory.c".
> >>>
> >>> from "Documentation/kobject.txt":
> >>> "Because kobjects are dynamic, they must not be declared statically or on
> >>> the stack, but instead, always allocated dynamically.  Future versions of
> >>> the kernel will contain a run-time check for kobjects that are created
> >>> statically and will warn the developer of this improper usage."
> >>>
> >>> Unloading ttm before reloading the driver workarounds that crash, because
> >>> the memory backing the kobject member "kobj" is cleaned up. The kobject_del
> >>> and kobject_put function never free or clean up the kobject object leaving
> >>> it in an undefined state.
> >>>
> >>> I reverted a few more commits to make it less painful for me to rever this
> >>> rather big change.
> >> Well, NAK. By reverting those change you also re-introduced the problems
> >> we originally fixed with those patches.
> >>
> >> Please work on a proper fix instead,
> >> Christian.
> > And which problem was that besides duplicated code (or maybe even a
> > bit of memory consumption if multiple ttm driver were used)? If I had
> > to choose between duplicated code and a crash, I choose the former.
> >
> > Maybe I missed the real reason why those changes are made, but the
> > commit messages don't really seem to tell me.
>
> The old implementation crashed because different drivers tried to
> allocate the same kobj.
>
> Crashing in one way is not better than crashing in another way.
>
> Christian.
>

how can that old crash be triggered? By loading two ttm based drivers
at the same time or by other means?

> >
> >>> dmesg output:
> >>> [54758.418036] kobject (00000000687a067d): tried to init an initialized object, something is seriously wrong.
> >>> [54758.418040] CPU: 6 PID: 26746 Comm: insmod Tainted: G     U     OE     5.0.6-200.fc29.x86_64 #1
> >>> [54758.418041] Hardware name: Dell Inc. XPS 15 9560/05FFDN, BIOS 1.12.1 10/02/2018
> >>> [54758.418041] Call Trace:
> >>> [54758.418049]  dump_stack+0x5c/0x80
> >>> [54758.418054]  kobject_init.cold.9+0x31/0x3f
> >>> [54758.418057]  kobject_init_and_add+0x35/0xa0
> >>> [54758.418063]  ttm_mem_global_init+0x8f/0x2b0 [ttm]
> >>> [54758.418067]  ? __debugfs_create_file+0xe1/0x110
> >>> [54758.418071]  ttm_bo_device_init+0x198/0x2a0 [ttm]
> >>> [54758.418144]  nouveau_ttm_init+0xbf/0x340 [nouveau]
> >>> [54758.418206]  nouveau_drm_device_init+0x125/0x7d0 [nouveau]
> >>> [54758.418210]  ? pci_bus_read_config_word+0x49/0x70
> >>> [54758.418266]  nouveau_drm_probe+0x26f/0x2c0 [nouveau]
> >>> [54758.418270]  local_pci_probe+0x41/0x90
> >>> [54758.418272]  pci_device_probe+0x118/0x1a0
> >>> [54758.418275]  really_probe+0xf8/0x3b0
> >>> [54758.418277]  driver_probe_device+0xb3/0xf0
> >>> [54758.418278]  __driver_attach+0xdd/0x110
> >>> [54758.418280]  ? driver_probe_device+0xf0/0xf0
> >>> [54758.418282]  bus_for_each_dev+0x77/0xc0
> >>> [54758.418285]  ? klist_add_tail+0x3b/0x60
> >>> [54758.418287]  bus_add_driver+0x152/0x230
> >>> [54758.418288]  ? 0xffffffffc1027000
> >>> [54758.418290]  driver_register+0x6b/0xb0
> >>> [54758.418291]  ? 0xffffffffc1027000
> >>> [54758.418294]  do_one_initcall+0x46/0x1c3
> >>> [54758.418296]  ? _cond_resched+0x15/0x30
> >>> [54758.418299]  ? kmem_cache_alloc_trace+0x154/0x1d0
> >>> [54758.418302]  do_init_module+0x5a/0x210
> >>> [54758.418304]  load_module+0x2096/0x22d0
> >>> [54758.418308]  ? ima_post_read_file+0xf4/0x100
> >>> [54758.418310]  ? __do_sys_finit_module+0xa8/0x110
> >>> [54758.418312]  __do_sys_finit_module+0xa8/0x110
> >>> [54758.418315]  do_syscall_64+0x5b/0x160
> >>> [54758.418317]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> >>> [54758.418319] RIP: 0033:0x7fc0b38b6edd
> >>> [54758.418321] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 7b 7f 0c 00 f7 d8 64 89 01 48
> >>> [54758.418322] RSP: 002b:00007ffc7f1620d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
> >>> [54758.418323] RAX: ffffffffffffffda RBX: 00005629a3996ac0 RCX: 00007fc0b38b6edd
> >>> [54758.418324] RDX: 0000000000000000 RSI: 00005629a3996260 RDI: 0000000000000003
> >>> [54758.418325] RBP: 00005629a3996260 R08: 0000000000000000 R09: 0000000000000000
> >>> [54758.418326] R10: 0000000000000003 R11: 0000000000000246 R12: 0000000000000000
> >>> [54758.418326] R13: 00005629a3996a80 R14: 0000000000000000 R15: 00005629a3996260
> >>> [54758.418346] BUG: unable to handle kernel paging request at 00000004ee194700
> >>> [54758.418348] #PF error: [WRITE]
> >>> [54758.418349] PGD 0 P4D 0
> >>> [54758.418352] Oops: 0002 [#1] SMP PTI
> >>> [54758.418354] CPU: 6 PID: 26746 Comm: insmod Tainted: G     U     OE     5.0.6-200.fc29.x86_64 #1
> >>> [54758.418355] Hardware name: Dell Inc. XPS 15 9560/05FFDN, BIOS 1.12.1 10/02/2018
> >>> [54758.418360] RIP: 0010:ttm_mem_global_init+0x1fe/0x2b0 [ttm]
> >>> [54758.418361] Code: 00 00 00 48 89 5d 40 48 89 ab a0 00 00 00 e8 79 ba 95 f8 85 c0 0f 85 a2 00 00 00 8b 83 90 00 00 00 8d 50 01 89 93 90 00 00 00 <48> 89 ac c3 80 00 00 00 85 d2 0f 85 c6 99 00 00 48 8b 83 98 00 00
> >>> [54758.418363] RSP: 0018:ffffc24e53c1f988 EFLAGS: 00010246
> >>> [54758.418364] RAX: 00000000a5a2f300 RBX: ffffffffc101ae80 RCX: 0000000000000000
> >>> [54758.418366] RDX: 00000000a5a2f301 RSI: 0000000000000000 RDI: ffff9ebe78b67330
> >>> [54758.418367] RBP: ffff9ebe42c79d00 R08: 0000000000000044 R09: 0000000000000228
> >>> [54758.418368] R10: 0000000000000000 R11: ffff9ebb62720bb8 R12: 0000000000000000
> >>> [54758.418370] R13: ffff9ebe27d71640 R14: ffff9eba9f258600 R15: ffffffffc1453880
> >>> [54758.418371] FS:  00007fc0b379c280(0000) GS:ffff9ebeae380000(0000) knlGS:0000000000000000
> >>> [54758.418373] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >>> [54758.418374] CR2: 00000004ee194700 CR3: 000000045069e005 CR4: 00000000003606e0
> >>> [54758.418375] Call Trace:
> >>> [54758.418381]  ttm_bo_device_init+0x198/0x2a0 [ttm]
> >>> [54758.418439]  nouveau_ttm_init+0xbf/0x340 [nouveau]
> >>> [54758.418495]  nouveau_drm_device_init+0x125/0x7d0 [nouveau]
> >>> [54758.418499]  ? pci_bus_read_config_word+0x49/0x70
> >>> [54758.418552]  nouveau_drm_probe+0x26f/0x2c0 [nouveau]
> >>> [54758.418556]  local_pci_probe+0x41/0x90
> >>> [54758.418558]  pci_device_probe+0x118/0x1a0
> >>> [54758.418561]  really_probe+0xf8/0x3b0
> >>> [54758.418563]  driver_probe_device+0xb3/0xf0
> >>> [54758.418565]  __driver_attach+0xdd/0x110
> >>> [54758.418567]  ? driver_probe_device+0xf0/0xf0
> >>> [54758.418570]  bus_for_each_dev+0x77/0xc0
> >>> [54758.418573]  ? klist_add_tail+0x3b/0x60
> >>> [54758.418574]  bus_add_driver+0x152/0x230
> >>> [54758.418576]  ? 0xffffffffc1027000
> >>> [54758.418578]  driver_register+0x6b/0xb0
> >>> [54758.418580]  ? 0xffffffffc1027000
> >>> [54758.418583]  do_one_initcall+0x46/0x1c3
> >>> [54758.418585]  ? _cond_resched+0x15/0x30
> >>> [54758.418587]  ? kmem_cache_alloc_trace+0x154/0x1d0
> >>> [54758.418591]  do_init_module+0x5a/0x210
> >>> [54758.418593]  load_module+0x2096/0x22d0
> >>> [54758.418596]  ? ima_post_read_file+0xf4/0x100
> >>> [54758.418599]  ? __do_sys_finit_module+0xa8/0x110
> >>> [54758.418601]  __do_sys_finit_module+0xa8/0x110
> >>> [54758.418604]  do_syscall_64+0x5b/0x160
> >>> [54758.418607]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> >>> [54758.418609] RIP: 0033:0x7fc0b38b6edd
> >>> [54758.418610] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 7b 7f 0c 00 f7 d8 64 89 01 48
> >>> [54758.418612] RSP: 002b:00007ffc7f1620d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
> >>> [54758.418614] RAX: ffffffffffffffda RBX: 00005629a3996ac0 RCX: 00007fc0b38b6edd
> >>> [54758.418615] RDX: 0000000000000000 RSI: 00005629a3996260 RDI: 0000000000000003
> >>> [54758.418616] RBP: 00005629a3996260 R08: 0000000000000000 R09: 0000000000000000
> >>> [54758.418618] R10: 0000000000000003 R11: 0000000000000246 R12: 0000000000000000
> >>> [54758.418619] R13: 00005629a3996a80 R14: 0000000000000000 R15: 00005629a3996260
> >>> [54758.418620] Modules linked in: nouveau(OE+) acpi_call(OE) ttm rfcomm ccm xt_CHECKSUM ipt_MASQUERADE tun bridge stp llc devlink ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ebtable_nat ip6table_nat nf_nat_ipv6 ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat_ipv4 nf_nat iptable_mangle iptable_raw iptable_security nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c ip_set nfnetlink ebtable_filter ebtables ip6table_filter ip6_tables cmac bnep sunrpc vfat fat btusb btrtl btbcm btintel uvcvideo videobuf2_vmalloc bluetooth videobuf2_memops videobuf2_v4l2 videobuf2_common videodev media ecdh_generic arc4 snd_hda_codec_hdmi ath10k_pci ath10k_core snd_hda_codec_realtek iTCO_wdt mei_wdt iTCO_vendor_support snd_hda_codec_generic mac80211 intel_rapl snd_hda_intel x86_pkg_temp_thermal intel_powerclamp coretemp snd_hda_codec dell_laptop ledtrig_audio dell_smm_hwmon snd_hda_core kvm_intel snd_hwdep ath snd_seq snd_seq_device intel_cstate cfg80211 snd_pcm dell_wmi intel_uncore
> >>> [54758.418653]  dell_smbios dcdbas intel_rapl_perf snd_timer joydev dell_wmi_descriptor intel_wmi_thunderbolt wmi_bmof snd idma64 soundcore i2c_i801 rfkill rtsx_pci_ms memstick mei_me mei processor_thermal_device intel_lpss_pci intel_soc_dts_iosf intel_pch_thermal intel_lpss int3400_thermal acpi_pad acpi_thermal_rel int3403_thermal intel_hid pcc_cpufreq int340x_thermal_zone sparse_keymap binfmt_misc zram dm_crypt hid_multitouch i915 kvmgt mdev vfio kvm irqbypass i2c_algo_bit crct10dif_pclmul drm_kms_helper rtsx_pci_sdmmc crc32_pclmul mmc_core crc32c_intel mxm_wmi drm nvme ghash_clmulni_intel nvme_core serio_raw rtsx_pci i2c_hid video wmi i2c_dev lz4 lz4_compress [last unloaded: nouveau]
> >>> [54758.418679] CR2: 00000004ee194700
> >>> [54758.418681] ---[ end trace c5175234e6efc034 ]---
> >>> [54758.418686] RIP: 0010:ttm_mem_global_init+0x1fe/0x2b0 [ttm]
> >>> [54758.418688] Code: 00 00 00 48 89 5d 40 48 89 ab a0 00 00 00 e8 79 ba 95 f8 85 c0 0f 85 a2 00 00 00 8b 83 90 00 00 00 8d 50 01 89 93 90 00 00 00 <48> 89 ac c3 80 00 00 00 85 d2 0f 85 c6 99 00 00 48 8b 83 98 00 00
> >>> [54758.418689] RSP: 0018:ffffc24e53c1f988 EFLAGS: 00010246
> >>> [54758.418690] RAX: 00000000a5a2f300 RBX: ffffffffc101ae80 RCX: 0000000000000000
> >>> [54758.418691] RDX: 00000000a5a2f301 RSI: 0000000000000000 RDI: ffff9ebe78b67330
> >>> [54758.418692] RBP: ffff9ebe42c79d00 R08: 0000000000000044 R09: 0000000000000228
> >>> [54758.418694] R10: 0000000000000000 R11: ffff9ebb62720bb8 R12: 0000000000000000
> >>> [54758.418695] R13: ffff9ebe27d71640 R14: ffff9eba9f258600 R15: ffffffffc1453880
> >>> [54758.418696] FS:  00007fc0b379c280(0000) GS:ffff9ebeae380000(0000) knlGS:0000000000000000
> >>> [54758.418698] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >>> [54758.418699] CR2: 00000004ee194700 CR3: 000000045069e005 CR4: 00000000003606e0
> >>>
> >>> Karol Herbst (6):
> >>>     Revert "drm: Remove drm_global.{c,h} v2"
> >>>     Revert "drm/ttm: initialize globals during device init (v2)"
> >>>     Revert "drm/ttm: Fix bo_global and mem_global kfree error"
> >>>     Revert "drm/ttm: use a static ttm_bo_global instance"
> >>>     Revert "drm/ttm: make the device list mutex static"
> >>>     Revert "drm/ttm: use a static ttm_mem_global instance"
> >>>
> >>>    drivers/gpu/drm/Makefile                      |   2 +-
> >>>    drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c       | 103 ++++++++++++-
> >>>    drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h       |   2 +
> >>>    drivers/gpu/drm/ast/ast_drv.h                 |   2 +
> >>>    drivers/gpu/drm/ast/ast_ttm.c                 |  64 ++++++++
> >>>    drivers/gpu/drm/bochs/bochs.h                 |   2 +
> >>>    drivers/gpu/drm/bochs/bochs_mm.c              |  61 ++++++++
> >>>    drivers/gpu/drm/cirrus/cirrus_drv.h           |   2 +
> >>>    drivers/gpu/drm/cirrus/cirrus_ttm.c           |  64 ++++++++
> >>>    drivers/gpu/drm/drm_drv.c                     |   2 +
> >>>    drivers/gpu/drm/drm_global.c                  | 137 ++++++++++++++++++
> >>>    .../gpu/drm/hisilicon/hibmc/hibmc_drm_drv.h   |   2 +
> >>>    drivers/gpu/drm/hisilicon/hibmc/hibmc_ttm.c   |  57 ++++++++
> >>>    drivers/gpu/drm/mgag200/mgag200_drv.h         |   2 +
> >>>    drivers/gpu/drm/mgag200/mgag200_ttm.c         |  64 ++++++++
> >>>    drivers/gpu/drm/nouveau/nouveau_drv.h         |   2 +
> >>>    drivers/gpu/drm/nouveau/nouveau_ttm.c         |  67 +++++++++
> >>>    drivers/gpu/drm/qxl/qxl_drv.h                 |   3 +
> >>>    drivers/gpu/drm/qxl/qxl_ttm.c                 |  61 ++++++++
> >>>    drivers/gpu/drm/radeon/radeon.h               |   3 +
> >>>    drivers/gpu/drm/radeon/radeon_ttm.c           |  65 +++++++++
> >>>    drivers/gpu/drm/ttm/ttm_bo.c                  |  67 +++------
> >>>    drivers/gpu/drm/ttm/ttm_memory.c              |  14 +-
> >>>    drivers/gpu/drm/virtio/virtgpu_drv.h          |   3 +
> >>>    drivers/gpu/drm/virtio/virtgpu_ttm.c          |  62 ++++++++
> >>>    drivers/gpu/drm/vmwgfx/vmwgfx_drv.c           |  15 +-
> >>>    drivers/gpu/drm/vmwgfx/vmwgfx_drv.h           |   6 +-
> >>>    drivers/gpu/drm/vmwgfx/vmwgfx_ttm_glue.c      |  54 +++++++
> >>>    drivers/staging/vboxvideo/vbox_drv.h          |   2 +
> >>>    drivers/staging/vboxvideo/vbox_ttm.c          |  65 ++++++++-
> >>>    include/drm/drmP.h                            |   1 +
> >>>    include/drm/drm_global.h                      |  53 +++++++
> >>>    include/drm/ttm/ttm_bo_driver.h               |  54 ++++++-
> >>>    include/drm/ttm/ttm_memory.h                  |   4 +-
> >>>    34 files changed, 1102 insertions(+), 65 deletions(-)
> >>>    create mode 100644 drivers/gpu/drm/drm_global.c
> >>>    create mode 100644 include/drm/drm_global.h
> >>>
> >>
>
>
Christian König April 16, 2019, 11:06 a.m. UTC | #5
Am 16.04.19 um 12:54 schrieb Karol Herbst:
> On Tue, Apr 16, 2019 at 11:12 AM Koenig, Christian
> <Christian.Koenig@amd.com> wrote:
>> Am 16.04.19 um 11:10 schrieb Karol Herbst:
>>> On Tue, Apr 16, 2019 at 8:38 AM Christian König
>>> <ckoenig.leichtzumerken@gmail.com> wrote:
>>>> Am 16.04.19 um 02:35 schrieb Karol Herbst:
>>>>> Kobjects are supposed to be dynamically allocated, but with recent changes
>>>>> this rule was violated. Reverting those commits fixes crashes when a drm
>>>>> driver using TTM gets loaded again.
>>>>>
>>>>> The object in question is "ttm_mem_glob" declared inside
>>>>> "include/drm/ttm/ttm_memory.h" and instatiated inside
>>>>> "drivers/gpu/drm/ttm/ttm_memory.c".
>>>>>
>>>>> from "Documentation/kobject.txt":
>>>>> "Because kobjects are dynamic, they must not be declared statically or on
>>>>> the stack, but instead, always allocated dynamically.  Future versions of
>>>>> the kernel will contain a run-time check for kobjects that are created
>>>>> statically and will warn the developer of this improper usage."
>>>>>
>>>>> Unloading ttm before reloading the driver workarounds that crash, because
>>>>> the memory backing the kobject member "kobj" is cleaned up. The kobject_del
>>>>> and kobject_put function never free or clean up the kobject object leaving
>>>>> it in an undefined state.
>>>>>
>>>>> I reverted a few more commits to make it less painful for me to rever this
>>>>> rather big change.
>>>> Well, NAK. By reverting those change you also re-introduced the problems
>>>> we originally fixed with those patches.
>>>>
>>>> Please work on a proper fix instead,
>>>> Christian.
>>> And which problem was that besides duplicated code (or maybe even a
>>> bit of memory consumption if multiple ttm driver were used)? If I had
>>> to choose between duplicated code and a crash, I choose the former.
>>>
>>> Maybe I missed the real reason why those changes are made, but the
>>> commit messages don't really seem to tell me.
>> The old implementation crashed because different drivers tried to
>> allocate the same kobj.
>>
>> Crashing in one way is not better than crashing in another way.
>>
>> Christian.
>>
> how can that old crash be triggered? By loading two ttm based drivers
> at the same time or by other means?

Yes, exactly. Even worse it could be triggered by one driver 
instantiating multiple times at the same time, e.g two AMD GPUs in the 
same system.

On the other hand I completely agree that using kobj static is 
completely nuts. The problem is that using a kobj was the wrong approach 
in the first place.

In other words when you have something which controls a global behavior 
of a module, what do you use? Right, a module parameter.

Point is that we can't get away from those kobj without breaking the 
uapi, so that is something which can't be done easily.

Regards,
Christian.
Karol Herbst April 16, 2019, 11:20 a.m. UTC | #6
On Tue, Apr 16, 2019 at 1:07 PM Koenig, Christian
<Christian.Koenig@amd.com> wrote:
>
> Am 16.04.19 um 12:54 schrieb Karol Herbst:
> > On Tue, Apr 16, 2019 at 11:12 AM Koenig, Christian
> > <Christian.Koenig@amd.com> wrote:
> >> Am 16.04.19 um 11:10 schrieb Karol Herbst:
> >>> On Tue, Apr 16, 2019 at 8:38 AM Christian König
> >>> <ckoenig.leichtzumerken@gmail.com> wrote:
> >>>> Am 16.04.19 um 02:35 schrieb Karol Herbst:
> >>>>> Kobjects are supposed to be dynamically allocated, but with recent changes
> >>>>> this rule was violated. Reverting those commits fixes crashes when a drm
> >>>>> driver using TTM gets loaded again.
> >>>>>
> >>>>> The object in question is "ttm_mem_glob" declared inside
> >>>>> "include/drm/ttm/ttm_memory.h" and instatiated inside
> >>>>> "drivers/gpu/drm/ttm/ttm_memory.c".
> >>>>>
> >>>>> from "Documentation/kobject.txt":
> >>>>> "Because kobjects are dynamic, they must not be declared statically or on
> >>>>> the stack, but instead, always allocated dynamically.  Future versions of
> >>>>> the kernel will contain a run-time check for kobjects that are created
> >>>>> statically and will warn the developer of this improper usage."
> >>>>>
> >>>>> Unloading ttm before reloading the driver workarounds that crash, because
> >>>>> the memory backing the kobject member "kobj" is cleaned up. The kobject_del
> >>>>> and kobject_put function never free or clean up the kobject object leaving
> >>>>> it in an undefined state.
> >>>>>
> >>>>> I reverted a few more commits to make it less painful for me to rever this
> >>>>> rather big change.
> >>>> Well, NAK. By reverting those change you also re-introduced the problems
> >>>> we originally fixed with those patches.
> >>>>
> >>>> Please work on a proper fix instead,
> >>>> Christian.
> >>> And which problem was that besides duplicated code (or maybe even a
> >>> bit of memory consumption if multiple ttm driver were used)? If I had
> >>> to choose between duplicated code and a crash, I choose the former.
> >>>
> >>> Maybe I missed the real reason why those changes are made, but the
> >>> commit messages don't really seem to tell me.
> >> The old implementation crashed because different drivers tried to
> >> allocate the same kobj.
> >>
> >> Crashing in one way is not better than crashing in another way.
> >>
> >> Christian.
> >>
> > how can that old crash be triggered? By loading two ttm based drivers
> > at the same time or by other means?
>
> Yes, exactly. Even worse it could be triggered by one driver
> instantiating multiple times at the same time, e.g two AMD GPUs in the
> same system.
>
> On the other hand I completely agree that using kobj static is
> completely nuts. The problem is that using a kobj was the wrong approach
> in the first place.
>
> In other words when you have something which controls a global behavior
> of a module, what do you use? Right, a module parameter.
>
> Point is that we can't get away from those kobj without breaking the
> uapi, so that is something which can't be done easily.
>
> Regards,
> Christian.

okay, thanks for the explanation!

Currently I will be testing your first patch you sent out on top of
5.0.7, so that we could propose that for stable. The second didn't
apply, but it didn't look relevant for the actual fix. Will reply to
the patch with my results then.

Thanks
Christian König April 16, 2019, 11:25 a.m. UTC | #7
Am 16.04.19 um 13:20 schrieb Karol Herbst:
> On Tue, Apr 16, 2019 at 1:07 PM Koenig, Christian
> <Christian.Koenig@amd.com> wrote:
>> Am 16.04.19 um 12:54 schrieb Karol Herbst:
>>> On Tue, Apr 16, 2019 at 11:12 AM Koenig, Christian
>>> <Christian.Koenig@amd.com> wrote:
>>>> Am 16.04.19 um 11:10 schrieb Karol Herbst:
>>>>> On Tue, Apr 16, 2019 at 8:38 AM Christian König
>>>>> <ckoenig.leichtzumerken@gmail.com> wrote:
>>>>>> Am 16.04.19 um 02:35 schrieb Karol Herbst:
>>>>>>> Kobjects are supposed to be dynamically allocated, but with recent changes
>>>>>>> this rule was violated. Reverting those commits fixes crashes when a drm
>>>>>>> driver using TTM gets loaded again.
>>>>>>>
>>>>>>> The object in question is "ttm_mem_glob" declared inside
>>>>>>> "include/drm/ttm/ttm_memory.h" and instatiated inside
>>>>>>> "drivers/gpu/drm/ttm/ttm_memory.c".
>>>>>>>
>>>>>>> from "Documentation/kobject.txt":
>>>>>>> "Because kobjects are dynamic, they must not be declared statically or on
>>>>>>> the stack, but instead, always allocated dynamically.  Future versions of
>>>>>>> the kernel will contain a run-time check for kobjects that are created
>>>>>>> statically and will warn the developer of this improper usage."
>>>>>>>
>>>>>>> Unloading ttm before reloading the driver workarounds that crash, because
>>>>>>> the memory backing the kobject member "kobj" is cleaned up. The kobject_del
>>>>>>> and kobject_put function never free or clean up the kobject object leaving
>>>>>>> it in an undefined state.
>>>>>>>
>>>>>>> I reverted a few more commits to make it less painful for me to rever this
>>>>>>> rather big change.
>>>>>> Well, NAK. By reverting those change you also re-introduced the problems
>>>>>> we originally fixed with those patches.
>>>>>>
>>>>>> Please work on a proper fix instead,
>>>>>> Christian.
>>>>> And which problem was that besides duplicated code (or maybe even a
>>>>> bit of memory consumption if multiple ttm driver were used)? If I had
>>>>> to choose between duplicated code and a crash, I choose the former.
>>>>>
>>>>> Maybe I missed the real reason why those changes are made, but the
>>>>> commit messages don't really seem to tell me.
>>>> The old implementation crashed because different drivers tried to
>>>> allocate the same kobj.
>>>>
>>>> Crashing in one way is not better than crashing in another way.
>>>>
>>>> Christian.
>>>>
>>> how can that old crash be triggered? By loading two ttm based drivers
>>> at the same time or by other means?
>> Yes, exactly. Even worse it could be triggered by one driver
>> instantiating multiple times at the same time, e.g two AMD GPUs in the
>> same system.
>>
>> On the other hand I completely agree that using kobj static is
>> completely nuts. The problem is that using a kobj was the wrong approach
>> in the first place.
>>
>> In other words when you have something which controls a global behavior
>> of a module, what do you use? Right, a module parameter.
>>
>> Point is that we can't get away from those kobj without breaking the
>> uapi, so that is something which can't be done easily.
>>
>> Regards,
>> Christian.
> okay, thanks for the explanation!
>
> Currently I will be testing your first patch you sent out on top of
> 5.0.7, so that we could propose that for stable. The second didn't
> apply, but it didn't look relevant for the actual fix. Will reply to
> the patch with my results then.

The second is just a further clean, please ignore that one for backporting.

Let me know if the first one works on 5.0.7, if yes I'm going to add a 
CC stable tag before pushing.

Thanks,
Christian.

>
> Thanks
> _______________________________________________
> dri-devel mailing list
> dri-devel@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/dri-devel
Daniel Vetter April 16, 2019, 12:18 p.m. UTC | #8
On Tue, Apr 16, 2019 at 11:06:54AM +0000, Koenig, Christian wrote:
> Am 16.04.19 um 12:54 schrieb Karol Herbst:
> > On Tue, Apr 16, 2019 at 11:12 AM Koenig, Christian
> > <Christian.Koenig@amd.com> wrote:
> >> Am 16.04.19 um 11:10 schrieb Karol Herbst:
> >>> On Tue, Apr 16, 2019 at 8:38 AM Christian König
> >>> <ckoenig.leichtzumerken@gmail.com> wrote:
> >>>> Am 16.04.19 um 02:35 schrieb Karol Herbst:
> >>>>> Kobjects are supposed to be dynamically allocated, but with recent changes
> >>>>> this rule was violated. Reverting those commits fixes crashes when a drm
> >>>>> driver using TTM gets loaded again.
> >>>>>
> >>>>> The object in question is "ttm_mem_glob" declared inside
> >>>>> "include/drm/ttm/ttm_memory.h" and instatiated inside
> >>>>> "drivers/gpu/drm/ttm/ttm_memory.c".
> >>>>>
> >>>>> from "Documentation/kobject.txt":
> >>>>> "Because kobjects are dynamic, they must not be declared statically or on
> >>>>> the stack, but instead, always allocated dynamically.  Future versions of
> >>>>> the kernel will contain a run-time check for kobjects that are created
> >>>>> statically and will warn the developer of this improper usage."
> >>>>>
> >>>>> Unloading ttm before reloading the driver workarounds that crash, because
> >>>>> the memory backing the kobject member "kobj" is cleaned up. The kobject_del
> >>>>> and kobject_put function never free or clean up the kobject object leaving
> >>>>> it in an undefined state.
> >>>>>
> >>>>> I reverted a few more commits to make it less painful for me to rever this
> >>>>> rather big change.
> >>>> Well, NAK. By reverting those change you also re-introduced the problems
> >>>> we originally fixed with those patches.
> >>>>
> >>>> Please work on a proper fix instead,
> >>>> Christian.
> >>> And which problem was that besides duplicated code (or maybe even a
> >>> bit of memory consumption if multiple ttm driver were used)? If I had
> >>> to choose between duplicated code and a crash, I choose the former.
> >>>
> >>> Maybe I missed the real reason why those changes are made, but the
> >>> commit messages don't really seem to tell me.
> >> The old implementation crashed because different drivers tried to
> >> allocate the same kobj.
> >>
> >> Crashing in one way is not better than crashing in another way.
> >>
> >> Christian.
> >>
> > how can that old crash be triggered? By loading two ttm based drivers
> > at the same time or by other means?
> 
> Yes, exactly. Even worse it could be triggered by one driver 
> instantiating multiple times at the same time, e.g two AMD GPUs in the 
> same system.
> 
> On the other hand I completely agree that using kobj static is 
> completely nuts. The problem is that using a kobj was the wrong approach 
> in the first place.
> 
> In other words when you have something which controls a global behavior 
> of a module, what do you use? Right, a module parameter.
> 
> Point is that we can't get away from those kobj without breaking the 
> uapi, so that is something which can't be done easily.

Randome idea: Push the kobj setup into drm.ko (and shrugg it off as
another lesson in how maybe uapi shouldn't have been designed, but hey not
our worst mistake by far). I think that'd be totally ok.
-Daniel
Christian König April 16, 2019, 12:29 p.m. UTC | #9
Am 16.04.19 um 14:18 schrieb Daniel Vetter:
> On Tue, Apr 16, 2019 at 11:06:54AM +0000, Koenig, Christian wrote:
>> Am 16.04.19 um 12:54 schrieb Karol Herbst:
>>> On Tue, Apr 16, 2019 at 11:12 AM Koenig, Christian
>>> <Christian.Koenig@amd.com> wrote:
>>>> Am 16.04.19 um 11:10 schrieb Karol Herbst:
>>>>> On Tue, Apr 16, 2019 at 8:38 AM Christian König
>>>>> <ckoenig.leichtzumerken@gmail.com> wrote:
>>>>>> Am 16.04.19 um 02:35 schrieb Karol Herbst:
>>>>>>> Kobjects are supposed to be dynamically allocated, but with recent changes
>>>>>>> this rule was violated. Reverting those commits fixes crashes when a drm
>>>>>>> driver using TTM gets loaded again.
>>>>>>>
>>>>>>> The object in question is "ttm_mem_glob" declared inside
>>>>>>> "include/drm/ttm/ttm_memory.h" and instatiated inside
>>>>>>> "drivers/gpu/drm/ttm/ttm_memory.c".
>>>>>>>
>>>>>>> from "Documentation/kobject.txt":
>>>>>>> "Because kobjects are dynamic, they must not be declared statically or on
>>>>>>> the stack, but instead, always allocated dynamically.  Future versions of
>>>>>>> the kernel will contain a run-time check for kobjects that are created
>>>>>>> statically and will warn the developer of this improper usage."
>>>>>>>
>>>>>>> Unloading ttm before reloading the driver workarounds that crash, because
>>>>>>> the memory backing the kobject member "kobj" is cleaned up. The kobject_del
>>>>>>> and kobject_put function never free or clean up the kobject object leaving
>>>>>>> it in an undefined state.
>>>>>>>
>>>>>>> I reverted a few more commits to make it less painful for me to rever this
>>>>>>> rather big change.
>>>>>> Well, NAK. By reverting those change you also re-introduced the problems
>>>>>> we originally fixed with those patches.
>>>>>>
>>>>>> Please work on a proper fix instead,
>>>>>> Christian.
>>>>> And which problem was that besides duplicated code (or maybe even a
>>>>> bit of memory consumption if multiple ttm driver were used)? If I had
>>>>> to choose between duplicated code and a crash, I choose the former.
>>>>>
>>>>> Maybe I missed the real reason why those changes are made, but the
>>>>> commit messages don't really seem to tell me.
>>>> The old implementation crashed because different drivers tried to
>>>> allocate the same kobj.
>>>>
>>>> Crashing in one way is not better than crashing in another way.
>>>>
>>>> Christian.
>>>>
>>> how can that old crash be triggered? By loading two ttm based drivers
>>> at the same time or by other means?
>> Yes, exactly. Even worse it could be triggered by one driver
>> instantiating multiple times at the same time, e.g two AMD GPUs in the
>> same system.
>>
>> On the other hand I completely agree that using kobj static is
>> completely nuts. The problem is that using a kobj was the wrong approach
>> in the first place.
>>
>> In other words when you have something which controls a global behavior
>> of a module, what do you use? Right, a module parameter.
>>
>> Point is that we can't get away from those kobj without breaking the
>> uapi, so that is something which can't be done easily.
> Randome idea: Push the kobj setup into drm.ko (and shrugg it off as
> another lesson in how maybe uapi shouldn't have been designed, but hey not
> our worst mistake by far). I think that'd be totally ok.

Yeah, thought about that as well.

But I would rather re-design this from the scratch instead of just 
moving it over.

And yes I agree with a bit of luck that UAPI is actually not used at 
all, so we could remove it sooner or later.

Regards,
Christian.

> -Daniel
Eric Anholt April 16, 2019, 11:09 p.m. UTC | #10
Christian König <ckoenig.leichtzumerken@gmail.com> writes:

> Am 16.04.19 um 02:35 schrieb Karol Herbst:
>> Kobjects are supposed to be dynamically allocated, but with recent changes
>> this rule was violated. Reverting those commits fixes crashes when a drm
>> driver using TTM gets loaded again.
>>
>> The object in question is "ttm_mem_glob" declared inside
>> "include/drm/ttm/ttm_memory.h" and instatiated inside
>> "drivers/gpu/drm/ttm/ttm_memory.c".
>>
>> from "Documentation/kobject.txt":
>> "Because kobjects are dynamic, they must not be declared statically or on
>> the stack, but instead, always allocated dynamically.  Future versions of
>> the kernel will contain a run-time check for kobjects that are created
>> statically and will warn the developer of this improper usage."
>>
>> Unloading ttm before reloading the driver workarounds that crash, because
>> the memory backing the kobject member "kobj" is cleaned up. The kobject_del
>> and kobject_put function never free or clean up the kobject object leaving
>> it in an undefined state.
>>
>> I reverted a few more commits to make it less painful for me to rever this
>> rather big change.
>
> Well, NAK. By reverting those change you also re-introduced the problems 
> we originally fixed with those patches.
>
> Please work on a proper fix instead,

That's not Karol's responsibility, that's yours as the author.  I would
like to remind about Linux's regressions policy, quoting from
Documentation/process/4.Coding.rst:

"One final hazard worth mentioning is this: it can be tempting to make a
change (which may bring big improvements) which causes something to break
for existing users.  This kind of change is called a "regression," and
regressions have become most unwelcome in the mainline kernel.  With few
exceptions, changes which cause regressions will be backed out if the
regression cannot be fixed in a timely manner.  Far better to avoid the
regression in the first place.

It is often argued that a regression can be justified if it causes things
to work for more people than it creates problems for.  Why not make a
change if it brings new functionality to ten systems for each one it
breaks?  The best answer to this question was expressed by Linus in July,
2007:

::

	So we don't fix bugs by introducing new problems.  That way lies
	madness, and nobody ever knows if you actually make any real
	progress at all. Is it two steps forwards, one step back, or one
	step forward and two steps back?"
Karol Herbst April 17, 2019, 12:10 a.m. UTC | #11
On Wed, Apr 17, 2019 at 1:09 AM Eric Anholt <eric@anholt.net> wrote:
>
> Christian König <ckoenig.leichtzumerken@gmail.com> writes:
>
> > Am 16.04.19 um 02:35 schrieb Karol Herbst:
> >> Kobjects are supposed to be dynamically allocated, but with recent changes
> >> this rule was violated. Reverting those commits fixes crashes when a drm
> >> driver using TTM gets loaded again.
> >>
> >> The object in question is "ttm_mem_glob" declared inside
> >> "include/drm/ttm/ttm_memory.h" and instatiated inside
> >> "drivers/gpu/drm/ttm/ttm_memory.c".
> >>
> >> from "Documentation/kobject.txt":
> >> "Because kobjects are dynamic, they must not be declared statically or on
> >> the stack, but instead, always allocated dynamically.  Future versions of
> >> the kernel will contain a run-time check for kobjects that are created
> >> statically and will warn the developer of this improper usage."
> >>
> >> Unloading ttm before reloading the driver workarounds that crash, because
> >> the memory backing the kobject member "kobj" is cleaned up. The kobject_del
> >> and kobject_put function never free or clean up the kobject object leaving
> >> it in an undefined state.
> >>
> >> I reverted a few more commits to make it less painful for me to rever this
> >> rather big change.
> >
> > Well, NAK. By reverting those change you also re-introduced the problems
> > we originally fixed with those patches.
> >
> > Please work on a proper fix instead,
>
> That's not Karol's responsibility, that's yours as the author.  I would
> like to remind about Linux's regressions policy, quoting from
> Documentation/process/4.Coding.rst:
>
> "One final hazard worth mentioning is this: it can be tempting to make a
> change (which may bring big improvements) which causes something to break
> for existing users.  This kind of change is called a "regression," and
> regressions have become most unwelcome in the mainline kernel.  With few
> exceptions, changes which cause regressions will be backed out if the
> regression cannot be fixed in a timely manner.  Far better to avoid the
> regression in the first place.
>
> It is often argued that a regression can be justified if it causes things
> to work for more people than it creates problems for.  Why not make a
> change if it brings new functionality to ten systems for each one it
> breaks?  The best answer to this question was expressed by Linus in July,
> 2007:
>
> ::
>
>         So we don't fix bugs by introducing new problems.  That way lies
>         madness, and nobody ever knows if you actually make any real
>         progress at all. Is it two steps forwards, one step back, or one
>         step forward and two steps back?"

he already wrote a fix. Search for "[PATCH 1/2] drm/ttm: fix re-init
of global structures"