Message ID | 20190416003523.5069-1-kherbst@redhat.com (mailing list archive) |
---|---|
Headers | show |
Series | Fix crash after reloading a driver using ttm | expand |
Am 16.04.19 um 02:35 schrieb Karol Herbst: > Kobjects are supposed to be dynamically allocated, but with recent changes > this rule was violated. Reverting those commits fixes crashes when a drm > driver using TTM gets loaded again. > > The object in question is "ttm_mem_glob" declared inside > "include/drm/ttm/ttm_memory.h" and instatiated inside > "drivers/gpu/drm/ttm/ttm_memory.c". > > from "Documentation/kobject.txt": > "Because kobjects are dynamic, they must not be declared statically or on > the stack, but instead, always allocated dynamically. Future versions of > the kernel will contain a run-time check for kobjects that are created > statically and will warn the developer of this improper usage." > > Unloading ttm before reloading the driver workarounds that crash, because > the memory backing the kobject member "kobj" is cleaned up. The kobject_del > and kobject_put function never free or clean up the kobject object leaving > it in an undefined state. > > I reverted a few more commits to make it less painful for me to rever this > rather big change. Well, NAK. By reverting those change you also re-introduced the problems we originally fixed with those patches. Please work on a proper fix instead, Christian. > > dmesg output: > [54758.418036] kobject (00000000687a067d): tried to init an initialized object, something is seriously wrong. > [54758.418040] CPU: 6 PID: 26746 Comm: insmod Tainted: G U OE 5.0.6-200.fc29.x86_64 #1 > [54758.418041] Hardware name: Dell Inc. XPS 15 9560/05FFDN, BIOS 1.12.1 10/02/2018 > [54758.418041] Call Trace: > [54758.418049] dump_stack+0x5c/0x80 > [54758.418054] kobject_init.cold.9+0x31/0x3f > [54758.418057] kobject_init_and_add+0x35/0xa0 > [54758.418063] ttm_mem_global_init+0x8f/0x2b0 [ttm] > [54758.418067] ? __debugfs_create_file+0xe1/0x110 > [54758.418071] ttm_bo_device_init+0x198/0x2a0 [ttm] > [54758.418144] nouveau_ttm_init+0xbf/0x340 [nouveau] > [54758.418206] nouveau_drm_device_init+0x125/0x7d0 [nouveau] > [54758.418210] ? pci_bus_read_config_word+0x49/0x70 > [54758.418266] nouveau_drm_probe+0x26f/0x2c0 [nouveau] > [54758.418270] local_pci_probe+0x41/0x90 > [54758.418272] pci_device_probe+0x118/0x1a0 > [54758.418275] really_probe+0xf8/0x3b0 > [54758.418277] driver_probe_device+0xb3/0xf0 > [54758.418278] __driver_attach+0xdd/0x110 > [54758.418280] ? driver_probe_device+0xf0/0xf0 > [54758.418282] bus_for_each_dev+0x77/0xc0 > [54758.418285] ? klist_add_tail+0x3b/0x60 > [54758.418287] bus_add_driver+0x152/0x230 > [54758.418288] ? 0xffffffffc1027000 > [54758.418290] driver_register+0x6b/0xb0 > [54758.418291] ? 0xffffffffc1027000 > [54758.418294] do_one_initcall+0x46/0x1c3 > [54758.418296] ? _cond_resched+0x15/0x30 > [54758.418299] ? kmem_cache_alloc_trace+0x154/0x1d0 > [54758.418302] do_init_module+0x5a/0x210 > [54758.418304] load_module+0x2096/0x22d0 > [54758.418308] ? ima_post_read_file+0xf4/0x100 > [54758.418310] ? __do_sys_finit_module+0xa8/0x110 > [54758.418312] __do_sys_finit_module+0xa8/0x110 > [54758.418315] do_syscall_64+0x5b/0x160 > [54758.418317] entry_SYSCALL_64_after_hwframe+0x44/0xa9 > [54758.418319] RIP: 0033:0x7fc0b38b6edd > [54758.418321] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 7b 7f 0c 00 f7 d8 64 89 01 48 > [54758.418322] RSP: 002b:00007ffc7f1620d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000139 > [54758.418323] RAX: ffffffffffffffda RBX: 00005629a3996ac0 RCX: 00007fc0b38b6edd > [54758.418324] RDX: 0000000000000000 RSI: 00005629a3996260 RDI: 0000000000000003 > [54758.418325] RBP: 00005629a3996260 R08: 0000000000000000 R09: 0000000000000000 > [54758.418326] R10: 0000000000000003 R11: 0000000000000246 R12: 0000000000000000 > [54758.418326] R13: 00005629a3996a80 R14: 0000000000000000 R15: 00005629a3996260 > [54758.418346] BUG: unable to handle kernel paging request at 00000004ee194700 > [54758.418348] #PF error: [WRITE] > [54758.418349] PGD 0 P4D 0 > [54758.418352] Oops: 0002 [#1] SMP PTI > [54758.418354] CPU: 6 PID: 26746 Comm: insmod Tainted: G U OE 5.0.6-200.fc29.x86_64 #1 > [54758.418355] Hardware name: Dell Inc. XPS 15 9560/05FFDN, BIOS 1.12.1 10/02/2018 > [54758.418360] RIP: 0010:ttm_mem_global_init+0x1fe/0x2b0 [ttm] > [54758.418361] Code: 00 00 00 48 89 5d 40 48 89 ab a0 00 00 00 e8 79 ba 95 f8 85 c0 0f 85 a2 00 00 00 8b 83 90 00 00 00 8d 50 01 89 93 90 00 00 00 <48> 89 ac c3 80 00 00 00 85 d2 0f 85 c6 99 00 00 48 8b 83 98 00 00 > [54758.418363] RSP: 0018:ffffc24e53c1f988 EFLAGS: 00010246 > [54758.418364] RAX: 00000000a5a2f300 RBX: ffffffffc101ae80 RCX: 0000000000000000 > [54758.418366] RDX: 00000000a5a2f301 RSI: 0000000000000000 RDI: ffff9ebe78b67330 > [54758.418367] RBP: ffff9ebe42c79d00 R08: 0000000000000044 R09: 0000000000000228 > [54758.418368] R10: 0000000000000000 R11: ffff9ebb62720bb8 R12: 0000000000000000 > [54758.418370] R13: ffff9ebe27d71640 R14: ffff9eba9f258600 R15: ffffffffc1453880 > [54758.418371] FS: 00007fc0b379c280(0000) GS:ffff9ebeae380000(0000) knlGS:0000000000000000 > [54758.418373] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [54758.418374] CR2: 00000004ee194700 CR3: 000000045069e005 CR4: 00000000003606e0 > [54758.418375] Call Trace: > [54758.418381] ttm_bo_device_init+0x198/0x2a0 [ttm] > [54758.418439] nouveau_ttm_init+0xbf/0x340 [nouveau] > [54758.418495] nouveau_drm_device_init+0x125/0x7d0 [nouveau] > [54758.418499] ? pci_bus_read_config_word+0x49/0x70 > [54758.418552] nouveau_drm_probe+0x26f/0x2c0 [nouveau] > [54758.418556] local_pci_probe+0x41/0x90 > [54758.418558] pci_device_probe+0x118/0x1a0 > [54758.418561] really_probe+0xf8/0x3b0 > [54758.418563] driver_probe_device+0xb3/0xf0 > [54758.418565] __driver_attach+0xdd/0x110 > [54758.418567] ? driver_probe_device+0xf0/0xf0 > [54758.418570] bus_for_each_dev+0x77/0xc0 > [54758.418573] ? klist_add_tail+0x3b/0x60 > [54758.418574] bus_add_driver+0x152/0x230 > [54758.418576] ? 0xffffffffc1027000 > [54758.418578] driver_register+0x6b/0xb0 > [54758.418580] ? 0xffffffffc1027000 > [54758.418583] do_one_initcall+0x46/0x1c3 > [54758.418585] ? _cond_resched+0x15/0x30 > [54758.418587] ? kmem_cache_alloc_trace+0x154/0x1d0 > [54758.418591] do_init_module+0x5a/0x210 > [54758.418593] load_module+0x2096/0x22d0 > [54758.418596] ? ima_post_read_file+0xf4/0x100 > [54758.418599] ? __do_sys_finit_module+0xa8/0x110 > [54758.418601] __do_sys_finit_module+0xa8/0x110 > [54758.418604] do_syscall_64+0x5b/0x160 > [54758.418607] entry_SYSCALL_64_after_hwframe+0x44/0xa9 > [54758.418609] RIP: 0033:0x7fc0b38b6edd > [54758.418610] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 7b 7f 0c 00 f7 d8 64 89 01 48 > [54758.418612] RSP: 002b:00007ffc7f1620d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000139 > [54758.418614] RAX: ffffffffffffffda RBX: 00005629a3996ac0 RCX: 00007fc0b38b6edd > [54758.418615] RDX: 0000000000000000 RSI: 00005629a3996260 RDI: 0000000000000003 > [54758.418616] RBP: 00005629a3996260 R08: 0000000000000000 R09: 0000000000000000 > [54758.418618] R10: 0000000000000003 R11: 0000000000000246 R12: 0000000000000000 > [54758.418619] R13: 00005629a3996a80 R14: 0000000000000000 R15: 00005629a3996260 > [54758.418620] Modules linked in: nouveau(OE+) acpi_call(OE) ttm rfcomm ccm xt_CHECKSUM ipt_MASQUERADE tun bridge stp llc devlink ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ebtable_nat ip6table_nat nf_nat_ipv6 ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat_ipv4 nf_nat iptable_mangle iptable_raw iptable_security nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c ip_set nfnetlink ebtable_filter ebtables ip6table_filter ip6_tables cmac bnep sunrpc vfat fat btusb btrtl btbcm btintel uvcvideo videobuf2_vmalloc bluetooth videobuf2_memops videobuf2_v4l2 videobuf2_common videodev media ecdh_generic arc4 snd_hda_codec_hdmi ath10k_pci ath10k_core snd_hda_codec_realtek iTCO_wdt mei_wdt iTCO_vendor_support snd_hda_codec_generic mac80211 intel_rapl snd_hda_intel x86_pkg_temp_thermal intel_powerclamp coretemp snd_hda_codec dell_laptop ledtrig_audio dell_smm_hwmon snd_hda_core kvm_intel snd_hwdep ath snd_seq snd_seq_device intel_cstate cfg80211 snd_pcm dell_wmi intel_uncore > [54758.418653] dell_smbios dcdbas intel_rapl_perf snd_timer joydev dell_wmi_descriptor intel_wmi_thunderbolt wmi_bmof snd idma64 soundcore i2c_i801 rfkill rtsx_pci_ms memstick mei_me mei processor_thermal_device intel_lpss_pci intel_soc_dts_iosf intel_pch_thermal intel_lpss int3400_thermal acpi_pad acpi_thermal_rel int3403_thermal intel_hid pcc_cpufreq int340x_thermal_zone sparse_keymap binfmt_misc zram dm_crypt hid_multitouch i915 kvmgt mdev vfio kvm irqbypass i2c_algo_bit crct10dif_pclmul drm_kms_helper rtsx_pci_sdmmc crc32_pclmul mmc_core crc32c_intel mxm_wmi drm nvme ghash_clmulni_intel nvme_core serio_raw rtsx_pci i2c_hid video wmi i2c_dev lz4 lz4_compress [last unloaded: nouveau] > [54758.418679] CR2: 00000004ee194700 > [54758.418681] ---[ end trace c5175234e6efc034 ]--- > [54758.418686] RIP: 0010:ttm_mem_global_init+0x1fe/0x2b0 [ttm] > [54758.418688] Code: 00 00 00 48 89 5d 40 48 89 ab a0 00 00 00 e8 79 ba 95 f8 85 c0 0f 85 a2 00 00 00 8b 83 90 00 00 00 8d 50 01 89 93 90 00 00 00 <48> 89 ac c3 80 00 00 00 85 d2 0f 85 c6 99 00 00 48 8b 83 98 00 00 > [54758.418689] RSP: 0018:ffffc24e53c1f988 EFLAGS: 00010246 > [54758.418690] RAX: 00000000a5a2f300 RBX: ffffffffc101ae80 RCX: 0000000000000000 > [54758.418691] RDX: 00000000a5a2f301 RSI: 0000000000000000 RDI: ffff9ebe78b67330 > [54758.418692] RBP: ffff9ebe42c79d00 R08: 0000000000000044 R09: 0000000000000228 > [54758.418694] R10: 0000000000000000 R11: ffff9ebb62720bb8 R12: 0000000000000000 > [54758.418695] R13: ffff9ebe27d71640 R14: ffff9eba9f258600 R15: ffffffffc1453880 > [54758.418696] FS: 00007fc0b379c280(0000) GS:ffff9ebeae380000(0000) knlGS:0000000000000000 > [54758.418698] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [54758.418699] CR2: 00000004ee194700 CR3: 000000045069e005 CR4: 00000000003606e0 > > Karol Herbst (6): > Revert "drm: Remove drm_global.{c,h} v2" > Revert "drm/ttm: initialize globals during device init (v2)" > Revert "drm/ttm: Fix bo_global and mem_global kfree error" > Revert "drm/ttm: use a static ttm_bo_global instance" > Revert "drm/ttm: make the device list mutex static" > Revert "drm/ttm: use a static ttm_mem_global instance" > > drivers/gpu/drm/Makefile | 2 +- > drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 103 ++++++++++++- > drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h | 2 + > drivers/gpu/drm/ast/ast_drv.h | 2 + > drivers/gpu/drm/ast/ast_ttm.c | 64 ++++++++ > drivers/gpu/drm/bochs/bochs.h | 2 + > drivers/gpu/drm/bochs/bochs_mm.c | 61 ++++++++ > drivers/gpu/drm/cirrus/cirrus_drv.h | 2 + > drivers/gpu/drm/cirrus/cirrus_ttm.c | 64 ++++++++ > drivers/gpu/drm/drm_drv.c | 2 + > drivers/gpu/drm/drm_global.c | 137 ++++++++++++++++++ > .../gpu/drm/hisilicon/hibmc/hibmc_drm_drv.h | 2 + > drivers/gpu/drm/hisilicon/hibmc/hibmc_ttm.c | 57 ++++++++ > drivers/gpu/drm/mgag200/mgag200_drv.h | 2 + > drivers/gpu/drm/mgag200/mgag200_ttm.c | 64 ++++++++ > drivers/gpu/drm/nouveau/nouveau_drv.h | 2 + > drivers/gpu/drm/nouveau/nouveau_ttm.c | 67 +++++++++ > drivers/gpu/drm/qxl/qxl_drv.h | 3 + > drivers/gpu/drm/qxl/qxl_ttm.c | 61 ++++++++ > drivers/gpu/drm/radeon/radeon.h | 3 + > drivers/gpu/drm/radeon/radeon_ttm.c | 65 +++++++++ > drivers/gpu/drm/ttm/ttm_bo.c | 67 +++------ > drivers/gpu/drm/ttm/ttm_memory.c | 14 +- > drivers/gpu/drm/virtio/virtgpu_drv.h | 3 + > drivers/gpu/drm/virtio/virtgpu_ttm.c | 62 ++++++++ > drivers/gpu/drm/vmwgfx/vmwgfx_drv.c | 15 +- > drivers/gpu/drm/vmwgfx/vmwgfx_drv.h | 6 +- > drivers/gpu/drm/vmwgfx/vmwgfx_ttm_glue.c | 54 +++++++ > drivers/staging/vboxvideo/vbox_drv.h | 2 + > drivers/staging/vboxvideo/vbox_ttm.c | 65 ++++++++- > include/drm/drmP.h | 1 + > include/drm/drm_global.h | 53 +++++++ > include/drm/ttm/ttm_bo_driver.h | 54 ++++++- > include/drm/ttm/ttm_memory.h | 4 +- > 34 files changed, 1102 insertions(+), 65 deletions(-) > create mode 100644 drivers/gpu/drm/drm_global.c > create mode 100644 include/drm/drm_global.h >
On Tue, Apr 16, 2019 at 8:38 AM Christian König <ckoenig.leichtzumerken@gmail.com> wrote: > > Am 16.04.19 um 02:35 schrieb Karol Herbst: > > Kobjects are supposed to be dynamically allocated, but with recent changes > > this rule was violated. Reverting those commits fixes crashes when a drm > > driver using TTM gets loaded again. > > > > The object in question is "ttm_mem_glob" declared inside > > "include/drm/ttm/ttm_memory.h" and instatiated inside > > "drivers/gpu/drm/ttm/ttm_memory.c". > > > > from "Documentation/kobject.txt": > > "Because kobjects are dynamic, they must not be declared statically or on > > the stack, but instead, always allocated dynamically. Future versions of > > the kernel will contain a run-time check for kobjects that are created > > statically and will warn the developer of this improper usage." > > > > Unloading ttm before reloading the driver workarounds that crash, because > > the memory backing the kobject member "kobj" is cleaned up. The kobject_del > > and kobject_put function never free or clean up the kobject object leaving > > it in an undefined state. > > > > I reverted a few more commits to make it less painful for me to rever this > > rather big change. > > Well, NAK. By reverting those change you also re-introduced the problems > we originally fixed with those patches. > > Please work on a proper fix instead, > Christian. And which problem was that besides duplicated code (or maybe even a bit of memory consumption if multiple ttm driver were used)? If I had to choose between duplicated code and a crash, I choose the former. Maybe I missed the real reason why those changes are made, but the commit messages don't really seem to tell me. > > > > > dmesg output: > > [54758.418036] kobject (00000000687a067d): tried to init an initialized object, something is seriously wrong. > > [54758.418040] CPU: 6 PID: 26746 Comm: insmod Tainted: G U OE 5.0.6-200.fc29.x86_64 #1 > > [54758.418041] Hardware name: Dell Inc. XPS 15 9560/05FFDN, BIOS 1.12.1 10/02/2018 > > [54758.418041] Call Trace: > > [54758.418049] dump_stack+0x5c/0x80 > > [54758.418054] kobject_init.cold.9+0x31/0x3f > > [54758.418057] kobject_init_and_add+0x35/0xa0 > > [54758.418063] ttm_mem_global_init+0x8f/0x2b0 [ttm] > > [54758.418067] ? __debugfs_create_file+0xe1/0x110 > > [54758.418071] ttm_bo_device_init+0x198/0x2a0 [ttm] > > [54758.418144] nouveau_ttm_init+0xbf/0x340 [nouveau] > > [54758.418206] nouveau_drm_device_init+0x125/0x7d0 [nouveau] > > [54758.418210] ? pci_bus_read_config_word+0x49/0x70 > > [54758.418266] nouveau_drm_probe+0x26f/0x2c0 [nouveau] > > [54758.418270] local_pci_probe+0x41/0x90 > > [54758.418272] pci_device_probe+0x118/0x1a0 > > [54758.418275] really_probe+0xf8/0x3b0 > > [54758.418277] driver_probe_device+0xb3/0xf0 > > [54758.418278] __driver_attach+0xdd/0x110 > > [54758.418280] ? driver_probe_device+0xf0/0xf0 > > [54758.418282] bus_for_each_dev+0x77/0xc0 > > [54758.418285] ? klist_add_tail+0x3b/0x60 > > [54758.418287] bus_add_driver+0x152/0x230 > > [54758.418288] ? 0xffffffffc1027000 > > [54758.418290] driver_register+0x6b/0xb0 > > [54758.418291] ? 0xffffffffc1027000 > > [54758.418294] do_one_initcall+0x46/0x1c3 > > [54758.418296] ? _cond_resched+0x15/0x30 > > [54758.418299] ? kmem_cache_alloc_trace+0x154/0x1d0 > > [54758.418302] do_init_module+0x5a/0x210 > > [54758.418304] load_module+0x2096/0x22d0 > > [54758.418308] ? ima_post_read_file+0xf4/0x100 > > [54758.418310] ? __do_sys_finit_module+0xa8/0x110 > > [54758.418312] __do_sys_finit_module+0xa8/0x110 > > [54758.418315] do_syscall_64+0x5b/0x160 > > [54758.418317] entry_SYSCALL_64_after_hwframe+0x44/0xa9 > > [54758.418319] RIP: 0033:0x7fc0b38b6edd > > [54758.418321] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 7b 7f 0c 00 f7 d8 64 89 01 48 > > [54758.418322] RSP: 002b:00007ffc7f1620d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000139 > > [54758.418323] RAX: ffffffffffffffda RBX: 00005629a3996ac0 RCX: 00007fc0b38b6edd > > [54758.418324] RDX: 0000000000000000 RSI: 00005629a3996260 RDI: 0000000000000003 > > [54758.418325] RBP: 00005629a3996260 R08: 0000000000000000 R09: 0000000000000000 > > [54758.418326] R10: 0000000000000003 R11: 0000000000000246 R12: 0000000000000000 > > [54758.418326] R13: 00005629a3996a80 R14: 0000000000000000 R15: 00005629a3996260 > > [54758.418346] BUG: unable to handle kernel paging request at 00000004ee194700 > > [54758.418348] #PF error: [WRITE] > > [54758.418349] PGD 0 P4D 0 > > [54758.418352] Oops: 0002 [#1] SMP PTI > > [54758.418354] CPU: 6 PID: 26746 Comm: insmod Tainted: G U OE 5.0.6-200.fc29.x86_64 #1 > > [54758.418355] Hardware name: Dell Inc. XPS 15 9560/05FFDN, BIOS 1.12.1 10/02/2018 > > [54758.418360] RIP: 0010:ttm_mem_global_init+0x1fe/0x2b0 [ttm] > > [54758.418361] Code: 00 00 00 48 89 5d 40 48 89 ab a0 00 00 00 e8 79 ba 95 f8 85 c0 0f 85 a2 00 00 00 8b 83 90 00 00 00 8d 50 01 89 93 90 00 00 00 <48> 89 ac c3 80 00 00 00 85 d2 0f 85 c6 99 00 00 48 8b 83 98 00 00 > > [54758.418363] RSP: 0018:ffffc24e53c1f988 EFLAGS: 00010246 > > [54758.418364] RAX: 00000000a5a2f300 RBX: ffffffffc101ae80 RCX: 0000000000000000 > > [54758.418366] RDX: 00000000a5a2f301 RSI: 0000000000000000 RDI: ffff9ebe78b67330 > > [54758.418367] RBP: ffff9ebe42c79d00 R08: 0000000000000044 R09: 0000000000000228 > > [54758.418368] R10: 0000000000000000 R11: ffff9ebb62720bb8 R12: 0000000000000000 > > [54758.418370] R13: ffff9ebe27d71640 R14: ffff9eba9f258600 R15: ffffffffc1453880 > > [54758.418371] FS: 00007fc0b379c280(0000) GS:ffff9ebeae380000(0000) knlGS:0000000000000000 > > [54758.418373] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > [54758.418374] CR2: 00000004ee194700 CR3: 000000045069e005 CR4: 00000000003606e0 > > [54758.418375] Call Trace: > > [54758.418381] ttm_bo_device_init+0x198/0x2a0 [ttm] > > [54758.418439] nouveau_ttm_init+0xbf/0x340 [nouveau] > > [54758.418495] nouveau_drm_device_init+0x125/0x7d0 [nouveau] > > [54758.418499] ? pci_bus_read_config_word+0x49/0x70 > > [54758.418552] nouveau_drm_probe+0x26f/0x2c0 [nouveau] > > [54758.418556] local_pci_probe+0x41/0x90 > > [54758.418558] pci_device_probe+0x118/0x1a0 > > [54758.418561] really_probe+0xf8/0x3b0 > > [54758.418563] driver_probe_device+0xb3/0xf0 > > [54758.418565] __driver_attach+0xdd/0x110 > > [54758.418567] ? driver_probe_device+0xf0/0xf0 > > [54758.418570] bus_for_each_dev+0x77/0xc0 > > [54758.418573] ? klist_add_tail+0x3b/0x60 > > [54758.418574] bus_add_driver+0x152/0x230 > > [54758.418576] ? 0xffffffffc1027000 > > [54758.418578] driver_register+0x6b/0xb0 > > [54758.418580] ? 0xffffffffc1027000 > > [54758.418583] do_one_initcall+0x46/0x1c3 > > [54758.418585] ? _cond_resched+0x15/0x30 > > [54758.418587] ? kmem_cache_alloc_trace+0x154/0x1d0 > > [54758.418591] do_init_module+0x5a/0x210 > > [54758.418593] load_module+0x2096/0x22d0 > > [54758.418596] ? ima_post_read_file+0xf4/0x100 > > [54758.418599] ? __do_sys_finit_module+0xa8/0x110 > > [54758.418601] __do_sys_finit_module+0xa8/0x110 > > [54758.418604] do_syscall_64+0x5b/0x160 > > [54758.418607] entry_SYSCALL_64_after_hwframe+0x44/0xa9 > > [54758.418609] RIP: 0033:0x7fc0b38b6edd > > [54758.418610] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 7b 7f 0c 00 f7 d8 64 89 01 48 > > [54758.418612] RSP: 002b:00007ffc7f1620d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000139 > > [54758.418614] RAX: ffffffffffffffda RBX: 00005629a3996ac0 RCX: 00007fc0b38b6edd > > [54758.418615] RDX: 0000000000000000 RSI: 00005629a3996260 RDI: 0000000000000003 > > [54758.418616] RBP: 00005629a3996260 R08: 0000000000000000 R09: 0000000000000000 > > [54758.418618] R10: 0000000000000003 R11: 0000000000000246 R12: 0000000000000000 > > [54758.418619] R13: 00005629a3996a80 R14: 0000000000000000 R15: 00005629a3996260 > > [54758.418620] Modules linked in: nouveau(OE+) acpi_call(OE) ttm rfcomm ccm xt_CHECKSUM ipt_MASQUERADE tun bridge stp llc devlink ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ebtable_nat ip6table_nat nf_nat_ipv6 ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat_ipv4 nf_nat iptable_mangle iptable_raw iptable_security nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c ip_set nfnetlink ebtable_filter ebtables ip6table_filter ip6_tables cmac bnep sunrpc vfat fat btusb btrtl btbcm btintel uvcvideo videobuf2_vmalloc bluetooth videobuf2_memops videobuf2_v4l2 videobuf2_common videodev media ecdh_generic arc4 snd_hda_codec_hdmi ath10k_pci ath10k_core snd_hda_codec_realtek iTCO_wdt mei_wdt iTCO_vendor_support snd_hda_codec_generic mac80211 intel_rapl snd_hda_intel x86_pkg_temp_thermal intel_powerclamp coretemp snd_hda_codec dell_laptop ledtrig_audio dell_smm_hwmon snd_hda_core kvm_intel snd_hwdep ath snd_seq snd_seq_device intel_cstate cfg80211 snd_pcm dell_wmi intel_uncore > > [54758.418653] dell_smbios dcdbas intel_rapl_perf snd_timer joydev dell_wmi_descriptor intel_wmi_thunderbolt wmi_bmof snd idma64 soundcore i2c_i801 rfkill rtsx_pci_ms memstick mei_me mei processor_thermal_device intel_lpss_pci intel_soc_dts_iosf intel_pch_thermal intel_lpss int3400_thermal acpi_pad acpi_thermal_rel int3403_thermal intel_hid pcc_cpufreq int340x_thermal_zone sparse_keymap binfmt_misc zram dm_crypt hid_multitouch i915 kvmgt mdev vfio kvm irqbypass i2c_algo_bit crct10dif_pclmul drm_kms_helper rtsx_pci_sdmmc crc32_pclmul mmc_core crc32c_intel mxm_wmi drm nvme ghash_clmulni_intel nvme_core serio_raw rtsx_pci i2c_hid video wmi i2c_dev lz4 lz4_compress [last unloaded: nouveau] > > [54758.418679] CR2: 00000004ee194700 > > [54758.418681] ---[ end trace c5175234e6efc034 ]--- > > [54758.418686] RIP: 0010:ttm_mem_global_init+0x1fe/0x2b0 [ttm] > > [54758.418688] Code: 00 00 00 48 89 5d 40 48 89 ab a0 00 00 00 e8 79 ba 95 f8 85 c0 0f 85 a2 00 00 00 8b 83 90 00 00 00 8d 50 01 89 93 90 00 00 00 <48> 89 ac c3 80 00 00 00 85 d2 0f 85 c6 99 00 00 48 8b 83 98 00 00 > > [54758.418689] RSP: 0018:ffffc24e53c1f988 EFLAGS: 00010246 > > [54758.418690] RAX: 00000000a5a2f300 RBX: ffffffffc101ae80 RCX: 0000000000000000 > > [54758.418691] RDX: 00000000a5a2f301 RSI: 0000000000000000 RDI: ffff9ebe78b67330 > > [54758.418692] RBP: ffff9ebe42c79d00 R08: 0000000000000044 R09: 0000000000000228 > > [54758.418694] R10: 0000000000000000 R11: ffff9ebb62720bb8 R12: 0000000000000000 > > [54758.418695] R13: ffff9ebe27d71640 R14: ffff9eba9f258600 R15: ffffffffc1453880 > > [54758.418696] FS: 00007fc0b379c280(0000) GS:ffff9ebeae380000(0000) knlGS:0000000000000000 > > [54758.418698] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > [54758.418699] CR2: 00000004ee194700 CR3: 000000045069e005 CR4: 00000000003606e0 > > > > Karol Herbst (6): > > Revert "drm: Remove drm_global.{c,h} v2" > > Revert "drm/ttm: initialize globals during device init (v2)" > > Revert "drm/ttm: Fix bo_global and mem_global kfree error" > > Revert "drm/ttm: use a static ttm_bo_global instance" > > Revert "drm/ttm: make the device list mutex static" > > Revert "drm/ttm: use a static ttm_mem_global instance" > > > > drivers/gpu/drm/Makefile | 2 +- > > drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 103 ++++++++++++- > > drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h | 2 + > > drivers/gpu/drm/ast/ast_drv.h | 2 + > > drivers/gpu/drm/ast/ast_ttm.c | 64 ++++++++ > > drivers/gpu/drm/bochs/bochs.h | 2 + > > drivers/gpu/drm/bochs/bochs_mm.c | 61 ++++++++ > > drivers/gpu/drm/cirrus/cirrus_drv.h | 2 + > > drivers/gpu/drm/cirrus/cirrus_ttm.c | 64 ++++++++ > > drivers/gpu/drm/drm_drv.c | 2 + > > drivers/gpu/drm/drm_global.c | 137 ++++++++++++++++++ > > .../gpu/drm/hisilicon/hibmc/hibmc_drm_drv.h | 2 + > > drivers/gpu/drm/hisilicon/hibmc/hibmc_ttm.c | 57 ++++++++ > > drivers/gpu/drm/mgag200/mgag200_drv.h | 2 + > > drivers/gpu/drm/mgag200/mgag200_ttm.c | 64 ++++++++ > > drivers/gpu/drm/nouveau/nouveau_drv.h | 2 + > > drivers/gpu/drm/nouveau/nouveau_ttm.c | 67 +++++++++ > > drivers/gpu/drm/qxl/qxl_drv.h | 3 + > > drivers/gpu/drm/qxl/qxl_ttm.c | 61 ++++++++ > > drivers/gpu/drm/radeon/radeon.h | 3 + > > drivers/gpu/drm/radeon/radeon_ttm.c | 65 +++++++++ > > drivers/gpu/drm/ttm/ttm_bo.c | 67 +++------ > > drivers/gpu/drm/ttm/ttm_memory.c | 14 +- > > drivers/gpu/drm/virtio/virtgpu_drv.h | 3 + > > drivers/gpu/drm/virtio/virtgpu_ttm.c | 62 ++++++++ > > drivers/gpu/drm/vmwgfx/vmwgfx_drv.c | 15 +- > > drivers/gpu/drm/vmwgfx/vmwgfx_drv.h | 6 +- > > drivers/gpu/drm/vmwgfx/vmwgfx_ttm_glue.c | 54 +++++++ > > drivers/staging/vboxvideo/vbox_drv.h | 2 + > > drivers/staging/vboxvideo/vbox_ttm.c | 65 ++++++++- > > include/drm/drmP.h | 1 + > > include/drm/drm_global.h | 53 +++++++ > > include/drm/ttm/ttm_bo_driver.h | 54 ++++++- > > include/drm/ttm/ttm_memory.h | 4 +- > > 34 files changed, 1102 insertions(+), 65 deletions(-) > > create mode 100644 drivers/gpu/drm/drm_global.c > > create mode 100644 include/drm/drm_global.h > > > >
Am 16.04.19 um 11:10 schrieb Karol Herbst: > On Tue, Apr 16, 2019 at 8:38 AM Christian König > <ckoenig.leichtzumerken@gmail.com> wrote: >> Am 16.04.19 um 02:35 schrieb Karol Herbst: >>> Kobjects are supposed to be dynamically allocated, but with recent changes >>> this rule was violated. Reverting those commits fixes crashes when a drm >>> driver using TTM gets loaded again. >>> >>> The object in question is "ttm_mem_glob" declared inside >>> "include/drm/ttm/ttm_memory.h" and instatiated inside >>> "drivers/gpu/drm/ttm/ttm_memory.c". >>> >>> from "Documentation/kobject.txt": >>> "Because kobjects are dynamic, they must not be declared statically or on >>> the stack, but instead, always allocated dynamically. Future versions of >>> the kernel will contain a run-time check for kobjects that are created >>> statically and will warn the developer of this improper usage." >>> >>> Unloading ttm before reloading the driver workarounds that crash, because >>> the memory backing the kobject member "kobj" is cleaned up. The kobject_del >>> and kobject_put function never free or clean up the kobject object leaving >>> it in an undefined state. >>> >>> I reverted a few more commits to make it less painful for me to rever this >>> rather big change. >> Well, NAK. By reverting those change you also re-introduced the problems >> we originally fixed with those patches. >> >> Please work on a proper fix instead, >> Christian. > And which problem was that besides duplicated code (or maybe even a > bit of memory consumption if multiple ttm driver were used)? If I had > to choose between duplicated code and a crash, I choose the former. > > Maybe I missed the real reason why those changes are made, but the > commit messages don't really seem to tell me. The old implementation crashed because different drivers tried to allocate the same kobj. Crashing in one way is not better than crashing in another way. Christian. > >>> dmesg output: >>> [54758.418036] kobject (00000000687a067d): tried to init an initialized object, something is seriously wrong. >>> [54758.418040] CPU: 6 PID: 26746 Comm: insmod Tainted: G U OE 5.0.6-200.fc29.x86_64 #1 >>> [54758.418041] Hardware name: Dell Inc. XPS 15 9560/05FFDN, BIOS 1.12.1 10/02/2018 >>> [54758.418041] Call Trace: >>> [54758.418049] dump_stack+0x5c/0x80 >>> [54758.418054] kobject_init.cold.9+0x31/0x3f >>> [54758.418057] kobject_init_and_add+0x35/0xa0 >>> [54758.418063] ttm_mem_global_init+0x8f/0x2b0 [ttm] >>> [54758.418067] ? __debugfs_create_file+0xe1/0x110 >>> [54758.418071] ttm_bo_device_init+0x198/0x2a0 [ttm] >>> [54758.418144] nouveau_ttm_init+0xbf/0x340 [nouveau] >>> [54758.418206] nouveau_drm_device_init+0x125/0x7d0 [nouveau] >>> [54758.418210] ? pci_bus_read_config_word+0x49/0x70 >>> [54758.418266] nouveau_drm_probe+0x26f/0x2c0 [nouveau] >>> [54758.418270] local_pci_probe+0x41/0x90 >>> [54758.418272] pci_device_probe+0x118/0x1a0 >>> [54758.418275] really_probe+0xf8/0x3b0 >>> [54758.418277] driver_probe_device+0xb3/0xf0 >>> [54758.418278] __driver_attach+0xdd/0x110 >>> [54758.418280] ? driver_probe_device+0xf0/0xf0 >>> [54758.418282] bus_for_each_dev+0x77/0xc0 >>> [54758.418285] ? klist_add_tail+0x3b/0x60 >>> [54758.418287] bus_add_driver+0x152/0x230 >>> [54758.418288] ? 0xffffffffc1027000 >>> [54758.418290] driver_register+0x6b/0xb0 >>> [54758.418291] ? 0xffffffffc1027000 >>> [54758.418294] do_one_initcall+0x46/0x1c3 >>> [54758.418296] ? _cond_resched+0x15/0x30 >>> [54758.418299] ? kmem_cache_alloc_trace+0x154/0x1d0 >>> [54758.418302] do_init_module+0x5a/0x210 >>> [54758.418304] load_module+0x2096/0x22d0 >>> [54758.418308] ? ima_post_read_file+0xf4/0x100 >>> [54758.418310] ? __do_sys_finit_module+0xa8/0x110 >>> [54758.418312] __do_sys_finit_module+0xa8/0x110 >>> [54758.418315] do_syscall_64+0x5b/0x160 >>> [54758.418317] entry_SYSCALL_64_after_hwframe+0x44/0xa9 >>> [54758.418319] RIP: 0033:0x7fc0b38b6edd >>> [54758.418321] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 7b 7f 0c 00 f7 d8 64 89 01 48 >>> [54758.418322] RSP: 002b:00007ffc7f1620d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000139 >>> [54758.418323] RAX: ffffffffffffffda RBX: 00005629a3996ac0 RCX: 00007fc0b38b6edd >>> [54758.418324] RDX: 0000000000000000 RSI: 00005629a3996260 RDI: 0000000000000003 >>> [54758.418325] RBP: 00005629a3996260 R08: 0000000000000000 R09: 0000000000000000 >>> [54758.418326] R10: 0000000000000003 R11: 0000000000000246 R12: 0000000000000000 >>> [54758.418326] R13: 00005629a3996a80 R14: 0000000000000000 R15: 00005629a3996260 >>> [54758.418346] BUG: unable to handle kernel paging request at 00000004ee194700 >>> [54758.418348] #PF error: [WRITE] >>> [54758.418349] PGD 0 P4D 0 >>> [54758.418352] Oops: 0002 [#1] SMP PTI >>> [54758.418354] CPU: 6 PID: 26746 Comm: insmod Tainted: G U OE 5.0.6-200.fc29.x86_64 #1 >>> [54758.418355] Hardware name: Dell Inc. XPS 15 9560/05FFDN, BIOS 1.12.1 10/02/2018 >>> [54758.418360] RIP: 0010:ttm_mem_global_init+0x1fe/0x2b0 [ttm] >>> [54758.418361] Code: 00 00 00 48 89 5d 40 48 89 ab a0 00 00 00 e8 79 ba 95 f8 85 c0 0f 85 a2 00 00 00 8b 83 90 00 00 00 8d 50 01 89 93 90 00 00 00 <48> 89 ac c3 80 00 00 00 85 d2 0f 85 c6 99 00 00 48 8b 83 98 00 00 >>> [54758.418363] RSP: 0018:ffffc24e53c1f988 EFLAGS: 00010246 >>> [54758.418364] RAX: 00000000a5a2f300 RBX: ffffffffc101ae80 RCX: 0000000000000000 >>> [54758.418366] RDX: 00000000a5a2f301 RSI: 0000000000000000 RDI: ffff9ebe78b67330 >>> [54758.418367] RBP: ffff9ebe42c79d00 R08: 0000000000000044 R09: 0000000000000228 >>> [54758.418368] R10: 0000000000000000 R11: ffff9ebb62720bb8 R12: 0000000000000000 >>> [54758.418370] R13: ffff9ebe27d71640 R14: ffff9eba9f258600 R15: ffffffffc1453880 >>> [54758.418371] FS: 00007fc0b379c280(0000) GS:ffff9ebeae380000(0000) knlGS:0000000000000000 >>> [54758.418373] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 >>> [54758.418374] CR2: 00000004ee194700 CR3: 000000045069e005 CR4: 00000000003606e0 >>> [54758.418375] Call Trace: >>> [54758.418381] ttm_bo_device_init+0x198/0x2a0 [ttm] >>> [54758.418439] nouveau_ttm_init+0xbf/0x340 [nouveau] >>> [54758.418495] nouveau_drm_device_init+0x125/0x7d0 [nouveau] >>> [54758.418499] ? pci_bus_read_config_word+0x49/0x70 >>> [54758.418552] nouveau_drm_probe+0x26f/0x2c0 [nouveau] >>> [54758.418556] local_pci_probe+0x41/0x90 >>> [54758.418558] pci_device_probe+0x118/0x1a0 >>> [54758.418561] really_probe+0xf8/0x3b0 >>> [54758.418563] driver_probe_device+0xb3/0xf0 >>> [54758.418565] __driver_attach+0xdd/0x110 >>> [54758.418567] ? driver_probe_device+0xf0/0xf0 >>> [54758.418570] bus_for_each_dev+0x77/0xc0 >>> [54758.418573] ? klist_add_tail+0x3b/0x60 >>> [54758.418574] bus_add_driver+0x152/0x230 >>> [54758.418576] ? 0xffffffffc1027000 >>> [54758.418578] driver_register+0x6b/0xb0 >>> [54758.418580] ? 0xffffffffc1027000 >>> [54758.418583] do_one_initcall+0x46/0x1c3 >>> [54758.418585] ? _cond_resched+0x15/0x30 >>> [54758.418587] ? kmem_cache_alloc_trace+0x154/0x1d0 >>> [54758.418591] do_init_module+0x5a/0x210 >>> [54758.418593] load_module+0x2096/0x22d0 >>> [54758.418596] ? ima_post_read_file+0xf4/0x100 >>> [54758.418599] ? __do_sys_finit_module+0xa8/0x110 >>> [54758.418601] __do_sys_finit_module+0xa8/0x110 >>> [54758.418604] do_syscall_64+0x5b/0x160 >>> [54758.418607] entry_SYSCALL_64_after_hwframe+0x44/0xa9 >>> [54758.418609] RIP: 0033:0x7fc0b38b6edd >>> [54758.418610] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 7b 7f 0c 00 f7 d8 64 89 01 48 >>> [54758.418612] RSP: 002b:00007ffc7f1620d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000139 >>> [54758.418614] RAX: ffffffffffffffda RBX: 00005629a3996ac0 RCX: 00007fc0b38b6edd >>> [54758.418615] RDX: 0000000000000000 RSI: 00005629a3996260 RDI: 0000000000000003 >>> [54758.418616] RBP: 00005629a3996260 R08: 0000000000000000 R09: 0000000000000000 >>> [54758.418618] R10: 0000000000000003 R11: 0000000000000246 R12: 0000000000000000 >>> [54758.418619] R13: 00005629a3996a80 R14: 0000000000000000 R15: 00005629a3996260 >>> [54758.418620] Modules linked in: nouveau(OE+) acpi_call(OE) ttm rfcomm ccm xt_CHECKSUM ipt_MASQUERADE tun bridge stp llc devlink ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ebtable_nat ip6table_nat nf_nat_ipv6 ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat_ipv4 nf_nat iptable_mangle iptable_raw iptable_security nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c ip_set nfnetlink ebtable_filter ebtables ip6table_filter ip6_tables cmac bnep sunrpc vfat fat btusb btrtl btbcm btintel uvcvideo videobuf2_vmalloc bluetooth videobuf2_memops videobuf2_v4l2 videobuf2_common videodev media ecdh_generic arc4 snd_hda_codec_hdmi ath10k_pci ath10k_core snd_hda_codec_realtek iTCO_wdt mei_wdt iTCO_vendor_support snd_hda_codec_generic mac80211 intel_rapl snd_hda_intel x86_pkg_temp_thermal intel_powerclamp coretemp snd_hda_codec dell_laptop ledtrig_audio dell_smm_hwmon snd_hda_core kvm_intel snd_hwdep ath snd_seq snd_seq_device intel_cstate cfg80211 snd_pcm dell_wmi intel_uncore >>> [54758.418653] dell_smbios dcdbas intel_rapl_perf snd_timer joydev dell_wmi_descriptor intel_wmi_thunderbolt wmi_bmof snd idma64 soundcore i2c_i801 rfkill rtsx_pci_ms memstick mei_me mei processor_thermal_device intel_lpss_pci intel_soc_dts_iosf intel_pch_thermal intel_lpss int3400_thermal acpi_pad acpi_thermal_rel int3403_thermal intel_hid pcc_cpufreq int340x_thermal_zone sparse_keymap binfmt_misc zram dm_crypt hid_multitouch i915 kvmgt mdev vfio kvm irqbypass i2c_algo_bit crct10dif_pclmul drm_kms_helper rtsx_pci_sdmmc crc32_pclmul mmc_core crc32c_intel mxm_wmi drm nvme ghash_clmulni_intel nvme_core serio_raw rtsx_pci i2c_hid video wmi i2c_dev lz4 lz4_compress [last unloaded: nouveau] >>> [54758.418679] CR2: 00000004ee194700 >>> [54758.418681] ---[ end trace c5175234e6efc034 ]--- >>> [54758.418686] RIP: 0010:ttm_mem_global_init+0x1fe/0x2b0 [ttm] >>> [54758.418688] Code: 00 00 00 48 89 5d 40 48 89 ab a0 00 00 00 e8 79 ba 95 f8 85 c0 0f 85 a2 00 00 00 8b 83 90 00 00 00 8d 50 01 89 93 90 00 00 00 <48> 89 ac c3 80 00 00 00 85 d2 0f 85 c6 99 00 00 48 8b 83 98 00 00 >>> [54758.418689] RSP: 0018:ffffc24e53c1f988 EFLAGS: 00010246 >>> [54758.418690] RAX: 00000000a5a2f300 RBX: ffffffffc101ae80 RCX: 0000000000000000 >>> [54758.418691] RDX: 00000000a5a2f301 RSI: 0000000000000000 RDI: ffff9ebe78b67330 >>> [54758.418692] RBP: ffff9ebe42c79d00 R08: 0000000000000044 R09: 0000000000000228 >>> [54758.418694] R10: 0000000000000000 R11: ffff9ebb62720bb8 R12: 0000000000000000 >>> [54758.418695] R13: ffff9ebe27d71640 R14: ffff9eba9f258600 R15: ffffffffc1453880 >>> [54758.418696] FS: 00007fc0b379c280(0000) GS:ffff9ebeae380000(0000) knlGS:0000000000000000 >>> [54758.418698] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 >>> [54758.418699] CR2: 00000004ee194700 CR3: 000000045069e005 CR4: 00000000003606e0 >>> >>> Karol Herbst (6): >>> Revert "drm: Remove drm_global.{c,h} v2" >>> Revert "drm/ttm: initialize globals during device init (v2)" >>> Revert "drm/ttm: Fix bo_global and mem_global kfree error" >>> Revert "drm/ttm: use a static ttm_bo_global instance" >>> Revert "drm/ttm: make the device list mutex static" >>> Revert "drm/ttm: use a static ttm_mem_global instance" >>> >>> drivers/gpu/drm/Makefile | 2 +- >>> drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 103 ++++++++++++- >>> drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h | 2 + >>> drivers/gpu/drm/ast/ast_drv.h | 2 + >>> drivers/gpu/drm/ast/ast_ttm.c | 64 ++++++++ >>> drivers/gpu/drm/bochs/bochs.h | 2 + >>> drivers/gpu/drm/bochs/bochs_mm.c | 61 ++++++++ >>> drivers/gpu/drm/cirrus/cirrus_drv.h | 2 + >>> drivers/gpu/drm/cirrus/cirrus_ttm.c | 64 ++++++++ >>> drivers/gpu/drm/drm_drv.c | 2 + >>> drivers/gpu/drm/drm_global.c | 137 ++++++++++++++++++ >>> .../gpu/drm/hisilicon/hibmc/hibmc_drm_drv.h | 2 + >>> drivers/gpu/drm/hisilicon/hibmc/hibmc_ttm.c | 57 ++++++++ >>> drivers/gpu/drm/mgag200/mgag200_drv.h | 2 + >>> drivers/gpu/drm/mgag200/mgag200_ttm.c | 64 ++++++++ >>> drivers/gpu/drm/nouveau/nouveau_drv.h | 2 + >>> drivers/gpu/drm/nouveau/nouveau_ttm.c | 67 +++++++++ >>> drivers/gpu/drm/qxl/qxl_drv.h | 3 + >>> drivers/gpu/drm/qxl/qxl_ttm.c | 61 ++++++++ >>> drivers/gpu/drm/radeon/radeon.h | 3 + >>> drivers/gpu/drm/radeon/radeon_ttm.c | 65 +++++++++ >>> drivers/gpu/drm/ttm/ttm_bo.c | 67 +++------ >>> drivers/gpu/drm/ttm/ttm_memory.c | 14 +- >>> drivers/gpu/drm/virtio/virtgpu_drv.h | 3 + >>> drivers/gpu/drm/virtio/virtgpu_ttm.c | 62 ++++++++ >>> drivers/gpu/drm/vmwgfx/vmwgfx_drv.c | 15 +- >>> drivers/gpu/drm/vmwgfx/vmwgfx_drv.h | 6 +- >>> drivers/gpu/drm/vmwgfx/vmwgfx_ttm_glue.c | 54 +++++++ >>> drivers/staging/vboxvideo/vbox_drv.h | 2 + >>> drivers/staging/vboxvideo/vbox_ttm.c | 65 ++++++++- >>> include/drm/drmP.h | 1 + >>> include/drm/drm_global.h | 53 +++++++ >>> include/drm/ttm/ttm_bo_driver.h | 54 ++++++- >>> include/drm/ttm/ttm_memory.h | 4 +- >>> 34 files changed, 1102 insertions(+), 65 deletions(-) >>> create mode 100644 drivers/gpu/drm/drm_global.c >>> create mode 100644 include/drm/drm_global.h >>> >>
On Tue, Apr 16, 2019 at 11:12 AM Koenig, Christian <Christian.Koenig@amd.com> wrote: > > Am 16.04.19 um 11:10 schrieb Karol Herbst: > > On Tue, Apr 16, 2019 at 8:38 AM Christian König > > <ckoenig.leichtzumerken@gmail.com> wrote: > >> Am 16.04.19 um 02:35 schrieb Karol Herbst: > >>> Kobjects are supposed to be dynamically allocated, but with recent changes > >>> this rule was violated. Reverting those commits fixes crashes when a drm > >>> driver using TTM gets loaded again. > >>> > >>> The object in question is "ttm_mem_glob" declared inside > >>> "include/drm/ttm/ttm_memory.h" and instatiated inside > >>> "drivers/gpu/drm/ttm/ttm_memory.c". > >>> > >>> from "Documentation/kobject.txt": > >>> "Because kobjects are dynamic, they must not be declared statically or on > >>> the stack, but instead, always allocated dynamically. Future versions of > >>> the kernel will contain a run-time check for kobjects that are created > >>> statically and will warn the developer of this improper usage." > >>> > >>> Unloading ttm before reloading the driver workarounds that crash, because > >>> the memory backing the kobject member "kobj" is cleaned up. The kobject_del > >>> and kobject_put function never free or clean up the kobject object leaving > >>> it in an undefined state. > >>> > >>> I reverted a few more commits to make it less painful for me to rever this > >>> rather big change. > >> Well, NAK. By reverting those change you also re-introduced the problems > >> we originally fixed with those patches. > >> > >> Please work on a proper fix instead, > >> Christian. > > And which problem was that besides duplicated code (or maybe even a > > bit of memory consumption if multiple ttm driver were used)? If I had > > to choose between duplicated code and a crash, I choose the former. > > > > Maybe I missed the real reason why those changes are made, but the > > commit messages don't really seem to tell me. > > The old implementation crashed because different drivers tried to > allocate the same kobj. > > Crashing in one way is not better than crashing in another way. > > Christian. > how can that old crash be triggered? By loading two ttm based drivers at the same time or by other means? > > > >>> dmesg output: > >>> [54758.418036] kobject (00000000687a067d): tried to init an initialized object, something is seriously wrong. > >>> [54758.418040] CPU: 6 PID: 26746 Comm: insmod Tainted: G U OE 5.0.6-200.fc29.x86_64 #1 > >>> [54758.418041] Hardware name: Dell Inc. XPS 15 9560/05FFDN, BIOS 1.12.1 10/02/2018 > >>> [54758.418041] Call Trace: > >>> [54758.418049] dump_stack+0x5c/0x80 > >>> [54758.418054] kobject_init.cold.9+0x31/0x3f > >>> [54758.418057] kobject_init_and_add+0x35/0xa0 > >>> [54758.418063] ttm_mem_global_init+0x8f/0x2b0 [ttm] > >>> [54758.418067] ? __debugfs_create_file+0xe1/0x110 > >>> [54758.418071] ttm_bo_device_init+0x198/0x2a0 [ttm] > >>> [54758.418144] nouveau_ttm_init+0xbf/0x340 [nouveau] > >>> [54758.418206] nouveau_drm_device_init+0x125/0x7d0 [nouveau] > >>> [54758.418210] ? pci_bus_read_config_word+0x49/0x70 > >>> [54758.418266] nouveau_drm_probe+0x26f/0x2c0 [nouveau] > >>> [54758.418270] local_pci_probe+0x41/0x90 > >>> [54758.418272] pci_device_probe+0x118/0x1a0 > >>> [54758.418275] really_probe+0xf8/0x3b0 > >>> [54758.418277] driver_probe_device+0xb3/0xf0 > >>> [54758.418278] __driver_attach+0xdd/0x110 > >>> [54758.418280] ? driver_probe_device+0xf0/0xf0 > >>> [54758.418282] bus_for_each_dev+0x77/0xc0 > >>> [54758.418285] ? klist_add_tail+0x3b/0x60 > >>> [54758.418287] bus_add_driver+0x152/0x230 > >>> [54758.418288] ? 0xffffffffc1027000 > >>> [54758.418290] driver_register+0x6b/0xb0 > >>> [54758.418291] ? 0xffffffffc1027000 > >>> [54758.418294] do_one_initcall+0x46/0x1c3 > >>> [54758.418296] ? _cond_resched+0x15/0x30 > >>> [54758.418299] ? kmem_cache_alloc_trace+0x154/0x1d0 > >>> [54758.418302] do_init_module+0x5a/0x210 > >>> [54758.418304] load_module+0x2096/0x22d0 > >>> [54758.418308] ? ima_post_read_file+0xf4/0x100 > >>> [54758.418310] ? __do_sys_finit_module+0xa8/0x110 > >>> [54758.418312] __do_sys_finit_module+0xa8/0x110 > >>> [54758.418315] do_syscall_64+0x5b/0x160 > >>> [54758.418317] entry_SYSCALL_64_after_hwframe+0x44/0xa9 > >>> [54758.418319] RIP: 0033:0x7fc0b38b6edd > >>> [54758.418321] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 7b 7f 0c 00 f7 d8 64 89 01 48 > >>> [54758.418322] RSP: 002b:00007ffc7f1620d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000139 > >>> [54758.418323] RAX: ffffffffffffffda RBX: 00005629a3996ac0 RCX: 00007fc0b38b6edd > >>> [54758.418324] RDX: 0000000000000000 RSI: 00005629a3996260 RDI: 0000000000000003 > >>> [54758.418325] RBP: 00005629a3996260 R08: 0000000000000000 R09: 0000000000000000 > >>> [54758.418326] R10: 0000000000000003 R11: 0000000000000246 R12: 0000000000000000 > >>> [54758.418326] R13: 00005629a3996a80 R14: 0000000000000000 R15: 00005629a3996260 > >>> [54758.418346] BUG: unable to handle kernel paging request at 00000004ee194700 > >>> [54758.418348] #PF error: [WRITE] > >>> [54758.418349] PGD 0 P4D 0 > >>> [54758.418352] Oops: 0002 [#1] SMP PTI > >>> [54758.418354] CPU: 6 PID: 26746 Comm: insmod Tainted: G U OE 5.0.6-200.fc29.x86_64 #1 > >>> [54758.418355] Hardware name: Dell Inc. XPS 15 9560/05FFDN, BIOS 1.12.1 10/02/2018 > >>> [54758.418360] RIP: 0010:ttm_mem_global_init+0x1fe/0x2b0 [ttm] > >>> [54758.418361] Code: 00 00 00 48 89 5d 40 48 89 ab a0 00 00 00 e8 79 ba 95 f8 85 c0 0f 85 a2 00 00 00 8b 83 90 00 00 00 8d 50 01 89 93 90 00 00 00 <48> 89 ac c3 80 00 00 00 85 d2 0f 85 c6 99 00 00 48 8b 83 98 00 00 > >>> [54758.418363] RSP: 0018:ffffc24e53c1f988 EFLAGS: 00010246 > >>> [54758.418364] RAX: 00000000a5a2f300 RBX: ffffffffc101ae80 RCX: 0000000000000000 > >>> [54758.418366] RDX: 00000000a5a2f301 RSI: 0000000000000000 RDI: ffff9ebe78b67330 > >>> [54758.418367] RBP: ffff9ebe42c79d00 R08: 0000000000000044 R09: 0000000000000228 > >>> [54758.418368] R10: 0000000000000000 R11: ffff9ebb62720bb8 R12: 0000000000000000 > >>> [54758.418370] R13: ffff9ebe27d71640 R14: ffff9eba9f258600 R15: ffffffffc1453880 > >>> [54758.418371] FS: 00007fc0b379c280(0000) GS:ffff9ebeae380000(0000) knlGS:0000000000000000 > >>> [54758.418373] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > >>> [54758.418374] CR2: 00000004ee194700 CR3: 000000045069e005 CR4: 00000000003606e0 > >>> [54758.418375] Call Trace: > >>> [54758.418381] ttm_bo_device_init+0x198/0x2a0 [ttm] > >>> [54758.418439] nouveau_ttm_init+0xbf/0x340 [nouveau] > >>> [54758.418495] nouveau_drm_device_init+0x125/0x7d0 [nouveau] > >>> [54758.418499] ? pci_bus_read_config_word+0x49/0x70 > >>> [54758.418552] nouveau_drm_probe+0x26f/0x2c0 [nouveau] > >>> [54758.418556] local_pci_probe+0x41/0x90 > >>> [54758.418558] pci_device_probe+0x118/0x1a0 > >>> [54758.418561] really_probe+0xf8/0x3b0 > >>> [54758.418563] driver_probe_device+0xb3/0xf0 > >>> [54758.418565] __driver_attach+0xdd/0x110 > >>> [54758.418567] ? driver_probe_device+0xf0/0xf0 > >>> [54758.418570] bus_for_each_dev+0x77/0xc0 > >>> [54758.418573] ? klist_add_tail+0x3b/0x60 > >>> [54758.418574] bus_add_driver+0x152/0x230 > >>> [54758.418576] ? 0xffffffffc1027000 > >>> [54758.418578] driver_register+0x6b/0xb0 > >>> [54758.418580] ? 0xffffffffc1027000 > >>> [54758.418583] do_one_initcall+0x46/0x1c3 > >>> [54758.418585] ? _cond_resched+0x15/0x30 > >>> [54758.418587] ? kmem_cache_alloc_trace+0x154/0x1d0 > >>> [54758.418591] do_init_module+0x5a/0x210 > >>> [54758.418593] load_module+0x2096/0x22d0 > >>> [54758.418596] ? ima_post_read_file+0xf4/0x100 > >>> [54758.418599] ? __do_sys_finit_module+0xa8/0x110 > >>> [54758.418601] __do_sys_finit_module+0xa8/0x110 > >>> [54758.418604] do_syscall_64+0x5b/0x160 > >>> [54758.418607] entry_SYSCALL_64_after_hwframe+0x44/0xa9 > >>> [54758.418609] RIP: 0033:0x7fc0b38b6edd > >>> [54758.418610] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 7b 7f 0c 00 f7 d8 64 89 01 48 > >>> [54758.418612] RSP: 002b:00007ffc7f1620d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000139 > >>> [54758.418614] RAX: ffffffffffffffda RBX: 00005629a3996ac0 RCX: 00007fc0b38b6edd > >>> [54758.418615] RDX: 0000000000000000 RSI: 00005629a3996260 RDI: 0000000000000003 > >>> [54758.418616] RBP: 00005629a3996260 R08: 0000000000000000 R09: 0000000000000000 > >>> [54758.418618] R10: 0000000000000003 R11: 0000000000000246 R12: 0000000000000000 > >>> [54758.418619] R13: 00005629a3996a80 R14: 0000000000000000 R15: 00005629a3996260 > >>> [54758.418620] Modules linked in: nouveau(OE+) acpi_call(OE) ttm rfcomm ccm xt_CHECKSUM ipt_MASQUERADE tun bridge stp llc devlink ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ebtable_nat ip6table_nat nf_nat_ipv6 ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat_ipv4 nf_nat iptable_mangle iptable_raw iptable_security nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c ip_set nfnetlink ebtable_filter ebtables ip6table_filter ip6_tables cmac bnep sunrpc vfat fat btusb btrtl btbcm btintel uvcvideo videobuf2_vmalloc bluetooth videobuf2_memops videobuf2_v4l2 videobuf2_common videodev media ecdh_generic arc4 snd_hda_codec_hdmi ath10k_pci ath10k_core snd_hda_codec_realtek iTCO_wdt mei_wdt iTCO_vendor_support snd_hda_codec_generic mac80211 intel_rapl snd_hda_intel x86_pkg_temp_thermal intel_powerclamp coretemp snd_hda_codec dell_laptop ledtrig_audio dell_smm_hwmon snd_hda_core kvm_intel snd_hwdep ath snd_seq snd_seq_device intel_cstate cfg80211 snd_pcm dell_wmi intel_uncore > >>> [54758.418653] dell_smbios dcdbas intel_rapl_perf snd_timer joydev dell_wmi_descriptor intel_wmi_thunderbolt wmi_bmof snd idma64 soundcore i2c_i801 rfkill rtsx_pci_ms memstick mei_me mei processor_thermal_device intel_lpss_pci intel_soc_dts_iosf intel_pch_thermal intel_lpss int3400_thermal acpi_pad acpi_thermal_rel int3403_thermal intel_hid pcc_cpufreq int340x_thermal_zone sparse_keymap binfmt_misc zram dm_crypt hid_multitouch i915 kvmgt mdev vfio kvm irqbypass i2c_algo_bit crct10dif_pclmul drm_kms_helper rtsx_pci_sdmmc crc32_pclmul mmc_core crc32c_intel mxm_wmi drm nvme ghash_clmulni_intel nvme_core serio_raw rtsx_pci i2c_hid video wmi i2c_dev lz4 lz4_compress [last unloaded: nouveau] > >>> [54758.418679] CR2: 00000004ee194700 > >>> [54758.418681] ---[ end trace c5175234e6efc034 ]--- > >>> [54758.418686] RIP: 0010:ttm_mem_global_init+0x1fe/0x2b0 [ttm] > >>> [54758.418688] Code: 00 00 00 48 89 5d 40 48 89 ab a0 00 00 00 e8 79 ba 95 f8 85 c0 0f 85 a2 00 00 00 8b 83 90 00 00 00 8d 50 01 89 93 90 00 00 00 <48> 89 ac c3 80 00 00 00 85 d2 0f 85 c6 99 00 00 48 8b 83 98 00 00 > >>> [54758.418689] RSP: 0018:ffffc24e53c1f988 EFLAGS: 00010246 > >>> [54758.418690] RAX: 00000000a5a2f300 RBX: ffffffffc101ae80 RCX: 0000000000000000 > >>> [54758.418691] RDX: 00000000a5a2f301 RSI: 0000000000000000 RDI: ffff9ebe78b67330 > >>> [54758.418692] RBP: ffff9ebe42c79d00 R08: 0000000000000044 R09: 0000000000000228 > >>> [54758.418694] R10: 0000000000000000 R11: ffff9ebb62720bb8 R12: 0000000000000000 > >>> [54758.418695] R13: ffff9ebe27d71640 R14: ffff9eba9f258600 R15: ffffffffc1453880 > >>> [54758.418696] FS: 00007fc0b379c280(0000) GS:ffff9ebeae380000(0000) knlGS:0000000000000000 > >>> [54758.418698] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > >>> [54758.418699] CR2: 00000004ee194700 CR3: 000000045069e005 CR4: 00000000003606e0 > >>> > >>> Karol Herbst (6): > >>> Revert "drm: Remove drm_global.{c,h} v2" > >>> Revert "drm/ttm: initialize globals during device init (v2)" > >>> Revert "drm/ttm: Fix bo_global and mem_global kfree error" > >>> Revert "drm/ttm: use a static ttm_bo_global instance" > >>> Revert "drm/ttm: make the device list mutex static" > >>> Revert "drm/ttm: use a static ttm_mem_global instance" > >>> > >>> drivers/gpu/drm/Makefile | 2 +- > >>> drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 103 ++++++++++++- > >>> drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h | 2 + > >>> drivers/gpu/drm/ast/ast_drv.h | 2 + > >>> drivers/gpu/drm/ast/ast_ttm.c | 64 ++++++++ > >>> drivers/gpu/drm/bochs/bochs.h | 2 + > >>> drivers/gpu/drm/bochs/bochs_mm.c | 61 ++++++++ > >>> drivers/gpu/drm/cirrus/cirrus_drv.h | 2 + > >>> drivers/gpu/drm/cirrus/cirrus_ttm.c | 64 ++++++++ > >>> drivers/gpu/drm/drm_drv.c | 2 + > >>> drivers/gpu/drm/drm_global.c | 137 ++++++++++++++++++ > >>> .../gpu/drm/hisilicon/hibmc/hibmc_drm_drv.h | 2 + > >>> drivers/gpu/drm/hisilicon/hibmc/hibmc_ttm.c | 57 ++++++++ > >>> drivers/gpu/drm/mgag200/mgag200_drv.h | 2 + > >>> drivers/gpu/drm/mgag200/mgag200_ttm.c | 64 ++++++++ > >>> drivers/gpu/drm/nouveau/nouveau_drv.h | 2 + > >>> drivers/gpu/drm/nouveau/nouveau_ttm.c | 67 +++++++++ > >>> drivers/gpu/drm/qxl/qxl_drv.h | 3 + > >>> drivers/gpu/drm/qxl/qxl_ttm.c | 61 ++++++++ > >>> drivers/gpu/drm/radeon/radeon.h | 3 + > >>> drivers/gpu/drm/radeon/radeon_ttm.c | 65 +++++++++ > >>> drivers/gpu/drm/ttm/ttm_bo.c | 67 +++------ > >>> drivers/gpu/drm/ttm/ttm_memory.c | 14 +- > >>> drivers/gpu/drm/virtio/virtgpu_drv.h | 3 + > >>> drivers/gpu/drm/virtio/virtgpu_ttm.c | 62 ++++++++ > >>> drivers/gpu/drm/vmwgfx/vmwgfx_drv.c | 15 +- > >>> drivers/gpu/drm/vmwgfx/vmwgfx_drv.h | 6 +- > >>> drivers/gpu/drm/vmwgfx/vmwgfx_ttm_glue.c | 54 +++++++ > >>> drivers/staging/vboxvideo/vbox_drv.h | 2 + > >>> drivers/staging/vboxvideo/vbox_ttm.c | 65 ++++++++- > >>> include/drm/drmP.h | 1 + > >>> include/drm/drm_global.h | 53 +++++++ > >>> include/drm/ttm/ttm_bo_driver.h | 54 ++++++- > >>> include/drm/ttm/ttm_memory.h | 4 +- > >>> 34 files changed, 1102 insertions(+), 65 deletions(-) > >>> create mode 100644 drivers/gpu/drm/drm_global.c > >>> create mode 100644 include/drm/drm_global.h > >>> > >> > >
Am 16.04.19 um 12:54 schrieb Karol Herbst: > On Tue, Apr 16, 2019 at 11:12 AM Koenig, Christian > <Christian.Koenig@amd.com> wrote: >> Am 16.04.19 um 11:10 schrieb Karol Herbst: >>> On Tue, Apr 16, 2019 at 8:38 AM Christian König >>> <ckoenig.leichtzumerken@gmail.com> wrote: >>>> Am 16.04.19 um 02:35 schrieb Karol Herbst: >>>>> Kobjects are supposed to be dynamically allocated, but with recent changes >>>>> this rule was violated. Reverting those commits fixes crashes when a drm >>>>> driver using TTM gets loaded again. >>>>> >>>>> The object in question is "ttm_mem_glob" declared inside >>>>> "include/drm/ttm/ttm_memory.h" and instatiated inside >>>>> "drivers/gpu/drm/ttm/ttm_memory.c". >>>>> >>>>> from "Documentation/kobject.txt": >>>>> "Because kobjects are dynamic, they must not be declared statically or on >>>>> the stack, but instead, always allocated dynamically. Future versions of >>>>> the kernel will contain a run-time check for kobjects that are created >>>>> statically and will warn the developer of this improper usage." >>>>> >>>>> Unloading ttm before reloading the driver workarounds that crash, because >>>>> the memory backing the kobject member "kobj" is cleaned up. The kobject_del >>>>> and kobject_put function never free or clean up the kobject object leaving >>>>> it in an undefined state. >>>>> >>>>> I reverted a few more commits to make it less painful for me to rever this >>>>> rather big change. >>>> Well, NAK. By reverting those change you also re-introduced the problems >>>> we originally fixed with those patches. >>>> >>>> Please work on a proper fix instead, >>>> Christian. >>> And which problem was that besides duplicated code (or maybe even a >>> bit of memory consumption if multiple ttm driver were used)? If I had >>> to choose between duplicated code and a crash, I choose the former. >>> >>> Maybe I missed the real reason why those changes are made, but the >>> commit messages don't really seem to tell me. >> The old implementation crashed because different drivers tried to >> allocate the same kobj. >> >> Crashing in one way is not better than crashing in another way. >> >> Christian. >> > how can that old crash be triggered? By loading two ttm based drivers > at the same time or by other means? Yes, exactly. Even worse it could be triggered by one driver instantiating multiple times at the same time, e.g two AMD GPUs in the same system. On the other hand I completely agree that using kobj static is completely nuts. The problem is that using a kobj was the wrong approach in the first place. In other words when you have something which controls a global behavior of a module, what do you use? Right, a module parameter. Point is that we can't get away from those kobj without breaking the uapi, so that is something which can't be done easily. Regards, Christian.
On Tue, Apr 16, 2019 at 1:07 PM Koenig, Christian <Christian.Koenig@amd.com> wrote: > > Am 16.04.19 um 12:54 schrieb Karol Herbst: > > On Tue, Apr 16, 2019 at 11:12 AM Koenig, Christian > > <Christian.Koenig@amd.com> wrote: > >> Am 16.04.19 um 11:10 schrieb Karol Herbst: > >>> On Tue, Apr 16, 2019 at 8:38 AM Christian König > >>> <ckoenig.leichtzumerken@gmail.com> wrote: > >>>> Am 16.04.19 um 02:35 schrieb Karol Herbst: > >>>>> Kobjects are supposed to be dynamically allocated, but with recent changes > >>>>> this rule was violated. Reverting those commits fixes crashes when a drm > >>>>> driver using TTM gets loaded again. > >>>>> > >>>>> The object in question is "ttm_mem_glob" declared inside > >>>>> "include/drm/ttm/ttm_memory.h" and instatiated inside > >>>>> "drivers/gpu/drm/ttm/ttm_memory.c". > >>>>> > >>>>> from "Documentation/kobject.txt": > >>>>> "Because kobjects are dynamic, they must not be declared statically or on > >>>>> the stack, but instead, always allocated dynamically. Future versions of > >>>>> the kernel will contain a run-time check for kobjects that are created > >>>>> statically and will warn the developer of this improper usage." > >>>>> > >>>>> Unloading ttm before reloading the driver workarounds that crash, because > >>>>> the memory backing the kobject member "kobj" is cleaned up. The kobject_del > >>>>> and kobject_put function never free or clean up the kobject object leaving > >>>>> it in an undefined state. > >>>>> > >>>>> I reverted a few more commits to make it less painful for me to rever this > >>>>> rather big change. > >>>> Well, NAK. By reverting those change you also re-introduced the problems > >>>> we originally fixed with those patches. > >>>> > >>>> Please work on a proper fix instead, > >>>> Christian. > >>> And which problem was that besides duplicated code (or maybe even a > >>> bit of memory consumption if multiple ttm driver were used)? If I had > >>> to choose between duplicated code and a crash, I choose the former. > >>> > >>> Maybe I missed the real reason why those changes are made, but the > >>> commit messages don't really seem to tell me. > >> The old implementation crashed because different drivers tried to > >> allocate the same kobj. > >> > >> Crashing in one way is not better than crashing in another way. > >> > >> Christian. > >> > > how can that old crash be triggered? By loading two ttm based drivers > > at the same time or by other means? > > Yes, exactly. Even worse it could be triggered by one driver > instantiating multiple times at the same time, e.g two AMD GPUs in the > same system. > > On the other hand I completely agree that using kobj static is > completely nuts. The problem is that using a kobj was the wrong approach > in the first place. > > In other words when you have something which controls a global behavior > of a module, what do you use? Right, a module parameter. > > Point is that we can't get away from those kobj without breaking the > uapi, so that is something which can't be done easily. > > Regards, > Christian. okay, thanks for the explanation! Currently I will be testing your first patch you sent out on top of 5.0.7, so that we could propose that for stable. The second didn't apply, but it didn't look relevant for the actual fix. Will reply to the patch with my results then. Thanks
Am 16.04.19 um 13:20 schrieb Karol Herbst: > On Tue, Apr 16, 2019 at 1:07 PM Koenig, Christian > <Christian.Koenig@amd.com> wrote: >> Am 16.04.19 um 12:54 schrieb Karol Herbst: >>> On Tue, Apr 16, 2019 at 11:12 AM Koenig, Christian >>> <Christian.Koenig@amd.com> wrote: >>>> Am 16.04.19 um 11:10 schrieb Karol Herbst: >>>>> On Tue, Apr 16, 2019 at 8:38 AM Christian König >>>>> <ckoenig.leichtzumerken@gmail.com> wrote: >>>>>> Am 16.04.19 um 02:35 schrieb Karol Herbst: >>>>>>> Kobjects are supposed to be dynamically allocated, but with recent changes >>>>>>> this rule was violated. Reverting those commits fixes crashes when a drm >>>>>>> driver using TTM gets loaded again. >>>>>>> >>>>>>> The object in question is "ttm_mem_glob" declared inside >>>>>>> "include/drm/ttm/ttm_memory.h" and instatiated inside >>>>>>> "drivers/gpu/drm/ttm/ttm_memory.c". >>>>>>> >>>>>>> from "Documentation/kobject.txt": >>>>>>> "Because kobjects are dynamic, they must not be declared statically or on >>>>>>> the stack, but instead, always allocated dynamically. Future versions of >>>>>>> the kernel will contain a run-time check for kobjects that are created >>>>>>> statically and will warn the developer of this improper usage." >>>>>>> >>>>>>> Unloading ttm before reloading the driver workarounds that crash, because >>>>>>> the memory backing the kobject member "kobj" is cleaned up. The kobject_del >>>>>>> and kobject_put function never free or clean up the kobject object leaving >>>>>>> it in an undefined state. >>>>>>> >>>>>>> I reverted a few more commits to make it less painful for me to rever this >>>>>>> rather big change. >>>>>> Well, NAK. By reverting those change you also re-introduced the problems >>>>>> we originally fixed with those patches. >>>>>> >>>>>> Please work on a proper fix instead, >>>>>> Christian. >>>>> And which problem was that besides duplicated code (or maybe even a >>>>> bit of memory consumption if multiple ttm driver were used)? If I had >>>>> to choose between duplicated code and a crash, I choose the former. >>>>> >>>>> Maybe I missed the real reason why those changes are made, but the >>>>> commit messages don't really seem to tell me. >>>> The old implementation crashed because different drivers tried to >>>> allocate the same kobj. >>>> >>>> Crashing in one way is not better than crashing in another way. >>>> >>>> Christian. >>>> >>> how can that old crash be triggered? By loading two ttm based drivers >>> at the same time or by other means? >> Yes, exactly. Even worse it could be triggered by one driver >> instantiating multiple times at the same time, e.g two AMD GPUs in the >> same system. >> >> On the other hand I completely agree that using kobj static is >> completely nuts. The problem is that using a kobj was the wrong approach >> in the first place. >> >> In other words when you have something which controls a global behavior >> of a module, what do you use? Right, a module parameter. >> >> Point is that we can't get away from those kobj without breaking the >> uapi, so that is something which can't be done easily. >> >> Regards, >> Christian. > okay, thanks for the explanation! > > Currently I will be testing your first patch you sent out on top of > 5.0.7, so that we could propose that for stable. The second didn't > apply, but it didn't look relevant for the actual fix. Will reply to > the patch with my results then. The second is just a further clean, please ignore that one for backporting. Let me know if the first one works on 5.0.7, if yes I'm going to add a CC stable tag before pushing. Thanks, Christian. > > Thanks > _______________________________________________ > dri-devel mailing list > dri-devel@lists.freedesktop.org > https://lists.freedesktop.org/mailman/listinfo/dri-devel
On Tue, Apr 16, 2019 at 11:06:54AM +0000, Koenig, Christian wrote: > Am 16.04.19 um 12:54 schrieb Karol Herbst: > > On Tue, Apr 16, 2019 at 11:12 AM Koenig, Christian > > <Christian.Koenig@amd.com> wrote: > >> Am 16.04.19 um 11:10 schrieb Karol Herbst: > >>> On Tue, Apr 16, 2019 at 8:38 AM Christian König > >>> <ckoenig.leichtzumerken@gmail.com> wrote: > >>>> Am 16.04.19 um 02:35 schrieb Karol Herbst: > >>>>> Kobjects are supposed to be dynamically allocated, but with recent changes > >>>>> this rule was violated. Reverting those commits fixes crashes when a drm > >>>>> driver using TTM gets loaded again. > >>>>> > >>>>> The object in question is "ttm_mem_glob" declared inside > >>>>> "include/drm/ttm/ttm_memory.h" and instatiated inside > >>>>> "drivers/gpu/drm/ttm/ttm_memory.c". > >>>>> > >>>>> from "Documentation/kobject.txt": > >>>>> "Because kobjects are dynamic, they must not be declared statically or on > >>>>> the stack, but instead, always allocated dynamically. Future versions of > >>>>> the kernel will contain a run-time check for kobjects that are created > >>>>> statically and will warn the developer of this improper usage." > >>>>> > >>>>> Unloading ttm before reloading the driver workarounds that crash, because > >>>>> the memory backing the kobject member "kobj" is cleaned up. The kobject_del > >>>>> and kobject_put function never free or clean up the kobject object leaving > >>>>> it in an undefined state. > >>>>> > >>>>> I reverted a few more commits to make it less painful for me to rever this > >>>>> rather big change. > >>>> Well, NAK. By reverting those change you also re-introduced the problems > >>>> we originally fixed with those patches. > >>>> > >>>> Please work on a proper fix instead, > >>>> Christian. > >>> And which problem was that besides duplicated code (or maybe even a > >>> bit of memory consumption if multiple ttm driver were used)? If I had > >>> to choose between duplicated code and a crash, I choose the former. > >>> > >>> Maybe I missed the real reason why those changes are made, but the > >>> commit messages don't really seem to tell me. > >> The old implementation crashed because different drivers tried to > >> allocate the same kobj. > >> > >> Crashing in one way is not better than crashing in another way. > >> > >> Christian. > >> > > how can that old crash be triggered? By loading two ttm based drivers > > at the same time or by other means? > > Yes, exactly. Even worse it could be triggered by one driver > instantiating multiple times at the same time, e.g two AMD GPUs in the > same system. > > On the other hand I completely agree that using kobj static is > completely nuts. The problem is that using a kobj was the wrong approach > in the first place. > > In other words when you have something which controls a global behavior > of a module, what do you use? Right, a module parameter. > > Point is that we can't get away from those kobj without breaking the > uapi, so that is something which can't be done easily. Randome idea: Push the kobj setup into drm.ko (and shrugg it off as another lesson in how maybe uapi shouldn't have been designed, but hey not our worst mistake by far). I think that'd be totally ok. -Daniel
Am 16.04.19 um 14:18 schrieb Daniel Vetter: > On Tue, Apr 16, 2019 at 11:06:54AM +0000, Koenig, Christian wrote: >> Am 16.04.19 um 12:54 schrieb Karol Herbst: >>> On Tue, Apr 16, 2019 at 11:12 AM Koenig, Christian >>> <Christian.Koenig@amd.com> wrote: >>>> Am 16.04.19 um 11:10 schrieb Karol Herbst: >>>>> On Tue, Apr 16, 2019 at 8:38 AM Christian König >>>>> <ckoenig.leichtzumerken@gmail.com> wrote: >>>>>> Am 16.04.19 um 02:35 schrieb Karol Herbst: >>>>>>> Kobjects are supposed to be dynamically allocated, but with recent changes >>>>>>> this rule was violated. Reverting those commits fixes crashes when a drm >>>>>>> driver using TTM gets loaded again. >>>>>>> >>>>>>> The object in question is "ttm_mem_glob" declared inside >>>>>>> "include/drm/ttm/ttm_memory.h" and instatiated inside >>>>>>> "drivers/gpu/drm/ttm/ttm_memory.c". >>>>>>> >>>>>>> from "Documentation/kobject.txt": >>>>>>> "Because kobjects are dynamic, they must not be declared statically or on >>>>>>> the stack, but instead, always allocated dynamically. Future versions of >>>>>>> the kernel will contain a run-time check for kobjects that are created >>>>>>> statically and will warn the developer of this improper usage." >>>>>>> >>>>>>> Unloading ttm before reloading the driver workarounds that crash, because >>>>>>> the memory backing the kobject member "kobj" is cleaned up. The kobject_del >>>>>>> and kobject_put function never free or clean up the kobject object leaving >>>>>>> it in an undefined state. >>>>>>> >>>>>>> I reverted a few more commits to make it less painful for me to rever this >>>>>>> rather big change. >>>>>> Well, NAK. By reverting those change you also re-introduced the problems >>>>>> we originally fixed with those patches. >>>>>> >>>>>> Please work on a proper fix instead, >>>>>> Christian. >>>>> And which problem was that besides duplicated code (or maybe even a >>>>> bit of memory consumption if multiple ttm driver were used)? If I had >>>>> to choose between duplicated code and a crash, I choose the former. >>>>> >>>>> Maybe I missed the real reason why those changes are made, but the >>>>> commit messages don't really seem to tell me. >>>> The old implementation crashed because different drivers tried to >>>> allocate the same kobj. >>>> >>>> Crashing in one way is not better than crashing in another way. >>>> >>>> Christian. >>>> >>> how can that old crash be triggered? By loading two ttm based drivers >>> at the same time or by other means? >> Yes, exactly. Even worse it could be triggered by one driver >> instantiating multiple times at the same time, e.g two AMD GPUs in the >> same system. >> >> On the other hand I completely agree that using kobj static is >> completely nuts. The problem is that using a kobj was the wrong approach >> in the first place. >> >> In other words when you have something which controls a global behavior >> of a module, what do you use? Right, a module parameter. >> >> Point is that we can't get away from those kobj without breaking the >> uapi, so that is something which can't be done easily. > Randome idea: Push the kobj setup into drm.ko (and shrugg it off as > another lesson in how maybe uapi shouldn't have been designed, but hey not > our worst mistake by far). I think that'd be totally ok. Yeah, thought about that as well. But I would rather re-design this from the scratch instead of just moving it over. And yes I agree with a bit of luck that UAPI is actually not used at all, so we could remove it sooner or later. Regards, Christian. > -Daniel
Christian König <ckoenig.leichtzumerken@gmail.com> writes: > Am 16.04.19 um 02:35 schrieb Karol Herbst: >> Kobjects are supposed to be dynamically allocated, but with recent changes >> this rule was violated. Reverting those commits fixes crashes when a drm >> driver using TTM gets loaded again. >> >> The object in question is "ttm_mem_glob" declared inside >> "include/drm/ttm/ttm_memory.h" and instatiated inside >> "drivers/gpu/drm/ttm/ttm_memory.c". >> >> from "Documentation/kobject.txt": >> "Because kobjects are dynamic, they must not be declared statically or on >> the stack, but instead, always allocated dynamically. Future versions of >> the kernel will contain a run-time check for kobjects that are created >> statically and will warn the developer of this improper usage." >> >> Unloading ttm before reloading the driver workarounds that crash, because >> the memory backing the kobject member "kobj" is cleaned up. The kobject_del >> and kobject_put function never free or clean up the kobject object leaving >> it in an undefined state. >> >> I reverted a few more commits to make it less painful for me to rever this >> rather big change. > > Well, NAK. By reverting those change you also re-introduced the problems > we originally fixed with those patches. > > Please work on a proper fix instead, That's not Karol's responsibility, that's yours as the author. I would like to remind about Linux's regressions policy, quoting from Documentation/process/4.Coding.rst: "One final hazard worth mentioning is this: it can be tempting to make a change (which may bring big improvements) which causes something to break for existing users. This kind of change is called a "regression," and regressions have become most unwelcome in the mainline kernel. With few exceptions, changes which cause regressions will be backed out if the regression cannot be fixed in a timely manner. Far better to avoid the regression in the first place. It is often argued that a regression can be justified if it causes things to work for more people than it creates problems for. Why not make a change if it brings new functionality to ten systems for each one it breaks? The best answer to this question was expressed by Linus in July, 2007: :: So we don't fix bugs by introducing new problems. That way lies madness, and nobody ever knows if you actually make any real progress at all. Is it two steps forwards, one step back, or one step forward and two steps back?"
On Wed, Apr 17, 2019 at 1:09 AM Eric Anholt <eric@anholt.net> wrote: > > Christian König <ckoenig.leichtzumerken@gmail.com> writes: > > > Am 16.04.19 um 02:35 schrieb Karol Herbst: > >> Kobjects are supposed to be dynamically allocated, but with recent changes > >> this rule was violated. Reverting those commits fixes crashes when a drm > >> driver using TTM gets loaded again. > >> > >> The object in question is "ttm_mem_glob" declared inside > >> "include/drm/ttm/ttm_memory.h" and instatiated inside > >> "drivers/gpu/drm/ttm/ttm_memory.c". > >> > >> from "Documentation/kobject.txt": > >> "Because kobjects are dynamic, they must not be declared statically or on > >> the stack, but instead, always allocated dynamically. Future versions of > >> the kernel will contain a run-time check for kobjects that are created > >> statically and will warn the developer of this improper usage." > >> > >> Unloading ttm before reloading the driver workarounds that crash, because > >> the memory backing the kobject member "kobj" is cleaned up. The kobject_del > >> and kobject_put function never free or clean up the kobject object leaving > >> it in an undefined state. > >> > >> I reverted a few more commits to make it less painful for me to rever this > >> rather big change. > > > > Well, NAK. By reverting those change you also re-introduced the problems > > we originally fixed with those patches. > > > > Please work on a proper fix instead, > > That's not Karol's responsibility, that's yours as the author. I would > like to remind about Linux's regressions policy, quoting from > Documentation/process/4.Coding.rst: > > "One final hazard worth mentioning is this: it can be tempting to make a > change (which may bring big improvements) which causes something to break > for existing users. This kind of change is called a "regression," and > regressions have become most unwelcome in the mainline kernel. With few > exceptions, changes which cause regressions will be backed out if the > regression cannot be fixed in a timely manner. Far better to avoid the > regression in the first place. > > It is often argued that a regression can be justified if it causes things > to work for more people than it creates problems for. Why not make a > change if it brings new functionality to ten systems for each one it > breaks? The best answer to this question was expressed by Linus in July, > 2007: > > :: > > So we don't fix bugs by introducing new problems. That way lies > madness, and nobody ever knows if you actually make any real > progress at all. Is it two steps forwards, one step back, or one > step forward and two steps back?" he already wrote a fix. Search for "[PATCH 1/2] drm/ttm: fix re-init of global structures"