Message ID | 20250219150009.1662688-1-alex.bennee@linaro.org (mailing list archive) |
---|---|
Series | testing/next (aarch64 virt gpu tests) |
On Wed, 19 Feb 2025 at 15:00, Alex Bennée <alex.bennee@linaro.org> wrote:
>
> Hi,
>
> As I was looking at the native context patches I realised our existing
> GPU testing is a little sparse. I took the opportunity to split the
> test from the main virt test and then extend it to exercise the 3
> current display modes (virgl, virgl+blobs, vulkan).
>
> I've added some additional validation to ensure we have the devices we
> expect before we start. It doesn't currently address the reported
> clang issues but hopefully it will help narrow down what fails and
> what works.

Running on my setup with a clang sanitizer build I get

ok 1 test_aarch64_virt_gpu.Aarch64VirtGPUMachine.test_aarch64_virt_with_virgl_blobs_gpu
ok 2 test_aarch64_virt_gpu.Aarch64VirtGPUMachine.test_aarch64_virt_with_virgl_gpu

and then the third test timed out.

For the timing out case, the console prints

2025-02-20 11:12:55,208: # weston -B headless --renderer gl --shell kiosk -- vkmark -b:duration=1.0
2025-02-20 11:12:55,288: Date: 2025-02-20 UTC
2025-02-20 11:12:55,288: [11:12:54.841] weston 14.0.0
2025-02-20 11:12:55,289: https://wayland.freedesktop.org
2025-02-20 11:12:55,289: Bug reports to: https://gitlab.freedesktop.org/wayland/weston/issues/
2025-02-20 11:12:55,289: Build: 14.0.0
2025-02-20 11:12:55,291: [11:12:54.847] Command line: weston -B headless --renderer gl --shell kiosk -- vkmark -b:duration=1.0
2025-02-20 11:12:55,298: [11:12:54.850] OS: Linux, 6.11.10, #2 SMP Thu Dec 5 16:27:12 GMT 2024, aarch64
2025-02-20 11:12:55,299: [11:12:54.855] Flight recorder: enabled
2025-02-20 11:12:55,300: [11:12:54.857] warning: XDG_RUNTIME_DIR "/tmp" is not configured
2025-02-20 11:12:55,301: correctly. Unix access mode must be 0700 (current mode is 0777),
2025-02-20 11:12:55,301: and must be owned by the user UID 0 (current owner is UID 0).
2025-02-20 11:12:55,302: Refer to your distribution on how to get it, or
2025-02-20 11:12:55,302: http://www.freedesktop.org/wiki/Specifications/basedir-spec
2025-02-20 11:12:55,302: on how to implement it.
2025-02-20 11:12:55,308: [11:12:54.865] Starting with no config file.
2025-02-20 11:12:55,322: [11:12:54.879] Output repaint window is 7 ms maximum.
2025-02-20 11:12:55,333: [11:12:54.890] Loading module '/usr/lib/libweston-14/headless-backend.so'
2025-02-20 11:12:55,407: [11:12:54.963] Loading module '/usr/lib/libweston-14/gl-renderer.so'
2025-02-20 11:13:06,936: [11:13:06.491] Using rendering device: /dev/dri/renderD128
2025-02-20 11:13:07,083: [11:13:06.640] EGL version: 1.5
2025-02-20 11:13:07,084: [11:13:06.641] EGL vendor: Mesa Project
2025-02-20 11:13:07,085: [11:13:06.641] EGL client APIs: OpenGL OpenGL_ES
2025-02-20 11:13:07,088: [11:13:06.645] EGL features:
2025-02-20 11:13:07,089: EGL Wayland extension: yes
2025-02-20 11:13:07,089: context priority: no
2025-02-20 11:13:07,089: buffer age: no
2025-02-20 11:13:07,089: partial update: no
2025-02-20 11:13:07,090: swap buffers with damage: no
2025-02-20 11:13:07,090: configless context: yes
2025-02-20 11:13:07,090: surfaceless context: yes
2025-02-20 11:13:07,090: dmabuf support: modifiers
2025-02-20 11:13:07,206: [11:13:06.763] GL version: OpenGL ES 3.2 Mesa 24.3.0
2025-02-20 11:13:07,207: [11:13:06.764] GLSL version: OpenGL ES GLSL ES 3.20
2025-02-20 11:13:07,207: [11:13:06.764] GL vendor: Mesa
2025-02-20 11:13:07,208: [11:13:06.764] GL renderer: virgl (Quadro P400/PCIe/SSE2)
2025-02-20 11:13:08,201: [11:13:07.757] GL ES 3.2 - renderer features:
2025-02-20 11:13:08,202: read-back format: ARGB8888
2025-02-20 11:13:08,203: glReadPixels supports y-flip: yes
2025-02-20 11:13:08,203: glReadPixels supports PBO: yes
2025-02-20 11:13:08,204: wl_shm 10 bpc formats: yes
2025-02-20 11:13:08,204: wl_shm 16 bpc formats: yes
2025-02-20 11:13:08,205: wl_shm half-float formats: yes
2025-02-20 11:13:08,206: internal R and RG formats: yes
2025-02-20 11:13:08,209: OES_EGL_image_external: yes
2025-02-20 11:13:08,210: [11:13:07.767] Using GL renderer
2025-02-20 11:13:08,211: [11:13:07.768] Registered plugin API 'weston_windowed_output_api_headless_v2' of size 16
2025-02-20 11:13:08,215: [11:13:07.772] Color manager: no-op
2025-02-20 11:13:08,216: protocol support: no
2025-02-20 11:13:08,226: [11:13:07.782] Output 'headless' attempts EOTF mode SDR and colorimetry mode default.
2025-02-20 11:13:08,227: [11:13:07.784] Output 'headless' using color profile: stock sRGB color profile

and that's the last thing it outputs.

The sanitizer reports that when the framework sends the SIGTERM
because of the timeout we get a write to a NULL pointer (but
interestingly not this time in an atexit callback):

UndefinedBehaviorSanitizer:DEADLYSIGNAL
==471863==ERROR: UndefinedBehaviorSanitizer: SEGV on unknown address 0x000000000000 (pc 0x7a18ceaafe80 bp 0x000000000000 sp 0x7ffe8e3ff6d0 T471863)
==471863==The signal is caused by a WRITE memory access.
==471863==Hint: address points to the zero page.
    #0 0x7a18ceaafe80 (/lib/x86_64-linux-gnu/libnvidia-eglcore.so.535.183.01+0x16afe80) (BuildId: 24b0d0b90369112e3de888a93eb8d7e00304a6db)
    #1 0x7a18ce9e72c0 (/lib/x86_64-linux-gnu/libnvidia-eglcore.so.535.183.01+0x15e72c0) (BuildId: 24b0d0b90369112e3de888a93eb8d7e00304a6db)
    #2 0x7a18ce9f11bb (/lib/x86_64-linux-gnu/libnvidia-eglcore.so.535.183.01+0x15f11bb) (BuildId: 24b0d0b90369112e3de888a93eb8d7e00304a6db)
    #3 0x7a18ce6dc9d1 (/lib/x86_64-linux-gnu/libnvidia-eglcore.so.535.183.01+0x12dc9d1) (BuildId: 24b0d0b90369112e3de888a93eb8d7e00304a6db)
    #4 0x7a18e7d15326 in vrend_renderer_create_fence /usr/src/virglrenderer-1.0.0-1ubuntu2/obj-x86_64-linux-gnu/../src/vrend_renderer.c:10883:26
    #5 0x55bfb6621871 in virtio_gpu_virgl_process_cmd /mnt/nvmedisk/linaro/qemu-from-laptop/qemu/build/arm-clang/../../hw/display/virtio-gpu-virgl.c:973:5
    #6 0x55bfb66086ba in virtio_gpu_process_cmdq /mnt/nvmedisk/linaro/qemu-from-laptop/qemu/build/arm-clang/../../hw/display/virtio-gpu.c:1048:9
    #7 0x55bfb661b69b in virtio_gpu_gl_handle_ctrl /mnt/nvmedisk/linaro/qemu-from-laptop/qemu/build/arm-clang/../../hw/display/virtio-gpu-gl.c:100:5
    #8 0x55bfb74a7782 in aio_bh_call /mnt/nvmedisk/linaro/qemu-from-laptop/qemu/build/arm-clang/../../util/async.c:172:5
    #9 0x55bfb74a7b58 in aio_bh_poll /mnt/nvmedisk/linaro/qemu-from-laptop/qemu/build/arm-clang/../../util/async.c:219:13
    #10 0x55bfb74625ea in aio_dispatch /mnt/nvmedisk/linaro/qemu-from-laptop/qemu/build/arm-clang/../../util/aio-posix.c:424:5
    #11 0x55bfb74aaaea in aio_ctx_dispatch /mnt/nvmedisk/linaro/qemu-from-laptop/qemu/build/arm-clang/../../util/async.c:361:5
    #12 0x7a18e8dc15b4 in g_main_dispatch /usr/src/glib2.0-2.80.0-6ubuntu3.2/debian/build/deb/../../../glib/gmain.c:3344:28
    #13 0x7a18e8dc16ff in g_main_context_dispatch_unlocked /usr/src/glib2.0-2.80.0-6ubuntu3.2/debian/build/deb/../../../glib/gmain.c:4152:7
    #14 0x7a18e8dc16ff in g_main_context_dispatch /usr/src/glib2.0-2.80.0-6ubuntu3.2/debian/build/deb/../../../glib/gmain.c:4140:3
    #15 0x55bfb74ab96b in glib_pollfds_poll /mnt/nvmedisk/linaro/qemu-from-laptop/qemu/build/arm-clang/../../util/main-loop.c:287:9
    #16 0x55bfb74ab96b in os_host_main_loop_wait /mnt/nvmedisk/linaro/qemu-from-laptop/qemu/build/arm-clang/../../util/main-loop.c:310:5
    #17 0x55bfb74ab96b in main_loop_wait /mnt/nvmedisk/linaro/qemu-from-laptop/qemu/build/arm-clang/../../util/main-loop.c:589:11
    #18 0x55bfb64799e6 in qemu_main_loop /mnt/nvmedisk/linaro/qemu-from-laptop/qemu/build/arm-clang/../../system/runstate.c:835:9
    #19 0x55bfb7340356 in qemu_default_main /mnt/nvmedisk/linaro/qemu-from-laptop/qemu/build/arm-clang/../../system/main.c:48:14
    #20 0x55bfb734032e in main /mnt/nvmedisk/linaro/qemu-from-laptop/qemu/build/arm-clang/../../system/main.c:76:9
    #21 0x7a18e6a2a1c9 in __libc_start_call_main csu/../sysdeps/nptl/libc_start_call_main.h:58:16
    #22 0x7a18e6a2a28a in __libc_start_main csu/../csu/libc-start.c:360:3
    #23 0x55bfb59b6554 in _start (/mnt/nvmedisk/linaro/qemu-from-laptop/qemu/build/arm-clang/qemu-system-aarch64+0x15dd554) (BuildId: df0d680785eeda685de951dbbbbd220f5c9e773d)

UndefinedBehaviorSanitizer can not provide additional info.
SUMMARY: UndefinedBehaviorSanitizer: SEGV (/lib/x86_64-linux-gnu/libnvidia-eglcore.so.535.183.01+0x16afe80) (BuildId: 24b0d0b90369112e3de888a93eb8d7e00304a6db)
==471863==ABORTING

-- PMM
On Wed, 19 Feb 2025 at 15:00, Alex Bennée <alex.bennee@linaro.org> wrote:
>
> Hi,
>
> As I was looking at the native context patches I realised our existing
> GPU testing is a little sparse. I took the opportunity to split the
> test from the main virt test and then extend it to exercise the 3
> current display modes (virgl, virgl+blobs, vulkan).
>
> I've added some additional validation to ensure we have the devices we
> expect before we start. It doesn't currently address the reported
> clang issues but hopefully it will help narrow down what fails and
> what works.
>
> Once I've built some new buildroot images I'll re-spin with a whole
> bunch of additional test binaries available.

Running on a non-sanitizer debug build, I found that
aarch64_virt_with_virgl_gpu hit the timeout. Looking at the
output the last thing printed is

2025-02-20 11:46:36,864: [shadow] <default>: FPS: 45 FrameTime: 22.585 ms

That timestamp is 4 minutes into the test run, and the same
[shadow] test takes over 2 minutes on the with_virgl_blobs_gpu
test, so it looks like it just hit the 360s timeout and might
well have finished OK if it had been allowed to keep running.

Actually I'm surprised the other one didn't hit a timeout,
because its log timestamps show it running from 11:35:03,896
to 11:42:26,468 which is definitely more than 360s.

Is there a less time-intensive test of the virgl code
we can use? check-functional already has way too many
tests that take minutes to run...

-- PMM
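The "more than 360s" observation above can be checked with a few lines of Python. This is only a sketch of the arithmetic: the helper name `elapsed_seconds` and the constant names are made up for illustration, and it assumes both timestamps fall on the same day and use the `HH:MM:SS,mmm` form seen in the test log.

```python
from datetime import datetime

# Time-of-day format used by the test log, e.g. "11:35:03,896"
# (assumption: the part after the comma is milliseconds).
LOG_TIME_FORMAT = "%H:%M:%S,%f"

# Timeout quoted in the message above.
TIMEOUT_SECONDS = 360


def elapsed_seconds(start: str, end: str) -> float:
    """Return seconds between two same-day log timestamps (hypothetical helper)."""
    t0 = datetime.strptime(start, LOG_TIME_FORMAT)
    t1 = datetime.strptime(end, LOG_TIME_FORMAT)
    return (t1 - t0).total_seconds()


elapsed = elapsed_seconds("11:35:03,896", "11:42:26,468")
print(f"elapsed {elapsed:.3f}s, over timeout: {elapsed > TIMEOUT_SECONDS}")
```

The two timestamps are 7 minutes 22.572 seconds apart, so the run did last longer than the 360s limit, assuming the log covers only that one test.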
On Thu, 20 Feb 2025 at 11:29, Peter Maydell <peter.maydell@linaro.org> wrote:
>
> On Wed, 19 Feb 2025 at 15:00, Alex Bennée <alex.bennee@linaro.org> wrote:
> >
> > Hi,
> >
> > As I was looking at the native context patches I realised our existing
> > GPU testing is a little sparse. I took the opportunity to split the
> > test from the main virt test and then extend it to exercise the 3
> > current display modes (virgl, virgl+blobs, vulkan).
> >
> > I've added some additional validation to ensure we have the devices we
> > expect before we start. It doesn't currently address the reported
> > clang issues but hopefully it will help narrow down what fails and
> > what works.
>
> Running on my setup with a clang sanitizer build I get
>
> ok 1 test_aarch64_virt_gpu.Aarch64VirtGPUMachine.test_aarch64_virt_with_virgl_blobs_gpu
> ok 2 test_aarch64_virt_gpu.Aarch64VirtGPUMachine.test_aarch64_virt_with_virgl_gpu
>
> and then the third test timed out.

vulkaninfo --summary as requested on irc:

==========
VULKANINFO
==========

Vulkan Instance Version: 1.3.275

Instance Extensions: count = 24
-------------------------------
VK_EXT_acquire_drm_display : extension revision 1
VK_EXT_acquire_xlib_display : extension revision 1
VK_EXT_debug_report : extension revision 10
VK_EXT_debug_utils : extension revision 2
VK_EXT_direct_mode_display : extension revision 1
VK_EXT_display_surface_counter : extension revision 1
VK_EXT_headless_surface : extension revision 1
VK_EXT_surface_maintenance1 : extension revision 1
VK_EXT_swapchain_colorspace : extension revision 4
VK_KHR_device_group_creation : extension revision 1
VK_KHR_display : extension revision 23
VK_KHR_external_fence_capabilities : extension revision 1
VK_KHR_external_memory_capabilities : extension revision 1
VK_KHR_external_semaphore_capabilities : extension revision 1
VK_KHR_get_display_properties2 : extension revision 1
VK_KHR_get_physical_device_properties2 : extension revision 2
VK_KHR_get_surface_capabilities2 : extension revision 1
VK_KHR_portability_enumeration : extension revision 1
VK_KHR_surface : extension revision 25
VK_KHR_surface_protected_capabilities : extension revision 1
VK_KHR_wayland_surface : extension revision 6
VK_KHR_xcb_surface : extension revision 6
VK_KHR_xlib_surface : extension revision 6
VK_LUNARG_direct_driver_loading : extension revision 1

Instance Layers: count = 4
--------------------------
VK_LAYER_INTEL_nullhw INTEL NULL HW 1.1.73 version 1
VK_LAYER_MESA_device_select Linux device selection layer 1.3.211 version 1
VK_LAYER_MESA_overlay Mesa Overlay layer 1.3.211 version 1
VK_LAYER_NV_optimus NVIDIA Optimus layer 1.3.242 version 1

Devices:
========
GPU0:
    apiVersion = 1.3.242
    driverVersion = 535.183.1.0
    vendorID = 0x10de
    deviceID = 0x1cb3
    deviceType = PHYSICAL_DEVICE_TYPE_DISCRETE_GPU
    deviceName = Quadro P400
    driverID = DRIVER_ID_NVIDIA_PROPRIETARY
    driverName = NVIDIA
    driverInfo = 535.183.01
    conformanceVersion = 1.3.5.0
    deviceUUID = 0a44d8af-913b-892f-1603-e76ce29ac9b5
    driverUUID = 526ab2c8-1f4a-5dd0-9559-81dab18f1e08
GPU1:
    apiVersion = 1.3.289
    driverVersion = 0.0.1
    vendorID = 0x10005
    deviceID = 0x0000
    deviceType = PHYSICAL_DEVICE_TYPE_CPU
    deviceName = llvmpipe (LLVM 19.1.1, 256 bits)
    driverID = DRIVER_ID_MESA_LLVMPIPE
    driverName = llvmpipe
    driverInfo = Mesa 24.2.8-1ubuntu1~24.04.1 (LLVM 19.1.1)
    conformanceVersion = 1.3.1.1
    deviceUUID = 6d657361-3234-2e32-2e38-2d3175627500
    driverUUID = 6c6c766d-7069-7065-5555-494400000000

-- PMM
Peter Maydell <peter.maydell@linaro.org> writes:

> On Wed, 19 Feb 2025 at 15:00, Alex Bennée <alex.bennee@linaro.org> wrote:
>>
>> Hi,
>>
>> As I was looking at the native context patches I realised our existing
>> GPU testing is a little sparse. I took the opportunity to split the
>> test from the main virt test and then extend it to exercise the 3
>> current display modes (virgl, virgl+blobs, vulkan).
>>
>> I've added some additional validation to ensure we have the devices we
>> expect before we start. It doesn't currently address the reported
>> clang issues but hopefully it will help narrow down what fails and
>> what works.
>>
>> Once I've built some new buildroot images I'll re-spin with a whole
>> bunch of additional test binaries available.
>
> Running on a non-sanitizer debug build, I found that
> aarch64_virt_with_virgl_gpu hit the timeout. Looking at the
> output the last thing printed is
>
> 2025-02-20 11:46:36,864: [shadow] <default>: FPS: 45 FrameTime: 22.585 ms
>
> That timestamp is 4 minutes into the test run, and the same
> [shadow] test takes over 2 minutes on the with_virgl_blobs_gpu
> test, so it looks like it just hit the 360s timeout and might
> well have finished OK if it had been allowed to keep running.

On my system it takes ~43s to run the plain virgl_gpu test: about 2.5s
to boot the kernel and set up the env, and approx 40s to run through all
of the scenes. The -b:duration=1.0 option limits each of the 33 scenes
to 1s of runtime. I'm guessing something in your setup is stalling the
scenes, so that instead of stopping at its time limit each one takes
more than 1s to recover.

> Actually I'm surprised the other one didn't hit a timeout,
> because its log timestamps show it running from 11:35:03,896
> to 11:42:26,468 which is definitely more than 360s.
>
> Is there a less time-intensive test of the virgl code
> we can use? check-functional already has way too many
> tests that take minutes to run...

I am building a newer rootfs with more testing tools on it so we could
preface with simpler tests and bail early if, say, the drm device node
can't be seen.

That said, I worked quite hard on keeping the runtime below 60s, and the
benefit of the glmark/vkmark tests is that they run through a number of
different scenarios, so hopefully they exercise a range of the API. They
also have the benefit that the end of the run is easy to detect from
stdout, whereas the simpler tests tend to draw a triangle and then loop
forever until you hit Ctrl-C.

>
> -- PMM
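To illustrate why per-scene output on stdout is convenient for a test harness, here is a minimal sketch that parses result lines of the form quoted earlier in the thread (`[shadow] <default>: FPS: 45 FrameTime: 22.585 ms`) and counts how many scenes have reported. The names `SceneResult`, `parse_scene_line`, and `count_completed_scenes` are hypothetical and not part of the QEMU test suite, and the regular expression is inferred from that single quoted line, so other scenes may format their options differently.

```python
import re
from typing import Iterable, NamedTuple, Optional

# Pattern inferred from the one result line quoted in the thread
# (assumption: every scene reports in the same shape).
SCENE_RE = re.compile(
    r"\[(?P<scene>[^\]]+)\]\s+(?P<options>.*?):\s+"
    r"FPS:\s+(?P<fps>\d+)\s+FrameTime:\s+(?P<frame_ms>[\d.]+)\s+ms"
)


class SceneResult(NamedTuple):
    scene: str
    options: str
    fps: int
    frame_ms: float


def parse_scene_line(line: str) -> Optional[SceneResult]:
    """Return the scene result found in a console line, or None if there is none."""
    match = SCENE_RE.search(line)
    if match is None:
        return None
    return SceneResult(
        match["scene"],
        match["options"],
        int(match["fps"]),
        float(match["frame_ms"]),
    )


def count_completed_scenes(lines: Iterable[str]) -> int:
    """Count console lines that carry a scene result."""
    return sum(1 for line in lines if parse_scene_line(line) is not None)


sample = "2025-02-20 11:46:36,864: [shadow] <default>: FPS: 45 FrameTime: 22.585 ms"
print(parse_scene_line(sample))
```

A harness could compare `count_completed_scenes` against the 33 scenes mentioned above to report how far a timed-out run got, rather than only that it timed out.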