Message ID | m3bnd52y4s.fsf@t19.piap.pl (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
> Testing Linux v4.2 + PLL5 DTS patch (for HDMI output with enabled LVDS). > Using mplayer with YUV420 (DRM Xvideo would probably work with packed > UYVY-alike formats but I need YUV420 because H.264 decoder produces it). > The driver is git://ftp.arm.linux.org.uk/~rmk/xf86-video-armada.git, > branch unstable-devel, and it uses > git://ftp.arm.linux.org.uk/~rmk/libdrm-armada.git/. > > IMX DRM Xvideo output: > > Only unscaled video: no color (luminance is good but the color > components are green). The driver doesn't use color information. 1024 x 768 for now. I noticed a strange thing: XvShmPutImage() usually takes (much) less than 4 milliseconds, but every 315 frames, it takes much longer (when started, it takes 1259 frames). The following is with constant image, i.e., not altered between frames. Source attached (one may need to change xv_port variable). Now what? XVideo: using adaptor #0, port 85 Image ID 30323449 1024x768 data_size 1179648 num_planes 3 pitches 1024 512 512 data 0x76b86000 Frame# frame count from the last such event vvvv vvvv 1259 1259: Took 0.665640667 s, avg 0.000530765 s 1574 315: Took 0.501529334 s, avg 0.000743380 s 1889 315: Took 0.496656667 s, avg 0.000882432 s 2204 315: Took 0.496726667 s, avg 0.000981782 s 2519 315: Took 0.496672333 s, avg 0.001056277 s 2834 315: Took 0.496910000 s, avg 0.001114301 s 3149 315: Took 0.497260667 s, avg 0.001160832 s 3464 315: Took 0.499651334 s, avg 0.001199600 s 3779 315: Took 0.495202334 s, avg 0.001230718 s 4094 315: Took 0.499449667 s, avg 0.001258081 s 4409 315: Took 0.525910333 s, avg 0.001287535 s 4724 315: Took 0.510400000 s, avg 0.001309786 s 5039 315: Took 0.541753334 s, avg 0.001335467 s 5354 315: Took 0.537526666 s, avg 0.001357350 s 5669 315: Took 0.542006000 s, avg 0.001377594 s 5984 315: Took 0.539530333 s, avg 0.001395288 s 6299 315: Took 0.541836000 s, avg 0.001411578 s 6614 315: Took 0.541556333 s, avg 0.001426275 s 6929 315: Took 0.543209666 s, avg 0.001439874 s 7244 315: Took 0.540764000 s, avg 0.001451956 s 7559 315: Took 0.512616666 s, avg 0.001459303 s 7874 315: Took 0.541380000 s, avg 0.001469720 s 8189 315: Took 0.535251333 s, avg 0.001478584 s 8504 315: Took 0.536773334 s, avg 0.001486975 s 8819 315: Took 0.521379000 s, avg 0.001493018 s 9134 315: Took 0.492112667 s, avg 0.001495438 s 9449 315: Took 0.616766000 s, avg 0.001510886 s 9764 315: Took 0.492945666 s, avg 0.001512657 s 10079 315: Took 0.492615000 s, avg 0.001514286 s 10394 315: Took 0.493068000 s, avg 0.001515858 s 10709 315: Took 0.496838000 s, avg 0.001517693 s 11024 315: Took 0.516352333 s, avg 0.001521193 s 11339 315: Took 0.500899333 s, avg 0.001523138 s 11654 315: Took 0.494034334 s, avg 0.001524386 s 11969 315: Took 0.492401000 s, avg 0.001525430 s 12284 315: Took 0.505440000 s, avg 0.001527485 s 12599 315: Took 0.511242666 s, avg 0.001529898 s 12914 315: Took 0.493489333 s, avg 0.001530816 s With etnaviv Xvideo, the effect is similar but much worse: > rmk/drm-etnaviv-devel: > > With unscaled video, the only visible problem is tearing in the middle > of the screen (unability to sync with screen refresh). XVideo: using adaptor #1, port 90 Image ID 30323449 1024x768 data_size 1179648 num_planes 3 pitches 1024 512 512 data 0x76b8d000 1259 1259: Took 18.864969669 s, avg 0.014974047 s 1574 315: Took 20.758470002 s, avg 0.025159601 s 1889 315: Took 20.720458336 s, avg 0.031929827 s 2204 315: Took 20.831572669 s, avg 0.036815999 s 2519 315: Took 20.807073336 s, avg 0.040470906 s 2834 315: Took 20.789431003 s, avg 0.043307485 s 3149 315: Took 20.809190003 s, avg 0.045582932 s 3464 315: Took 20.803870003 s, avg 0.047443189 s 3779 315: Took 20.798422002 s, avg 0.048991905 s 4094 315: Took 20.809182336 s, avg 0.050304985 s 4409 315: Took 20.812659336 s, avg 0.051431326 s 4724 315: Took 20.816828669 s, avg 0.052408363 s 5039 315: Took 20.814890002 s, avg 0.053262894 s 5354 315: Took 20.803809003 s, avg 0.054014822 s 5669 315: Took 20.798233669 s, avg 0.054682176 s 5984 315: Took 20.809503336 s, avg 0.055281168 s 6299 315: Took 20.814339002 s, avg 0.055821065 s 6614 315: Took 20.812699003 s, avg 0.056309294 s 6929 315: Took 20.800363670 s, avg 0.056751361 s 7244 315: Took 20.814518669 s, avg 0.057156941 s 7559 315: Took 20.814902002 s, avg 0.057528773 s 7874 315: Took 20.814914336 s, avg 0.057870859 s 8189 315: Took 20.803919669 s, avg 0.058185288 s 8504 315: Took 20.807010336 s, avg 0.058476760 s 8819 315: Took 20.855880669 s, avg 0.058752983 s 9134 315: Took 20.809301669 s, avg 0.059005029 s 9449 315: Took 26.004653003 s, avg 0.059790071 s 9764 315: Took 20.796989002 s, avg 0.059991163 s 10079 315: Took 20.798379670 s, avg 0.060179803 s 10394 315: Took 20.809240335 s, avg 0.060358055 s 10709 315: Took 20.814655003 s, avg 0.060526351 s 11024 315: Took 20.803836002 s, avg 0.060684046 s
On Tue, Sep 15, 2015 at 10:24:55AM +0200, Krzysztof Ha?asa wrote: > 1024 x 768 for now. I noticed a strange thing: XvShmPutImage() usually > takes (much) less than 4 milliseconds, but every 315 frames, it takes > much longer (when started, it takes 1259 frames). The following is with > constant image, i.e., not altered between frames. Source attached (one > may need to change xv_port variable). Testing on Dove and Armada DRM, I've had to make several changes to your program: 1. Don't hard code the Xv port ID (worth mentioning somewhere so people testing it don't spend something like a quarter of an hour trying to debug why it doesn't work.) 2. Added _GNU_SOURCE at the top to get various structs and functions that the program uses. 3. Added a line detailing the GCC flags to be used 4. A description of what the program is doing would be nice. For the record, it appears to try and call XvShmPutImage() as quickly as possible, recording the absolute time between calls, and reporting when it is more than 4ms. 5. Added a call to XSync(display, 1) after the call to XvShmPutImage() to ensure that the XvShmPutImage() gets pushed to the X server in a timely manner, doesn't sit in the applications X queue, and ensure that we wait for the X server to respond. 6. Extended the 4ms timeout to something sane, like 17ms. If we're talking about synchronising to the vertical sync, then the period when we want to complain is when it takes much longer than that. If the display is synchronised to the vertical sync to avoid tearing, then you can only update the display once per vertical sync, and at 60Hz, that's about 17ms. At 50Hz, that's 20ms. Now, running on Dove overlay, I see about 17ms being the average time, which is slightly longer than the vsync. Further analysis with strace of the X server (which pushes things up to 20ms average time): - The X server takes about 1ms between select() indicating that there's an event available, and reading the event. - It takes 14ms from reading the event to the DDX calling the DRM ioctl to display the image - mostly spent memcpy()'ing the image. - 900us to return the response to the X client - The X server then takes about 4ms to get back to calling select(). The big chunk of that is the memcpy() - that's something which can't be eliminated for several reasons: - With XvShm, the image buffer is shared between the X client and the X server. The X server has no control over how the X client manipulates that buffer after the X server has returned from its call - however the displayed image must remain stable and free from manipulation until the next PutImage call. - The X shm buffer can't be passed directly into DRM - DRM needs the image to be displayed to be in DRM buffer objects, which are then wrapped as frame buffers (a frame buffer can contain several buffer objects, one for each plane), and lastly the overlay plane updated with the new framebuffer. Let's not forget that Dove hardware is slower than iMX6, so I think that iMX6 should manage to comfortably achieve the 16.6ms required for frame-by-frame display from your program. Now, as for using the GPU, X server analysis: - Again about 1ms between select() and read() of the event - 1ms to the first WAIT_VBLANK ioctl on the display adapter to read the last VSYNC time - 200us later to map the user pointer - 300us to the WAIT_VBLANK ioctl to wait for the vblank (which takes in this instance 3.5ms to fire) - 130us to first attempt to submit the GPU queue, which returns -EAGAIN after 6.4ms - second attempt takes 9.5ms to complete successfully - 1ms (with intervening SIGALRM handler) to wait for the GPU to finish, which takes 11.5ms - 450us to ask DRM to release the user pointer buffer, which takes 9.4ms - 1ms (with intervening SIGALRM handler) to report completion of the operation - The X server then takes about 4ms to get back to calling select(). So, you can see that using the GPU is a much heavier and complex operation all round. The long delays in mapping and releasing the buffer are of course the DMA mapping/unmapping of the buffer object to ensure that the GPU can see the data in those buffers. What's also notable is that it takes the GPU on the order of 10ms to do the operation - it's actually two operations: a vertical filter blit followed by a horizontal filter blit. None of the Vivante GPUs that I've access to support a single-pass filter blit scaling in both directions (the feature bit for that is clear on both Dove's GC600 and iMX6 GC320 GPUs.) I suspect a Vivante GPU supporting that would take around half the time, or better. Of course, the time each of these operations take scales with the image size, so an image half the width and height (which will be a quarter of the size) will take a quarter of the time, and will be comfortably within 17ms.
Russell King - ARM Linux <linux@arm.linux.org.uk> writes: >> 1024 x 768 for now. I noticed a strange thing: XvShmPutImage() usually >> takes (much) less than 4 milliseconds, but every 315 frames, it takes >> much longer (when started, it takes 1259 frames). The following is with >> constant image, i.e., not altered between frames. Source attached (one >> may need to change xv_port variable). ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > 1. Don't hard code the Xv port ID (worth mentioning somewhere so people > testing it don't spend something like a quarter of an hour trying to > debug why it doesn't work.) Well, I actually mentioned it. So this means the delay was only "grouped" by the X protocol, and it won't be magically faster. > Now, running on Dove overlay, I see about 17ms being the average time, > which is slightly longer than the vsync. > Let's not forget that Dove hardware is slower than iMX6, so I think that > iMX6 should manage to comfortably achieve the 16.6ms required for > frame-by-frame display from your program. Interestingly, i.MX6 seems to achive about 3.2 ms, which is much faster. If not for the missing color planes (and scaling), it would be perfect. (and I only need 30 FPS). > Now, as for using the GPU, X server analysis: > - Again about 1ms between select() and read() of the event > - 1ms to the first WAIT_VBLANK ioctl on the display adapter to read the > last VSYNC time > - 200us later to map the user pointer > - 300us to the WAIT_VBLANK ioctl to wait for the vblank (which takes in this > instance 3.5ms to fire) > - 130us to first attempt to submit the GPU queue, which returns -EAGAIN > after 6.4ms > - second attempt takes 9.5ms to complete successfully > - 1ms (with intervening SIGALRM handler) to wait for the GPU to finish, > which takes 11.5ms > - 450us to ask DRM to release the user pointer buffer, which takes 9.4ms > - 1ms (with intervening SIGALRM handler) to report completion of the operation > - The X server then takes about 4ms to get back to calling select(). Sure, I know it has to be slower. So this takes (on Dove) about 50 ms (at 1024x768), thus is way too slow to display video at this (or higher) resolution and frame rate. On i.MX6q, it takes about 66 - 74 ms, depending on actual clock speed (frequency scaling, up to 1 GHz). Thanks for testing.
On Tue, Sep 15, 2015 at 01:01:25PM +0200, Krzysztof Ha?asa wrote: > Russell King - ARM Linux <linux@arm.linux.org.uk> writes: > > Let's not forget that Dove hardware is slower than iMX6, so I think that > > iMX6 should manage to comfortably achieve the 16.6ms required for > > frame-by-frame display from your program. > > Interestingly, i.MX6 seems to achive about 3.2 ms, which is much faster. > If not for the missing color planes (and scaling), it would be perfect. > (and I only need 30 FPS). That means iMX6 isn't synchronising to the vertical sync, so would be subject to tearing. > > > Now, as for using the GPU, X server analysis: > > - Again about 1ms between select() and read() of the event > > - 1ms to the first WAIT_VBLANK ioctl on the display adapter to read the > > last VSYNC time > > - 200us later to map the user pointer > > - 300us to the WAIT_VBLANK ioctl to wait for the vblank (which takes in this > > instance 3.5ms to fire) > > - 130us to first attempt to submit the GPU queue, which returns -EAGAIN > > after 6.4ms > > - second attempt takes 9.5ms to complete successfully > > - 1ms (with intervening SIGALRM handler) to wait for the GPU to finish, > > which takes 11.5ms > > - 450us to ask DRM to release the user pointer buffer, which takes 9.4ms > > - 1ms (with intervening SIGALRM handler) to report completion of the operation > > - The X server then takes about 4ms to get back to calling select(). > > Sure, I know it has to be slower. > > So this takes (on Dove) about 50 ms (at 1024x768), thus is way too slow > to display video at this (or higher) resolution and frame rate. > On i.MX6q, it takes about 66 - 74 ms, depending on actual clock speed > (frequency scaling, up to 1 GHz). It's interesting that iMX6 takes longer - without analysing what's going on, I'd guess that will be because the GC320's memory bandwidth is throttled compared to Dove. iMX6 seems to have a problem with the bus arbitration in that each master which can access DDR is given a fixed timeslot, whether or not there are any other masters wanting access. Performance measurement with 64-bit DDR has shown that out of iMX6Q,D,DL,S the iMX6DL performs the best for raw IO as there are fewer masters (fewer CPUs and fewer peripherals wanting DDR access), so the timeslots are bigger. I've found that it is very difficult to get iMX6's GPUs to out-perform neon optimised libpixman in terms of performance measurements - and I think the reason for this is because both Neon optimised libpixman and the GPUs are running up against the same bottleneck - the available DDR bandwidth. However, what you do get is CPU offload of that effort - filling one of those non-CPU DDR access timeslots with useful work, freeing up the CPU DDR timeslot to do other tasks. I have no idea how much truth there is to this, this is all based upon performance measurement, and drawing conclusions from those measurements.
Am Montag, den 14.09.2015, 10:39 +0200 schrieb Krzysztof Ha?asa: > Another round of tests, I noticed the new git versions :-) > Testing Linux v4.2 + PLL5 DTS patch (for HDMI output with enabled > LVDS). > Using mplayer with YUV420 (DRM Xvideo would probably work with packed > UYVY-alike formats but I need YUV420 because H.264 decoder produces > it). > The driver is git://ftp.arm.linux.org.uk/~rmk/xf86-video-armada.git, > branch unstable-devel, and it uses > git://ftp.arm.linux.org.uk/~rmk/libdrm-armada.git/. > > IMX DRM Xvideo output: > > Only unscaled video: no color (luminance is good but the color > components are green). The driver doesn't use color information. > I hope this is easily fixable. > There were patches for that some weeks ago already, I would hope they got applied to some tree destinied for upstream. Will look at that and ping people if necessary. > > rmk/drm-etnaviv-devel: > > With unscaled video, the only visible problem is tearing in the > middle > of the screen (unability to sync with screen refresh). > > With scaling, I'm getting horizontal lines (mostly visible when the > scene changes) and some sort of stalls - sometimes two frames are > alternating for few seconds then it goes forward. > > > pengutronix/etnaviv-for-upstream: That is the right branch, containing the changes I sent out for review. It includes a new kernel<->userspace interface, so needs a modified xf86-video-armada to go with it. You can pick that up here: git://git.pengutronix.de/git/lst/xf86-video-armada.git for-rmk Caution: the armada code is WIP and needs further cleanups. It should work though. > No etnaviv Xvideo: > (==) armada(0): Backing store enabled > (==) armada(0): Silken mouse enabled > (EE) armada(0): etnaviv: unable to open: Not a directory > (WW) armada(0): [drm] Vivante initialization failed, running > unaccelerated > > I assume I need a newer something. > > > pengutronix/v4.2/topic/etnaviv-for-rmk: Please don't use that branch. It just there to keep stability for people that already know about it. Regards, Lucas
On Tue, Sep 15, 2015 at 05:53:45PM +0200, Lucas Stach wrote: > It includes a new kernel<->userspace interface, so needs a modified > xf86-video-armada to go with it. You can pick that up here: > git://git.pengutronix.de/git/lst/xf86-video-armada.git for-rmk > > Caution: the armada code is WIP and needs further cleanups. It should > work though. And it needs updating to the latest version - you seem to be missing a whole load of changes, including the changes which update the kernel API. So... you seem to be talking about the kernel API having changed from the version which you picked up years ago, rather than the current version. It's really unclear what you're talking about.
Russell King - ARM Linux <linux@arm.linux.org.uk> writes: >> Interestingly, i.MX6 seems to achive about 3.2 ms, which is much faster. >> If not for the missing color planes (and scaling), it would be perfect. >> (and I only need 30 FPS). > > That means iMX6 isn't synchronising to the vertical sync, so would be > subject to tearing. Right. However the bigger problem ATM is the lack of color, or actually wrong colors (such as solid green or magenta etc). I'm looking if the support can be added easily, this whole IPU stuff is a bit complicated. > It's interesting that iMX6 takes longer - without analysing what's going > on, I'd guess that will be because the GC320's memory bandwidth is > throttled compared to Dove. Well, one would think the bandwidths should be an order of magnitude larger than needed here. The boards I have here (Ventana GW5xxx) are using e.g. 4 * MT41K128M16JT-125 DDR3 chips, for a total of (if configured correctly) 12 GB/s. Now, a single burst read (or write) at 1024x768 32 bit requires 3 MB, for 30 FPS it means 90 MB/s. I realize the buffers may be copied many times, and the timeslots you describe don't make it easier. 66 ms for a frame is rather long, though. I guess, in this case, it would be better to convert YUV420 planar -> YUV422 packed (perhaps with NEON - the packed mode(s) work correctly), and use the more traditional overlay instead of the textured adapter. Or maybe I will make this YUV420 mode work.
Hi Russell, Am Dienstag, den 15.09.2015, 17:36 +0100 schrieb Russell King - ARM Linux: > On Tue, Sep 15, 2015 at 05:53:45PM +0200, Lucas Stach wrote: > > It includes a new kernel<->userspace interface, so needs a modified > > xf86-video-armada to go with it. You can pick that up here: > > git://git.pengutronix.de/git/lst/xf86-video-armada.git for-rmk > > > > Caution: the armada code is WIP and needs further cleanups. It > > should > > work though. > > And it needs updating to the latest version - you seem to be missing > a whole load of changes, including the changes which update the > kernel API. > > So... you seem to be talking about the kernel API having changed from > the version which you picked up years ago, rather than the current > version. It's really unclear what you're talking about. > What do you mean with missing a lot of stuff? The "for-rmk" branch in that repo is your upstream "unstable-devel" branch minus the few XvBO commits with my changes on top. Regards, Lucas
Lucas Stach <l.stach@pengutronix.de> writes: >> Only unscaled video: no color (luminance is good but the color >> components are green). The driver doesn't use color information. >> I hope this is easily fixable. > There were patches for that some weeks ago already, I would hope they > got applied to some tree destinied for upstream. Will look at that and > ping people if necessary. Good, this probably resolves one of main problems.
On Tue, Sep 15, 2015 at 12:53:54PM -0400, Lucas Stach wrote: > Hi Russell, > > Am Dienstag, den 15.09.2015, 17:36 +0100 schrieb Russell King - ARM > Linux: > > On Tue, Sep 15, 2015 at 05:53:45PM +0200, Lucas Stach wrote: > > > It includes a new kernel<->userspace interface, so needs a modified > > > xf86-video-armada to go with it. You can pick that up here: > > > git://git.pengutronix.de/git/lst/xf86-video-armada.git for-rmk > > > > > > Caution: the armada code is WIP and needs further cleanups. It > > > should > > > work though. > > > > And it needs updating to the latest version - you seem to be missing > > a whole load of changes, including the changes which update the > > kernel API. > > > > So... you seem to be talking about the kernel API having changed from > > the version which you picked up years ago, rather than the current > > version. It's really unclear what you're talking about. > > > What do you mean with missing a lot of stuff? The "for-rmk" branch in > that repo is your upstream "unstable-devel" branch minus the few XvBO > commits with my changes on top. Ah, probably because I haven't pushed the stuff out :) Anyway, I've taken two of your patches, reworking the second slightly: 1fec728 etnaviv: fix compilation without DRI2 support 1663382 make sure that system strndup doesn't collide with X.Org server version which are _both_ fixes, and would have been nice for them to have been sent in a timely manner. Looking at "etnaviv: fix XV resize for UYVY source format", this is wrong to the documentation for the GC320. Documentation which was around for the GC320 indicates that the _only_ YUV format it supports as a target is YUY2 with the vertical filter blit. It's very specifically worded to indicate that this is the only YUV target format which is supported. What testing have you done to prove uyvy support, and on which GPUs? "etnaviv: align vximage buffer size to pagesize" I have issues with as well - mapping more memory than was passed to the X server is really not permissible. If we need the image size to be larger, we need to report a larger buffer size - and in fact, that's already done. QueryImageAttributes indicates that the image size is to be rounded up to a page size. The X server will reject smaller buffers at higher levels. So, this patch isn't required. The last patch is your final one, I'll look at that at some point, but for now I'm pushing out the tree as it now stands.
Am Dienstag, den 15.09.2015, 18:04 +0100 schrieb Russell King - ARM Linux: > On Tue, Sep 15, 2015 at 12:53:54PM -0400, Lucas Stach wrote: > > Hi Russell, > > > > Am Dienstag, den 15.09.2015, 17:36 +0100 schrieb Russell King - ARM > > Linux: > > > On Tue, Sep 15, 2015 at 05:53:45PM +0200, Lucas Stach wrote: > > > > It includes a new kernel<->userspace interface, so needs a > > > > modified > > > > xf86-video-armada to go with it. You can pick that up here: > > > > git://git.pengutronix.de/git/lst/xf86-video-armada.git for-rmk > > > > > > > > Caution: the armada code is WIP and needs further cleanups. It > > > > should > > > > work though. > > > > > > And it needs updating to the latest version - you seem to be > > > missing > > > a whole load of changes, including the changes which update the > > > kernel API. > > > > > > So... you seem to be talking about the kernel API having changed > > > from > > > the version which you picked up years ago, rather than the > > > current > > > version. It's really unclear what you're talking about. > > > > > What do you mean with missing a lot of stuff? The "for-rmk" branch > > in > > that repo is your upstream "unstable-devel" branch minus the few > > XvBO > > commits with my changes on top. > > Ah, probably because I haven't pushed the stuff out :) > > Anyway, I've taken two of your patches, reworking the second > slightly: > > 1fec728 etnaviv: fix compilation without DRI2 support > 1663382 make sure that system strndup doesn't collide with X.Org > server version > > which are _both_ fixes, and would have been nice for them to have > been > sent in a timely manner. > Sorry about that. I'll try to single out the fixes more quickly going forward. > Looking at "etnaviv: fix XV resize for UYVY source format", this is > wrong > to the documentation for the GC320. Documentation which was around > for > the GC320 indicates that the _only_ YUV format it supports as a > target > is YUY2 with the vertical filter blit. It's very specifically worded > to > indicate that this is the only YUV target format which is supported. > What testing have you done to prove uyvy support, and on which GPUs? > I've tested this on i.MX6Q hardware with some GStreamer pipelines using the videotestsrc. Without this commit the GPU will write out only zerosin the first step, so the video image will show up as fully green after CSC. I don't think I have that documentation you are talking about. Is this doc under NDA? Chapter "31.4.1.6 Filter BLT" of the i.MX6 RM seems to agree with you, so this commit depends on unspecified behavior. But it fixes a real bug where the GPU writes out only zeros to the intermediate buffer if the source is UYVY, so videos show up as fully green after CSC. So this needs more investigation. > "etnaviv: align vximage buffer size to pagesize" I have issues with > as > well - mapping more memory than was passed to the X server is really > not > permissible. If we need the image size to be larger, we need to > report > a larger buffer size - and in fact, that's already done. > QueryImageAttributes indicates that the image size is to be rounded > up to a page size. The X server will reject smaller buffers at > higher > levels. So, this patch isn't required. > This does not seem to be be working. I've certainly seen GStreamer pushing non-aligned buffers which cause the GEM import to fail. Ignoring the real size may well be a GStreamer bug (I'll look into that code), but the server doesn't seem to reject that. > The last patch is your final one, I'll look at that at some point, > but for now I'm pushing out the tree as it now stands. > If you're not going to look at this too soonish I can rebase that on top of what you just pushed out and mybe clean it up some more. But certainly not this week, as I'm physically away from any test hardware. Regards, Lucas
Lucas Stach <l.stach@pengutronix.de> writes: >> IMX DRM Xvideo output: >> >> Only unscaled video: no color (luminance is good but the color >> components are green). The driver doesn't use color information. >> I hope this is easily fixable. >> > There were patches for that some weeks ago already, I would hope they > got applied to some tree destinied for upstream. Will look at that and > ping people if necessary. BTW: any pointer to the patches?
Am Mittwoch, den 16.09.2015, 09:57 +0200 schrieb Krzysztof Ha?asa: > Lucas Stach <l.stach@pengutronix.de> writes: > > > > IMX DRM Xvideo output: > > > > > > Only unscaled video: no color (luminance is good but the color > > > components are green). The driver doesn't use color information. > > > I hope this is easily fixable. > > > > > There were patches for that some weeks ago already, I would hope > > they > > got applied to some tree destinied for upstream. Will look at that > > and > > ping people if necessary. > > BTW: any pointer to the patches? I just digged a bit and it seems they didn't make it out. I've pinged Philipp about that. Regards, Lucas
Am Dienstag, den 15.09.2015, 18:04 +0100 schrieb Russell King - ARM Linux: > On Tue, Sep 15, 2015 at 12:53:54PM -0400, Lucas Stach wrote: > > Hi Russell, > > > > Am Dienstag, den 15.09.2015, 17:36 +0100 schrieb Russell King - ARM > > Linux: > > > On Tue, Sep 15, 2015 at 05:53:45PM +0200, Lucas Stach wrote: > > > > It includes a new kernel<->userspace interface, so needs a modified > > > > xf86-video-armada to go with it. You can pick that up here: > > > > git://git.pengutronix.de/git/lst/xf86-video-armada.git for-rmk > > > > > > > > Caution: the armada code is WIP and needs further cleanups. It > > > > should > > > > work though. > > > > > > And it needs updating to the latest version - you seem to be missing > > > a whole load of changes, including the changes which update the > > > kernel API. > > > > > > So... you seem to be talking about the kernel API having changed from > > > the version which you picked up years ago, rather than the current > > > version. It's really unclear what you're talking about. > > > > > What do you mean with missing a lot of stuff? The "for-rmk" branch in > > that repo is your upstream "unstable-devel" branch minus the few XvBO > > commits with my changes on top. > > Ah, probably because I haven't pushed the stuff out :) > > Anyway, I've taken two of your patches, reworking the second slightly: > > 1fec728 etnaviv: fix compilation without DRI2 support > 1663382 make sure that system strndup doesn't collide with X.Org server version > > which are _both_ fixes, and would have been nice for them to have been > sent in a timely manner. > > Looking at "etnaviv: fix XV resize for UYVY source format", this is wrong > to the documentation for the GC320. Documentation which was around for > the GC320 indicates that the _only_ YUV format it supports as a target > is YUY2 with the vertical filter blit. It's very specifically worded to > indicate that this is the only YUV target format which is supported. > What testing have you done to prove uyvy support, and on which GPUs? > > "etnaviv: align vximage buffer size to pagesize" I have issues with as > well - mapping more memory than was passed to the X server is really not > permissible. If we need the image size to be larger, we need to report > a larger buffer size - and in fact, that's already done. > QueryImageAttributes indicates that the image size is to be rounded > up to a page size. The X server will reject smaller buffers at higher > levels. So, this patch isn't required. > I've looked at that again and the issue here seems to be that GStreamer is handing us a pointer to a buffer that isn't aligned to a page. The buffers size is properly rounded up to a pagesize, as requested by QueryImageAttributes, but if the buffer start pointer isn't aligned to a page boundary we end up with an non-mappable buffer anyway. Unfortunately there is no obvious way for a driver to request a minimum alignment for the buffer. The only possible fix is for the client to always align the buffer to a page boundary in hopes that this is enough for the hardware to map it directly and allow to skip any unwanted copying. Regards, Lucas
On Mon, Sep 28, 2015 at 04:48:08PM +0200, Lucas Stach wrote: > I've looked at that again and the issue here seems to be that GStreamer > is handing us a pointer to a buffer that isn't aligned to a page. The > buffers size is properly rounded up to a pagesize, as requested by > QueryImageAttributes, but if the buffer start pointer isn't aligned to a > page boundary we end up with an non-mappable buffer anyway. > > Unfortunately there is no obvious way for a driver to request a minimum > alignment for the buffer. The only possible fix is for the client to > always align the buffer to a page boundary in hopes that this is enough > for the hardware to map it directly and allow to skip any unwanted > copying. We can't just "round up" the size. We've no idea whether the buffer came from shmem (for XvShmPutImage), or whether it's part of an internal X buffer (for XvPutImage). In the latter case, we've no idea whether data in the remainder of the page will be read or written by the CPU when, eg, a signal occurs - and X does use signals.
Am Montag, den 28.09.2015, 16:24 +0100 schrieb Russell King - ARM Linux: > On Mon, Sep 28, 2015 at 04:48:08PM +0200, Lucas Stach wrote: > > I've looked at that again and the issue here seems to be that GStreamer > > is handing us a pointer to a buffer that isn't aligned to a page. The > > buffers size is properly rounded up to a pagesize, as requested by > > QueryImageAttributes, but if the buffer start pointer isn't aligned to a > > page boundary we end up with an non-mappable buffer anyway. > > > > Unfortunately there is no obvious way for a driver to request a minimum > > alignment for the buffer. The only possible fix is for the client to > > always align the buffer to a page boundary in hopes that this is enough > > for the hardware to map it directly and allow to skip any unwanted > > copying. > > We can't just "round up" the size. We've no idea whether the buffer > came from shmem (for XvShmPutImage), or whether it's part of an internal > X buffer (for XvPutImage). In the latter case, we've no idea whether > data in the remainder of the page will be read or written by the CPU > when, eg, a signal occurs - and X does use signals. > I'm not talking about the driver rounding up the size. That is obviously (contrary to what my patch did) the wrong thing to do. I was talking of the client (as in VLC, GStreamer, whatever) aligning the buffer to a page boundary. As there is no way for the client to query the alignment restrictions, we may still need a fallback path in the driver, so that if an unaligned buffer comes in we don't do a buffer_from_userptr but actually copy client memory to a new buffer. Regards, Lucas
On Mon, Sep 28, 2015 at 05:40:13PM +0200, Lucas Stach wrote: > Am Montag, den 28.09.2015, 16:24 +0100 schrieb Russell King - ARM Linux: > > On Mon, Sep 28, 2015 at 04:48:08PM +0200, Lucas Stach wrote: > > > I've looked at that again and the issue here seems to be that GStreamer > > > is handing us a pointer to a buffer that isn't aligned to a page. The > > > buffers size is properly rounded up to a pagesize, as requested by > > > QueryImageAttributes, but if the buffer start pointer isn't aligned to a > > > page boundary we end up with an non-mappable buffer anyway. > > > > > > Unfortunately there is no obvious way for a driver to request a minimum > > > alignment for the buffer. The only possible fix is for the client to > > > always align the buffer to a page boundary in hopes that this is enough > > > for the hardware to map it directly and allow to skip any unwanted > > > copying. > > > > We can't just "round up" the size. We've no idea whether the buffer > > came from shmem (for XvShmPutImage), or whether it's part of an internal > > X buffer (for XvPutImage). In the latter case, we've no idea whether > > data in the remainder of the page will be read or written by the CPU > > when, eg, a signal occurs - and X does use signals. > > > I'm not talking about the driver rounding up the size. That is obviously > (contrary to what my patch did) the wrong thing to do. > > I was talking of the client (as in VLC, GStreamer, whatever) aligning > the buffer to a page boundary. As there is no way for the client to > query the alignment restrictions, we may still need a fallback path in > the driver, so that if an unaligned buffer comes in we don't do a > buffer_from_userptr but actually copy client memory to a new buffer. You really do _not_ want to do be copying image data. With large images (1080p) that will consume lots of CPU, and tie up the X server doing not much other than copying data. You might as well manually convert and copy the data to the screen at that point.
Am Montag, den 28.09.2015, 17:50 +0100 schrieb Russell King - ARM Linux: > On Mon, Sep 28, 2015 at 05:40:13PM +0200, Lucas Stach wrote: > > Am Montag, den 28.09.2015, 16:24 +0100 schrieb Russell King - ARM Linux: > > > On Mon, Sep 28, 2015 at 04:48:08PM +0200, Lucas Stach wrote: > > > > I've looked at that again and the issue here seems to be that GStreamer > > > > is handing us a pointer to a buffer that isn't aligned to a page. The > > > > buffers size is properly rounded up to a pagesize, as requested by > > > > QueryImageAttributes, but if the buffer start pointer isn't aligned to a > > > > page boundary we end up with an non-mappable buffer anyway. > > > > > > > > Unfortunately there is no obvious way for a driver to request a minimum > > > > alignment for the buffer. The only possible fix is for the client to > > > > always align the buffer to a page boundary in hopes that this is enough > > > > for the hardware to map it directly and allow to skip any unwanted > > > > copying. > > > > > > We can't just "round up" the size. We've no idea whether the buffer > > > came from shmem (for XvShmPutImage), or whether it's part of an internal > > > X buffer (for XvPutImage). In the latter case, we've no idea whether > > > data in the remainder of the page will be read or written by the CPU > > > when, eg, a signal occurs - and X does use signals. > > > > > I'm not talking about the driver rounding up the size. That is obviously > > (contrary to what my patch did) the wrong thing to do. > > > > I was talking of the client (as in VLC, GStreamer, whatever) aligning > > the buffer to a page boundary. As there is no way for the client to > > query the alignment restrictions, we may still need a fallback path in > > the driver, so that if an unaligned buffer comes in we don't do a > > buffer_from_userptr but actually copy client memory to a new buffer. > > You really do _not_ want to do be copying image data. With large images > (1080p) that will consume lots of CPU, and tie up the X server doing not > much other than copying data. You might as well manually convert and > copy the data to the screen at that point. > So what other possibilities do we have if the client hands us a non page aligned buffer, other than failing the PutImage? Converting the image manually and possibly doubling the amount of bytes to write out (YUV->RGB) will tie up the X server even longer. This all doesn't seem to be a problem as long as the client allocates from XShm, as this seems to hand out page aligned memory. Only if the client mallocs the image buffer we might end up with unaligned buffers. Regards, Lucas
On Tue, Sep 29, 2015 at 10:28:19AM +0200, Lucas Stach wrote: > Am Montag, den 28.09.2015, 17:50 +0100 schrieb Russell King - ARM Linux: > > You really do _not_ want to do be copying image data. With large images > > (1080p) that will consume lots of CPU, and tie up the X server doing not > > much other than copying data. You might as well manually convert and > > copy the data to the screen at that point. > > > So what other possibilities do we have if the client hands us a non page > aligned buffer, other than failing the PutImage? Converting the image > manually and possibly doubling the amount of bytes to write out > (YUV->RGB) will tie up the X server even longer. Are you sure about that? If you memcpy() the buffer, you have to read about 6MB of data, write 6MB of data, flush the caches of that, then ask the GPU to perform two blits (which on my measurements just about fit in one field scan), wait for the blit to finish before then returning to the client. I can tell you that VLC uses memcpy() on images internally, and you can't get away with it for 1080p video. > This all doesn't seem to be a problem as long as the client allocates > from XShm, as this seems to hand out page aligned memory. Only if the > client mallocs the image buffer we might end up with unaligned buffers. I don't see that. How does a malloc()'d buffer get to the X server, other than throuh XvPutImage? If you're using XvPutImage(), you'll be incurring an even _larger_ overhead because the kernel will have to copy the image data as well. AFAIK, you can't pass arbitary malloc()d buffers to sysvipc and therefore via XShm.
Am Dienstag, den 29.09.2015, 09:41 +0100 schrieb Russell King - ARM Linux: > On Tue, Sep 29, 2015 at 10:28:19AM +0200, Lucas Stach wrote: > > Am Montag, den 28.09.2015, 17:50 +0100 schrieb Russell King - ARM Linux: > > > You really do _not_ want to do be copying image data. With large images > > > (1080p) that will consume lots of CPU, and tie up the X server doing not > > > much other than copying data. You might as well manually convert and > > > copy the data to the screen at that point. > > > > > So what other possibilities do we have if the client hands us a non page > > aligned buffer, other than failing the PutImage? Converting the image > > manually and possibly doubling the amount of bytes to write out > > (YUV->RGB) will tie up the X server even longer. > > Are you sure about that? > > If you memcpy() the buffer, you have to read about 6MB of data, write > 6MB of data, flush the caches of that, then ask the GPU to perform two > blits (which on my measurements just about fit in one field scan), wait > for the blit to finish before then returning to the client. > The cache flush can easily be avoided if you copy into a WC buffer. And I don't see why we would need to wait for the blit to finish before returning to the client. If we copy the userdata there is no need to block the client until the GPU is done with the data. Or is there something in the spec I'm not aware of which would mandate such blocking? > I can tell you that VLC uses memcpy() on images internally, and you can't > get away with it for 1080p video. > I'm not disputing that it's a really bad idea if you care about performance. > > This all doesn't seem to be a problem as long as the client allocates > > from XShm, as this seems to hand out page aligned memory. Only if the > > client mallocs the image buffer we might end up with unaligned buffers. > > I don't see that. How does a malloc()'d buffer get to the X server, > other than throuh XvPutImage? If you're using XvPutImage(), you'll > be incurring an even _larger_ overhead because the kernel will have > to copy the image data as well. AFAIK, you can't pass arbitary malloc()d > buffers to sysvipc and therefore via XShm. > Right I'm talking about XvPutImage. I'm not saying it is a good way to hand off your image data to the X server, all I'm saying is that this is a path that applications _might_ use. We can just choose to ignore this for the time being, as it's a non-issue if the client is using XShm. Regards, Lucas
--- a/drivers/base/dma-contiguous.c +++ b/drivers/base/dma-contiguous.c @@ -216,6 +216,7 @@ phys_addr_t dma_get_contiguous_base(struct device *dev) { return cma_get_base(dev_get_cma_area(dev)); } +EXPORT_SYMBOL(dma_get_contiguous_base); /* * Support for reserved memory regions defined in device tree