Message ID: 20231019212133.245155-7-harry.wentland@amd.com (mailing list archive)
State: New, archived
Series: Color Pipeline API w/ VKMS
Thanks for continuing to work on this! On Thu, Oct 19, 2023 at 05:21:22PM -0400, Harry Wentland wrote: > v2: > - Update colorop visualizations to match reality (Sebastian, Alex Hung) > - Updated wording (Pekka) > - Change BYPASS wording to make it non-mandatory (Sebastian) > - Drop cover-letter-like paragraph from COLOR_PIPELINE Plane Property > section (Pekka) > - Use PQ EOTF instead of its inverse in Pipeline Programming example (Melissa) > - Add "Driver Implementer's Guide" section (Pekka) > - Add "Driver Forward/Backward Compatibility" section (Sebastian, Pekka) > > Signed-off-by: Harry Wentland <harry.wentland@amd.com> > Cc: Ville Syrjala <ville.syrjala@linux.intel.com> > Cc: Pekka Paalanen <pekka.paalanen@collabora.com> > Cc: Simon Ser <contact@emersion.fr> > Cc: Harry Wentland <harry.wentland@amd.com> > Cc: Melissa Wen <mwen@igalia.com> > Cc: Jonas Ådahl <jadahl@redhat.com> > Cc: Sebastian Wick <sebastian.wick@redhat.com> > Cc: Shashank Sharma <shashank.sharma@amd.com> > Cc: Alexander Goins <agoins@nvidia.com> > Cc: Joshua Ashton <joshua@froggi.es> > Cc: Michel Dänzer <mdaenzer@redhat.com> > Cc: Aleix Pol <aleixpol@kde.org> > Cc: Xaver Hugl <xaver.hugl@gmail.com> > Cc: Victoria Brekenfeld <victoria@system76.com> > Cc: Sima <daniel@ffwll.ch> > Cc: Uma Shankar <uma.shankar@intel.com> > Cc: Naseer Ahmed <quic_naseer@quicinc.com> > Cc: Christopher Braga <quic_cbraga@quicinc.com> > Cc: Abhinav Kumar <quic_abhinavk@quicinc.com> > Cc: Arthur Grillo <arthurgrillo@riseup.net> > Cc: Hector Martin <marcan@marcan.st> > Cc: Liviu Dudau <Liviu.Dudau@arm.com> > Cc: Sasha McIntosh <sashamcintosh@google.com> > --- > Documentation/gpu/rfc/color_pipeline.rst | 347 +++++++++++++++++++++++ > 1 file changed, 347 insertions(+) > create mode 100644 Documentation/gpu/rfc/color_pipeline.rst > > diff --git a/Documentation/gpu/rfc/color_pipeline.rst b/Documentation/gpu/rfc/color_pipeline.rst > new file mode 100644 > index 000000000000..af5f2ea29116 > --- /dev/null > +++ 
b/Documentation/gpu/rfc/color_pipeline.rst > @@ -0,0 +1,347 @@ > +======================== > +Linux Color Pipeline API > +======================== > + > +What problem are we solving? > +============================ > + > +We would like to support pre- and post-blending complex color > +transformations in display controller hardware in order to allow for > +HW-supported HDR use-cases, as well as to provide support for > +color-managed applications, such as video or image editors. > + > +It is possible to support an HDR output on HW supporting the Colorspace > +and HDR Metadata drm_connector properties, but that requires the > +compositor or application to render and compose the content into one > +final buffer intended for display. Doing so is costly. > + > +Most modern display HW offers various 1D LUTs, 3D LUTs, matrices, and other > +operations to support color transformations. These operations are often > +implemented in fixed-function HW and therefore much more power efficient than > +performing similar operations via shaders or CPU. > + > +We would like to make use of this HW functionality to support complex color > +transformations with no or minimal CPU or shader load. > + > + > +How are other OSes solving this problem? > +======================================== > + > +The most widely supported use-cases regard HDR content, whether video or > +gaming. > + > +Most OSes will specify the source content format (color gamut, encoding transfer > +function, and other metadata, such as max and average light levels) to a driver. > +Drivers will then program their fixed-function HW accordingly to map from a > +source content buffer's space to a display's space. > + > +When fixed-function HW is not available the compositor will assemble a shader to > +ask the GPU to perform the transformation from the source content format to the > +display's format. > + > +A compositor's mapping function and a driver's mapping function are usually > +entirely separate concepts. 
On OSes where a HW vendor has no insight into > +closed-source compositor code, such a vendor will tune their color management > +code to visually match the compositor's. On other OSes, where both mapping > +functions are open to an implementer, they will ensure both mappings match. > + > +This results in mapping algorithm lock-in, meaning that no-one alone can > +experiment with or introduce new mapping algorithms and achieve > +consistent results regardless of which implementation path is taken. > + > +Why is Linux different? > +======================= > + > +Unlike other OSes, where there is one compositor for one or more drivers, on > +Linux we have a many-to-many relationship. Many compositors; many drivers. > +In addition each compositor vendor or community has their own view of how > +color management should be done. This is what makes Linux so beautiful. > + > +This means that a HW vendor can no longer tune their driver to one > +compositor, as tuning it to one could make it look fairly different from > +another compositor's color mapping. > + > +We need a better solution. > + > + > +Descriptive API > +=============== > + > +An API that describes the source and destination colorspaces is a descriptive > +API. It describes the input and output color spaces but does not describe > +how precisely they should be mapped. Such a mapping includes many minute > +design decisions that can greatly affect the look of the final result. > + > +It is not feasible to describe such a mapping with enough detail to ensure the > +same result from each implementation. In fact, these mappings are a very active > +research area. > + > + > +Prescriptive API > +================ > + > +A prescriptive API does not describe the source and destination colorspaces. It > +instead prescribes a recipe for how to manipulate pixel values to arrive at the > +desired outcome. 
> + > +This recipe is generally an ordered list of straightforward operations, > +with clear mathematical definitions, such as 1D LUTs, 3D LUTs, matrices, > +or other operations that can be described in a precise manner. > + > + > +The Color Pipeline API > +====================== > + > +HW color management pipelines can significantly differ between HW > +vendors in terms of availability, ordering, and capabilities of HW > +blocks. This makes a common definition of color management blocks and > +their ordering nigh impossible. Instead we are defining an API that > +allows user space to discover the HW capabilities in a generic manner, > +agnostic of specific drivers and hardware. > + > + > +drm_colorop Object & IOCTLs > +=========================== > + > +To support the definition of color pipelines we define the DRM core > +object type drm_colorop. Individual drm_colorop objects will be chained > +via the NEXT property of a drm_colorop to constitute a color pipeline. > +Each drm_colorop object is unique, i.e., even if multiple color > +pipelines have the same operation they won't share the same drm_colorop > +object to describe that operation. > + > +Note that drivers are not expected to map drm_colorop objects statically > +to specific HW blocks. The mapping of drm_colorop objects is entirely a > +driver-internal detail and can be as dynamic or static as a driver needs > +it to be. See more in the Driver Implementer's Guide section below. > + > +Just like other DRM objects the drm_colorop objects are discovered via > +IOCTLs: > + > +DRM_IOCTL_MODE_GETCOLOROPRESOURCES: This IOCTL is used to retrieve the > +number of all drm_colorop objects. > + > +DRM_IOCTL_MODE_GETCOLOROP: This IOCTL is used to read one drm_colorop. > +It includes the ID for the colorop object, as well as the plane_id of > +the associated plane. All other values should be registered as > +properties. 
> + > +Each drm_colorop has three core properties: > + > +TYPE: The type of transformation, such as > +* enumerated curve > +* custom (uniform) 1D LUT > +* 3x3 matrix > +* 3x4 matrix > +* 3D LUT > +* etc. > + > +Depending on the type of transformation other properties will describe > +more details. > + > +BYPASS: A boolean property that can be used to easily put a block into > +bypass mode. While setting other properties might fail atomic check, > +setting the BYPASS property to true should never fail. The BYPASS It hurts me to say as someone who is going to deal with this in user space but I think we should drop the requirement to never fail setting a pipeline to bypass mode with !ALLOW_MODESET. On IRC there was a discussion with Sima where he explained that atomic checks always check from current state (C) to a new state (B). This doesn't imply B->C will succeed as well. So to make the guarantee possible we'd have to change all drivers to be able to check from arbitrary state A to arbitrary state B and then check both C->B and B->C (or let user space do it). Let's leave this can of worms for another time and then solve it not just for the color pipeline but for any state. > +property is not mandatory for a colorop, as long as the entire pipeline > +can get bypassed by setting the COLOR_PIPELINE on a plane to '0'. > + > +NEXT: The ID of the next drm_colorop in a color pipeline, or 0 if this > +drm_colorop is the last in the chain. 
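The NEXT chaining described above is what lets user space walk a pipeline. A minimal sketch (Python purely as pseudocode; the `COLOROPS` dict is a hypothetical stand-in for objects a real client would read via the proposed DRM_IOCTL_MODE_GETCOLOROP and the property API — nothing here is actual libdrm):

```python
# Hypothetical, simplified model: colorop objects as dicts keyed by object ID.
COLOROPS = {
    42: {"TYPE": "1D enumerated curve", "NEXT": 43},
    43: {"TYPE": "3x4 matrix", "NEXT": 0},
}

def walk_pipeline(first_id, colorops):
    """Follow NEXT pointers from the first colorop until NEXT == 0."""
    chain = []
    cur = first_id
    while cur != 0:
        op = colorops[cur]
        chain.append((cur, op["TYPE"]))
        cur = op["NEXT"]
    return chain

chain = walk_pipeline(42, COLOROPS)
# chain == [(42, "1D enumerated curve"), (43, "3x4 matrix")]
```

A client would compare such a chain against the operations it needs before selecting a pipeline.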
> + > +An example of a drm_colorop object might look like one of these:: > + > + /* 1D enumerated curve */ > + Color operation 42 > + ├─ "TYPE": immutable enum {1D enumerated curve, 1D LUT, 3x3 matrix, 3x4 matrix, 3D LUT, etc.} = 1D enumerated curve > + ├─ "BYPASS": bool {true, false} > + ├─ "CURVE_1D_TYPE": enum {sRGB EOTF, sRGB inverse EOTF, PQ EOTF, PQ inverse EOTF, …} > + └─ "NEXT": immutable color operation ID = 43 > + > + /* custom 4k entry 1D LUT */ > + Color operation 52 > + ├─ "TYPE": immutable enum {1D enumerated curve, 1D LUT, 3x3 matrix, 3x4 matrix, 3D LUT, etc.} = 1D LUT > + ├─ "BYPASS": bool {true, false} > + ├─ "LUT_1D_SIZE": immutable range = 4096 > + ├─ "LUT_1D": blob > + └─ "NEXT": immutable color operation ID = 0 > + > + /* 17^3 3D LUT */ > + Color operation 72 > + ├─ "TYPE": immutable enum {1D enumerated curve, 1D LUT, 3x3 matrix, 3x4 matrix, 3D LUT, etc.} = 3D LUT > + ├─ "BYPASS": bool {true, false} > + ├─ "LUT_3D_SIZE": immutable range = 17 > + ├─ "LUT_3D": blob > + └─ "NEXT": immutable color operation ID = 73 > + > + > +COLOR_PIPELINE Plane Property > +============================= > + > +Color Pipelines are created by a driver and advertised via a new > +COLOR_PIPELINE enum property on each plane. Values of the property > +always include '0', which is the default and means all color processing > +is disabled. Additional values will be the object IDs of the first > +drm_colorop in a pipeline. A driver can create and advertise none, one, > +or more possible color pipelines. A DRM client will select a color > +pipeline by setting the COLOR_PIPELINE property to the respective value. > + > +In the case where drivers have custom support for pre-blending color > +processing, those drivers shall reject atomic commits that try to > +use both the custom color properties and the COLOR_PIPELINE > +property. I think we all agree that we need a CAP even for the pre-blending pipeline anyway because of COLOR_ENCODING etc. 
So this probably should be more general and should say that, with this CAP to expose the color pipeline, any other pre-blending color processing properties need to be removed and all driver-internal pre-blending color processing must be disabled. > + > +An example of a COLOR_PIPELINE property on a plane might look like this:: > + > + Plane 10 > + ├─ "type": immutable enum {Overlay, Primary, Cursor} = Primary > + ├─ … > + └─ "color_pipeline": enum {0, 42, 52} = 0 > + > + > +Color Pipeline Discovery > +======================== > + > +A DRM client wanting color management on a drm_plane will: > + > +1. Read all drm_colorop objects > +2. Get the COLOR_PIPELINE property of the plane > +3. Iterate all COLOR_PIPELINE enum values > +4. For each enum value walk the color pipeline (via the NEXT pointers) > + and see if the available color operations are suitable for the > + desired color management operations > + > +An example of chained properties to define an AMD pre-blending color > +pipeline might look like this:: > + > + Plane 10 > + ├─ "TYPE" (immutable) = Primary > + └─ "COLOR_PIPELINE": enum {0, 44} = 0 > + > + Color operation 44 > + ├─ "TYPE" (immutable) = 1D enumerated curve > + ├─ "BYPASS": bool > + ├─ "CURVE_1D_TYPE": enum {sRGB EOTF, PQ EOTF} = sRGB EOTF > + └─ "NEXT" (immutable) = 45 > + > + Color operation 45 > + ├─ "TYPE" (immutable) = 3x4 Matrix > + ├─ "BYPASS": bool > + ├─ "MATRIX_3_4": blob > + └─ "NEXT" (immutable) = 46 > + > + Color operation 46 > + ├─ "TYPE" (immutable) = 1D enumerated curve > + ├─ "BYPASS": bool > + ├─ "CURVE_1D_TYPE": enum {sRGB Inverse EOTF, PQ Inverse EOTF} = sRGB Inverse EOTF > + └─ "NEXT" (immutable) = 47 > + > + Color operation 47 > + ├─ "TYPE" (immutable) = 1D LUT > + ├─ "LUT_1D_SIZE": immutable range = 4096 > + ├─ "LUT_1D_DATA": blob > + └─ "NEXT" (immutable) = 48 > + > + Color operation 48 > + ├─ "TYPE" (immutable) = 3D LUT > + ├─ "LUT_3D_SIZE" (immutable) = 17 > + ├─ "LUT_3D_DATA": blob > + └─ "NEXT" (immutable) = 49 > + > + Color 
operation 49 > + ├─ "TYPE" (immutable) = 1D enumerated curve > + ├─ "BYPASS": bool > + ├─ "CURVE_1D_TYPE": enum {sRGB EOTF, PQ EOTF} = sRGB EOTF > + └─ "NEXT" (immutable) = 0 > + > + > +Color Pipeline Programming > +========================== > + > +Once a DRM client has found a suitable pipeline it will: > + > +1. Set the COLOR_PIPELINE enum value to the one pointing at the first > + drm_colorop object of the desired pipeline > +2. Set the properties for all drm_colorop objects in the pipeline to the > + desired values, setting BYPASS to true for unused drm_colorop blocks, > + and false for enabled drm_colorop blocks > +3. Perform atomic_check/commit as desired > + > +To configure the pipeline for an HDR10 PQ plane and blending in linear > +space, a compositor might perform an atomic commit with the following > +property values:: > + > + Plane 10 > + └─ "COLOR_PIPELINE" = 42 > + > + Color operation 42 (input CSC) > + └─ "BYPASS" = true > + > + Color operation 44 (DeGamma) > + └─ "BYPASS" = true > + > + Color operation 45 (gamut remap) > + └─ "BYPASS" = true > + > + Color operation 46 (shaper LUT RAM) > + └─ "BYPASS" = true > + > + Color operation 47 (3D LUT RAM) > + └─ "LUT_3D_DATA" = Gamut mapping + tone mapping + night mode > + > + Color operation 48 (blend gamma) > + └─ "CURVE_1D_TYPE" = PQ EOTF > + > + > +Driver Implementer's Guide > +========================== > + > +What does this all mean for driver implementations? As noted above the > +colorops can map to HW directly but don't need to do so. Here are some > +suggestions on how to think about creating your color pipelines: > + > +- Try to expose pipelines that use already defined colorops, even if > + your hardware pipeline is split differently. This allows existing > + userspace to immediately take advantage of the hardware. > + > +- Additionally, try to expose your actual hardware blocks as colorops. 
> + Define new colorop types where you believe they can offer significant > + benefits if userspace learns to program them. > + > +- Avoid defining new colorops for compound operations with very narrow > + scope. If you have a hardware block for a special operation that > + cannot be split further, you can expose that as a new colorop type. > + However, try not to define colorops for "use cases", especially if > + they require you to combine multiple hardware blocks. > + > +- Design new colorops as prescriptive, not descriptive; by the > + mathematical formula, not by the assumed input and output. > + > +A defined colorop type must be deterministic. Its operation can depend > +only on its properties and input and nothing else, allowed error > +tolerance notwithstanding. Maybe add that the exact behavior or formula of the element must be documented entirely. 
Or something more descriptive? > + > + > +References > +========== > + > +1. https://lore.kernel.org/dri-devel/QMers3awXvNCQlyhWdTtsPwkp5ie9bze_hD5nAccFW7a_RXlWjYB7MoUW_8CKLT2bSQwIXVi5H6VULYIxCdgvryZoAoJnC5lZgyK1QWn488=@emersion.fr/ > \ No newline at end of file > -- > 2.42.0 >
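The skip-or-reject rule in the compatibility section above can be sketched as follows (illustrative Python only; `bypass_writable` is a hypothetical flag standing in for "this colorop exposes a writable BYPASS property" — how userspace actually detects that is exactly what the question above is about):

```python
# Types this hypothetical client knows how to program.
KNOWN_TYPES = {"1D enumerated curve", "1D LUT", "3x3 matrix", "3x4 matrix", "3D LUT"}

def pipeline_usable(chain):
    """chain: list of dicts describing colorops in NEXT order."""
    for op in chain:
        if op["type"] in KNOWN_TYPES:
            continue            # client knows how to program this op
        if op.get("bypass_writable", False):
            continue            # unknown, but can be set to BYPASS = true
        return False            # neither known nor bypassable: unusable
    return True                 # usable; otherwise pick another pipeline
```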
On Fri, 20 Oct 2023 16:22:56 +0200 Sebastian Wick <sebastian.wick@redhat.com> wrote: > Thanks for continuing to work on this! > > On Thu, Oct 19, 2023 at 05:21:22PM -0400, Harry Wentland wrote: > > v2: > > - Update colorop visualizations to match reality (Sebastian, Alex Hung) > > - Updated wording (Pekka) > > - Change BYPASS wording to make it non-mandatory (Sebastian) > > - Drop cover-letter-like paragraph from COLOR_PIPELINE Plane Property > > section (Pekka) > > - Use PQ EOTF instead of its inverse in Pipeline Programming example (Melissa) > > - Add "Driver Implementer's Guide" section (Pekka) > > - Add "Driver Forward/Backward Compatibility" section (Sebastian, Pekka) > > ... > > +Driver Forward/Backward Compatibility > > +===================================== > > + > > +As this is uAPI drivers can't regress color pipelines that have been > > +introduced for a given HW generation. New HW generations are free to > > +abandon color pipelines advertised for previous generations. > > +Nevertheless, it can be beneficial to carry support for existing color > > +pipelines forward as those will likely already have support in DRM > > +clients. > > + > > +Introducing new colorops to a pipeline is fine, as long as they can be > > +disabled or are purely informational. DRM clients implementing support > > +for the pipeline can always skip unknown properties as long as they can > > +be confident that doing so will not cause unexpected results. > > + > > +If a new colorop doesn't fall into one of the above categories > > +(bypassable or informational) the modified pipeline would be unusable > > +for user space. In this case a new pipeline should be defined. > > How can user space detect an informational element? Should we just add a > BYPASS property to informational elements, make it read only and set to > true maybe? Or something more descriptive? Read-only BYPASS set to true would be fine by me, I guess. I think we also need a definition of "informational". 
Counter-example 1: a colorop that represents a non-configurable YUV<->RGB conversion. Maybe it determines its operation from FB pixel format. It cannot be set to bypass, it cannot be configured, and it will alter color values. Counter-example 2: image size scaling colorop. It might not be configurable, it is controlled by the plane CRTC_* and SRC_* properties. You still need to understand what it does, so you can arrange the scaling to work correctly. (Do not want to scale an image with PQ-encoded values as Josh demonstrated in XDC.) Counter-example 3: image sampling colorop. Averages FB originated color values to produce a color sample. Again do not want to do this with PQ-encoded values. Thanks, pq
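The scaling/sampling counter-examples can be made concrete with a few lines of arithmetic (SMPTE ST 2084 PQ constants; purely illustrative, not driver code): averaging two PQ-encoded code values yields a very different luminance than averaging in linear light.

```python
# SMPTE ST 2084 (PQ) constants
m1, m2 = 2610 / 16384, 2523 / 4096 * 128
c1, c2, c3 = 3424 / 4096, 2413 / 4096 * 32, 2392 / 4096 * 32

def pq_inv_eotf(y):
    """Linear luminance (1.0 = 10000 nits) -> PQ code value in [0, 1]."""
    p = y ** m1
    return ((c1 + c2 * p) / (1 + c3 * p)) ** m2

def pq_eotf(e):
    """PQ code value in [0, 1] -> linear luminance (1.0 = 10000 nits)."""
    p = e ** (1 / m2)
    return (max(p - c1, 0.0) / (c2 - c3 * p)) ** (1 / m1)

black, white = 0.0, 1.0           # 0 and 10000 nits
linear_avg = (black + white) / 2  # filtering in linear light: 0.5 (5000 nits)
pq_avg = pq_eotf((pq_inv_eotf(black) + pq_inv_eotf(white)) / 2)
# pq_avg lands below 0.01 (under ~100 nits), nowhere near 0.5: filtering
# PQ-encoded values badly darkens the result.
```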
On 2023-10-20 10:57, Pekka Paalanen wrote: > On Fri, 20 Oct 2023 16:22:56 +0200 > Sebastian Wick <sebastian.wick@redhat.com> wrote: > >> Thanks for continuing to work on this! >> >> On Thu, Oct 19, 2023 at 05:21:22PM -0400, Harry Wentland wrote: >>> v2: >>> - Update colorop visualizations to match reality (Sebastian, Alex Hung) >>> - Updated wording (Pekka) >>> - Change BYPASS wording to make it non-mandatory (Sebastian) >>> - Drop cover-letter-like paragraph from COLOR_PIPELINE Plane Property >>> section (Pekka) >>> - Use PQ EOTF instead of its inverse in Pipeline Programming example (Melissa) >>> - Add "Driver Implementer's Guide" section (Pekka) >>> - Add "Driver Forward/Backward Compatibility" section (Sebastian, Pekka) >>> > > ... > >>> +Driver Forward/Backward Compatibility >>> +===================================== >>> + >>> +As this is uAPI drivers can't regress color pipelines that have been >>> +introduced for a given HW generation. New HW generations are free to >>> +abandon color pipelines advertised for previous generations. >>> +Nevertheless, it can be beneficial to carry support for existing color >>> +pipelines forward as those will likely already have support in DRM >>> +clients. >>> + >>> +Introducing new colorops to a pipeline is fine, as long as they can be >>> +disabled or are purely informational. DRM clients implementing support >>> +for the pipeline can always skip unknown properties as long as they can >>> +be confident that doing so will not cause unexpected results. >>> + >>> +If a new colorop doesn't fall into one of the above categories >>> +(bypassable or informational) the modified pipeline would be unusable >>> +for user space. In this case a new pipeline should be defined. >> >> How can user space detect an informational element? Should we just add a >> BYPASS property to informational elements, make it read only and set to >> true maybe? Or something more descriptive? > > Read-only BYPASS set to true would be fine by me, I guess. 
> Don't you mean set to false? An informational element will always do something, so it can't be bypassed. > I think we also need a definition of "informational". > > Counter-example 1: a colorop that represents a non-configurable Not sure what's "counter" for these examples? > YUV<->RGB conversion. Maybe it determines its operation from FB pixel > format. It cannot be set to bypass, it cannot be configured, and it > will alter color values. > > Counter-example 2: image size scaling colorop. It might not be > configurable, it is controlled by the plane CRTC_* and SRC_* > properties. You still need to understand what it does, so you can > arrange the scaling to work correctly. (Do not want to scale an image > with PQ-encoded values as Josh demonstrated in XDC.) > IMO the position of the scaling operation is the thing that's important here as the color pipeline won't define scaling properties. > Counter-example 3: image sampling colorop. Averages FB originated color > values to produce a color sample. Again do not want to do this with > PQ-encoded values. > Wouldn't this only happen during a scaling op? Harry > > Thanks, > pq
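For reference, the non-configurable YUV<->RGB colorop discussed here is, mathematically, just a fixed 3x4 matrix (a 3x3 matrix plus a per-channel offset). A sketch of 3x4-matrix semantics — row-major layout and the identity value are assumptions for illustration, since the proposal doesn't pin down the blob layout:

```python
def apply_3x4(m, pixel):
    """out[i] = m[i][0]*x + m[i][1]*y + m[i][2]*z + m[i][3]
    (row-major layout is an assumption of this sketch)."""
    x, y, z = pixel
    return tuple(m[i][0] * x + m[i][1] * y + m[i][2] * z + m[i][3]
                 for i in range(3))

IDENTITY = [[1, 0, 0, 0],
            [0, 1, 0, 0],
            [0, 0, 1, 0]]

# A fixed YUV->RGB colorop would carry BT.601/BT.709 coefficients in
# m[i][0..2] and the limited-range/chroma offsets folded into m[i][3].
```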
On Fri, 20 Oct 2023 11:23:28 -0400 Harry Wentland <harry.wentland@amd.com> wrote: > On 2023-10-20 10:57, Pekka Paalanen wrote: > > On Fri, 20 Oct 2023 16:22:56 +0200 > > Sebastian Wick <sebastian.wick@redhat.com> wrote: > > > >> Thanks for continuing to work on this! > >> > >> On Thu, Oct 19, 2023 at 05:21:22PM -0400, Harry Wentland wrote: > >>> v2: > >>> - Update colorop visualizations to match reality (Sebastian, Alex Hung) > >>> - Updated wording (Pekka) > >>> - Change BYPASS wording to make it non-mandatory (Sebastian) > >>> - Drop cover-letter-like paragraph from COLOR_PIPELINE Plane Property > >>> section (Pekka) > >>> - Use PQ EOTF instead of its inverse in Pipeline Programming example (Melissa) > >>> - Add "Driver Implementer's Guide" section (Pekka) > >>> - Add "Driver Forward/Backward Compatibility" section (Sebastian, Pekka) > >>> > > > > ... > > > >>> +Driver Forward/Backward Compatibility > >>> +===================================== > >>> + > >>> +As this is uAPI drivers can't regress color pipelines that have been > >>> +introduced for a given HW generation. New HW generations are free to > >>> +abandon color pipelines advertised for previous generations. > >>> +Nevertheless, it can be beneficial to carry support for existing color > >>> +pipelines forward as those will likely already have support in DRM > >>> +clients. > >>> + > >>> +Introducing new colorops to a pipeline is fine, as long as they can be > >>> +disabled or are purely informational. DRM clients implementing support > >>> +for the pipeline can always skip unknown properties as long as they can > >>> +be confident that doing so will not cause unexpected results. > >>> + > >>> +If a new colorop doesn't fall into one of the above categories > >>> +(bypassable or informational) the modified pipeline would be unusable > >>> +for user space. In this case a new pipeline should be defined. > >> > >> How can user space detect an informational element? 
Should we just add a > >> BYPASS property to informational elements, make it read only and set to > >> true maybe? Or something more descriptive? > > > > Read-only BYPASS set to true would be fine by me, I guess. > > > > Don't you mean set to false? An informational element will always do > something, so it can't be bypassed. Yeah, this is why we need a definition. I understand "informational" to not change pixel values in any way. Previously I had some weird idea that scaling doesn't alter color, but of course it may. > > I think we also need a definition of "informational". > > > > Counter-example 1: a colorop that represents a non-configurable > > Not sure what's "counter" for these examples? > > > YUV<->RGB conversion. Maybe it determines its operation from FB pixel > > format. It cannot be set to bypass, it cannot be configured, and it > > will alter color values. > > > > Counter-example 2: image size scaling colorop. It might not be > > configurable, it is controlled by the plane CRTC_* and SRC_* > > properties. You still need to understand what it does, so you can > > arrange the scaling to work correctly. (Do not want to scale an image > > with PQ-encoded values as Josh demonstrated in XDC.) > > > > IMO the position of the scaling operation is the thing that's important > here as the color pipeline won't define scaling properties. > > > Counter-example 3: image sampling colorop. Averages FB originated color > > values to produce a color sample. Again do not want to do this with > > PQ-encoded values. > > > > Wouldn't this only happen during a scaling op? There is certainly some overlap between examples 2 and 3. IIRC SRC_X/Y coordinates can be fractional, which makes nearest vs. bilinear sampling have a difference even if there is no scaling. There is also the question of chroma siting with sub-sampled YUV. I don't know how that actually works, or how it theoretically should work. Thanks, pq
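The point about fractional SRC_X/Y mattering even without scaling can be shown in a couple of lines (an illustrative 1D sketch, not how any hardware samples): at a half-texel offset, nearest sampling picks a single texel while bilinear averages two, so the two filters diverge at a 1:1 scale.

```python
import math

def sample_nearest(row, x):
    return row[round(x)]

def sample_bilinear(row, x):
    i = min(max(math.floor(x), 0), len(row) - 2)
    t = x - i
    return (1 - t) * row[i] + t * row[i + 1]

row = [0.0, 1.0, 0.0, 1.0]
# No scaling, just a fractional source coordinate:
n = sample_nearest(row, 0.5)   # one texel: 0.0 or 1.0
b = sample_bilinear(row, 0.5)  # average of two texels: 0.5
```

And if the texels were PQ-encoded, that averaged 0.5 code value would not mean half the luminance, which is the trap demonstrated at XDC.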
Thank you Harry and all other contributors for your work on this. Responses inline - On Mon, 23 Oct 2023, Pekka Paalanen wrote: > On Fri, 20 Oct 2023 11:23:28 -0400 > Harry Wentland <harry.wentland@amd.com> wrote: > > > On 2023-10-20 10:57, Pekka Paalanen wrote: > > > On Fri, 20 Oct 2023 16:22:56 +0200 > > > Sebastian Wick <sebastian.wick@redhat.com> wrote: > > > > > >> Thanks for continuing to work on this! > > >> > > >> On Thu, Oct 19, 2023 at 05:21:22PM -0400, Harry Wentland wrote: > > >>> v2: > > >>> - Update colorop visualizations to match reality (Sebastian, Alex Hung) > > >>> - Updated wording (Pekka) > > >>> - Change BYPASS wording to make it non-mandatory (Sebastian) > > >>> - Drop cover-letter-like paragraph from COLOR_PIPELINE Plane Property > > >>> section (Pekka) > > >>> - Use PQ EOTF instead of its inverse in Pipeline Programming example (Melissa) > > >>> - Add "Driver Implementer's Guide" section (Pekka) > > >>> - Add "Driver Forward/Backward Compatibility" section (Sebastian, Pekka) > > > > > > ... > > > > > >>> +An example of a drm_colorop object might look like one of these:: > > >>> + > > >>> + /* 1D enumerated curve */ > > >>> + Color operation 42 > > >>> + ├─ "TYPE": immutable enum {1D enumerated curve, 1D LUT, 3x3 matrix, 3x4 matrix, 3D LUT, etc.} = 1D enumerated curve > > >>> + ├─ "BYPASS": bool {true, false} > > >>> + ├─ "CURVE_1D_TYPE": enum {sRGB EOTF, sRGB inverse EOTF, PQ EOTF, PQ inverse EOTF, …} > > >>> + └─ "NEXT": immutable color operation ID = 43 I know these are just examples, but I would also like to suggest the possibility of an "identity" CURVE_1D_TYPE. BYPASS = true might get different results compared to setting an identity in some cases depending on the hardware. See below for more on this, RE: implicit format conversions. 
Although NVIDIA hardware doesn't use a ROM for enumerated curves, it came up in offline discussions that it would nonetheless be helpful to expose enumerated curves in order to hide the vendor-specific complexities of programming segmented LUTs from clients. In that case, we would simply refer to the enumerated curve when calculating/choosing segmented LUT entries. Another thing that came up in offline discussions is that we could use multiple color operations to program a single operation in hardware. As I understand it, AMD has a ROM-defined LUT, followed by a custom 4K entry LUT, followed by an "HDR Multiplier". On NVIDIA we don't have these as separate hardware stages, but we could combine them into a singular LUT in software, such that you can combine e.g. segmented PQ EOTF with night light. One caveat is that you will lose precision from the custom LUT where it overlaps with the linear section of the enumerated curve, but that is unavoidable and shouldn't be an issue in most use-cases. Actually, the current examples in the proposal don't include a multiplier color op, which might be useful. For AMD as above, but also for NVIDIA as the following issue arises: As discussed further below, the NVIDIA "degamma" LUT performs an implicit fixed point to FP16 conversion. In that conversion, what fixed point 0xFFFFFFFF maps to in floating point varies depending on the source content. If it's SDR content, we want the max value in FP16 to be 1.0 (80 nits), subject to a potential boost multiplier if we want SDR content to be brighter. If it's HDR PQ content, we want the max value in FP16 to be 125.0 (10,000 nits). My assumption is that this is also what AMD's "HDR Multiplier" stage is used for, is that correct? From the given enumerated curves, it's not clear how they would map to the above. Should sRGB EOTF have a max FP16 value of 1.0, and the PQ EOTF a max FP16 value of 125.0? 
That may work, but it tends towards the "descriptive" notion of assuming the source content, which may not be accurate in all cases. This is also an issue for the custom 1D LUT, as the blob will need to be converted to FP16 in order to populate our "degamma" LUT. What should the resulting max FP16 value be, given that we no longer have any hint as to the source content? I think a multiplier color op solves all of these issues. Named curves and custom 1D LUTs would by default assume a max FP16 value of 1.0, which would then be adjusted by the multiplier. For 80 nit SDR content, set it to 1, for 400 nit SDR content, set it to 5, for HDR PQ content, set it to 125, etc. > > >>> + > > >>> + /* custom 4k entry 1D LUT */ > > >>> + Color operation 52 > > >>> + ├─ "TYPE": immutable enum {1D enumerated curve, 1D LUT, 3x3 matrix, 3x4 matrix, 3D LUT, etc.} = 1D LUT > > >>> + ├─ "BYPASS": bool {true, false} > > >>> + ├─ "LUT_1D_SIZE": immutable range = 4096 > > >>> + ├─ "LUT_1D": blob > > >>> + └─ "NEXT": immutable color operation ID = 0 > > > > > > ... > > > > > >>> +Driver Forward/Backward Compatibility > > >>> +===================================== > > >>> + > > >>> +As this is uAPI drivers can't regress color pipelines that have been > > >>> +introduced for a given HW generation. New HW generations are free to > > >>> +abandon color pipelines advertised for previous generations. > > >>> +Nevertheless, it can be beneficial to carry support for existing color > > >>> +pipelines forward as those will likely already have support in DRM > > >>> +clients. > > >>> + > > >>> +Introducing new colorops to a pipeline is fine, as long as they can be > > >>> +disabled or are purely informational. DRM clients implementing support > > >>> +for the pipeline can always skip unknown properties as long as they can > > >>> +be confident that doing so will not cause unexpected results. 
> > >>> + > > >>> +If a new colorop doesn't fall into one of the above categories > > >>> +(bypassable or informational) the modified pipeline would be unusable > > >>> +for user space. In this case a new pipeline should be defined. > > >> > > >> How can user space detect an informational element? Should we just add a > > >> BYPASS property to informational elements, make it read only and set to > > >> true maybe? Or something more descriptive? > > > > > > Read-only BYPASS set to true would be fine by me, I guess. > > > > > > > Don't you mean set to false? An informational element will always do > > something, so it can't be bypassed. > > Yeah, this is why we need a definition. I understand "informational" to > not change pixel values in any way. Previously I had some weird idea > that scaling doesn't alter color, but of course it may. On recent hardware, the NVIDIA pre-blending pipeline includes LUTs that do implicit fixed-point to FP16 conversions, and vice versa. For example, the "degamma" LUT towards the beginning of the pipeline implicitly converts from fixed point to FP16, and some of the following operations expect to operate in FP16. As such, if you have a fixed point input and don't bypass those following operations, you *must not* bypass the LUT, even if you are otherwise just programming it with the identity. Conversely, if you have a floating point input, you *must* bypass the LUT. Could informational elements and allowing the exclusion of the BYPASS property be used to convey this information to the client? For example, we could expose one pipeline with the LUT exposed with read-only BYPASS set to false, and sandwich it with informational "Fixed Point" and "FP16" elements to accommodate fixed point input. Then, expose another pipeline with the LUT missing, and an informational "FP16" element in its place to accommodate floating point input. That's just an example; we also have other operations in the pipeline that do similar implicit conversions. 
In these cases we don't want the operations to be bypassed individually, so instead we would expose them as mandatory in some pipelines and missing in others, with informational elements to help inform the client of which to choose. Is that acceptable under the current proposal? Note that in this case, the information just has to do with what format the pixels should be in; it doesn't correspond to any specific operation. So, I'm not sure that BYPASS has any meaning for informational elements in this context. > > > I think we also need a definition of "informational". > > > > > > Counter-example 1: a colorop that represents a non-configurable > > > > Not sure what's "counter" for these examples? > > > > > YUV<->RGB conversion. Maybe it determines its operation from FB pixel > > > format. It cannot be set to bypass, it cannot be configured, and it > > > will alter color values. Would it be reasonable to expose this as a 3x4 matrix with a read-only blob and no BYPASS property? I already brought up a similar idea at the XDC HDR Workshop based on the principle that read-only blobs could be used to express some static pipeline elements without the need to define a new type, but got mixed opinions. I think this demonstrates the principle further, as clients could detect this programmatically instead of having to special-case the informational element. > > > > > > Counter-example 2: image size scaling colorop. It might not be > > > configurable, it is controlled by the plane CRTC_* and SRC_* > > > properties. You still need to understand what it does, so you can > > > arrange the scaling to work correctly. (Do not want to scale an image > > > with PQ-encoded values as Josh demonstrated in XDC.) > > > > > > > IMO the position of the scaling operation is the thing that's important > > here as the color pipeline won't define scaling properties. 
I agree that blending should ideally be done in linear space, and I remember that from Josh's presentation at XDC, but I don't recall the same being said for scaling. In fact, the NVIDIA pre-blending scaler exists in a stage of the pipeline that is meant to be in PQ space (more on this below), and that was found to achieve better results at HDR/SDR boundaries. Of course, this only bolsters the argument that it would be helpful to have an informational "scaler" element to understand at which stage scaling takes place. > > > Counter-example 3: image sampling colorop. Averages FB originated color > > > values to produce a color sample. Again do not want to do this with > > > PQ-encoded values. > > > > > > > Wouldn't this only happen during a scaling op? > > There is certainly some overlap between examples 2 and 3. IIRC SRC_X/Y > coordinates can be fractional, which makes nearest vs. bilinear > sampling have a difference even if there is no scaling. > > There is also the question of chroma siting with sub-sampled YUV. I > don't know how that actually works, or how it theoretically should work. We have some operations in our pipeline that are intended to be static, i.e. a static matrix that converts from RGB to LMS, and later another that converts from LMS to ICtCp. There are even LUTs that are intended to be static, converting from linear to PQ and vice versa. All of this is because the pre-blending scaler and tone mapping operator are intended to operate in ICtCp PQ space. Although the stated LUTs and matrices are intended to be static, they are actually programmable. In offline discussions, it was indicated that it would be helpful to actually expose the programmability, as opposed to exposing them as non-bypassable blocks, as some compositors may have novel uses for them. Despite being programmable, the LUTs are updated in a manner that is less efficient as compared to e.g. the non-static "degamma" LUT. 
Would it be helpful if there were some way to tag operations according to their performance, for example so that clients can prefer a high-performance one when they intend to do an animated transition? I recall from the XDC HDR workshop that this is also an issue with AMD's 3DLUT, where updates can be too slow to animate. Thanks, Alex Goins NVIDIA Linux Driver Team > Thanks, > pq >
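[Editor's note: the multiplier proposal in the message above reduces to simple arithmetic. A minimal sketch, assuming the thread's convention that FP16 1.0 corresponds to 80-nit SDR reference white; the function name is illustrative, not proposed uAPI:]

```python
# Hedged sketch of the proposed multiplier colorop semantics: named curves
# and custom 1D LUTs nominally produce [0.0, 1.0], and a multiplier then
# rescales to the intended peak. 80 nits as the FP16-1.0 reference is an
# assumption taken from the discussion above, not a defined standard here.

SDR_REFERENCE_NITS = 80.0

def multiplier_for(content_peak_nits):
    """Multiplier applied after a [0.0, 1.0]-range curve or 1D LUT."""
    return content_peak_nits / SDR_REFERENCE_NITS

# The values quoted in the thread:
assert multiplier_for(80.0) == 1.0        # 80-nit SDR content
assert multiplier_for(400.0) == 5.0       # 400-nit SDR content
assert multiplier_for(10000.0) == 125.0   # HDR PQ content (10,000 nits)
```

This makes explicit why the multiplier decouples the curve shape from the brightness interpretation: the same [0.0, 1.0] LUT blob serves all three cases.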
On Wed, 25 Oct 2023 15:16:08 -0500 (CDT) Alex Goins <agoins@nvidia.com> wrote: > Thank you Harry and all other contributors for your work on this. Responses > inline - > > On Mon, 23 Oct 2023, Pekka Paalanen wrote: > > > On Fri, 20 Oct 2023 11:23:28 -0400 > > Harry Wentland <harry.wentland@amd.com> wrote: > > > > > On 2023-10-20 10:57, Pekka Paalanen wrote: > > > > On Fri, 20 Oct 2023 16:22:56 +0200 > > > > Sebastian Wick <sebastian.wick@redhat.com> wrote: > > > > > > > >> Thanks for continuing to work on this! > > > >> > > > >> On Thu, Oct 19, 2023 at 05:21:22PM -0400, Harry Wentland wrote: > > > >>> v2: > > > >>> - Update colorop visualizations to match reality (Sebastian, Alex Hung) > > > >>> - Updated wording (Pekka) > > > >>> - Change BYPASS wording to make it non-mandatory (Sebastian) > > > >>> - Drop cover-letter-like paragraph from COLOR_PIPELINE Plane Property > > > >>> section (Pekka) > > > >>> - Use PQ EOTF instead of its inverse in Pipeline Programming example (Melissa) > > > >>> - Add "Driver Implementer's Guide" section (Pekka) > > > >>> - Add "Driver Forward/Backward Compatibility" section (Sebastian, Pekka) > > > > > > > > ... > > > > > > > >>> +An example of a drm_colorop object might look like one of these:: > > > >>> + > > > >>> + /* 1D enumerated curve */ > > > >>> + Color operation 42 > > > >>> + ├─ "TYPE": immutable enum {1D enumerated curve, 1D LUT, 3x3 matrix, 3x4 matrix, 3D LUT, etc.} = 1D enumerated curve > > > >>> + ├─ "BYPASS": bool {true, false} > > > >>> + ├─ "CURVE_1D_TYPE": enum {sRGB EOTF, sRGB inverse EOTF, PQ EOTF, PQ inverse EOTF, …} > > > >>> + └─ "NEXT": immutable color operation ID = 43 > > I know these are just examples, but I would also like to suggest the possibility > of an "identity" CURVE_1D_TYPE. BYPASS = true might get different results > compared to setting an identity in some cases depending on the hardware. See > below for more on this, RE: implicit format conversions. 
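[Editor's note: the identity-vs-BYPASS distinction raised above can be sketched abstractly. The model below is hypothetical, not actual NVIDIA hardware behaviour; it only illustrates how a LUT stage that performs an implicit fixed-point-to-float conversion gives different results for an identity curve than for bypass:]

```python
# Hypothetical model of a LUT stage with an implicit fixed-point -> float
# conversion. With such a conversion, programming the identity curve is NOT
# equivalent to BYPASS = true, which is the point made in the thread.

def lut_stage(raw8, curve, bypass=False):
    if bypass:
        return raw8            # bypass: raw fixed-point value passes through
    x = raw8 / 255.0           # implicit fixed-point -> float normalization
    return curve(x)            # identity curve leaves the float unchanged

identity = lambda x: x

assert lut_stage(255, identity) == 1.0                 # identity: normalized
assert lut_stage(255, identity, bypass=True) == 255    # bypass: no conversion
```

Hence an "identity" CURVE_1D_TYPE would let clients keep the conversion while disabling the curve itself, which BYPASS cannot express on such hardware.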
> > Although NVIDIA hardware doesn't use a ROM for enumerated curves, it came up in > offline discussions that it would nonetheless be helpful to expose enumerated > curves in order to hide the vendor-specific complexities of programming > segmented LUTs from clients. In that case, we would simply refer to the > enumerated curve when calculating/choosing segmented LUT entries. That's a good idea. > Another thing that came up in offline discussions is that we could use multiple > color operations to program a single operation in hardware. As I understand it, > AMD has a ROM-defined LUT, followed by a custom 4K entry LUT, followed by an > "HDR Multiplier". On NVIDIA we don't have these as separate hardware stages, but > we could combine them into a singular LUT in software, such that you can combine > e.g. segmented PQ EOTF with night light. One caveat is that you will lose > precision from the custom LUT where it overlaps with the linear section of the > enumerated curve, but that is unavoidable and shouldn't be an issue in most > use-cases. Indeed. > Actually, the current examples in the proposal don't include a multiplier color > op, which might be useful. For AMD as above, but also for NVIDIA as the > following issue arises: > > As discussed further below, the NVIDIA "degamma" LUT performs an implicit fixed > point to FP16 conversion. In that conversion, what fixed point 0xFFFFFFFF maps > to in floating point varies depending on the source content. If it's SDR > content, we want the max value in FP16 to be 1.0 (80 nits), subject to a > potential boost multiplier if we want SDR content to be brighter. If it's HDR PQ > content, we want the max value in FP16 to be 125.0 (10,000 nits). My assumption > is that this is also what AMD's "HDR Multiplier" stage is used for, is that > correct? It would be against the UAPI design principles to tag content as HDR or SDR. 
What you can do instead is to expose a colorop with a multiplier of 1.0 or 125.0 to match your hardware behaviour, then tell your hardware that the input is SDR or HDR to get the expected multiplier. You will never know what the content actually is, anyway. Of course, if we want to have an arbitrary multiplier colorop that is somewhat standard, as in, exposed by many drivers to ease userspace development, you can certainly use any combination of your hardware features you need to realize the UAPI-prescribed mathematical operation. Since we are talking about floating-point in hardware, a multiplier does not significantly affect precision. In order to mathematically define all colorops, I believe it is necessary to define all colorops in terms of floating-point values (as in math), even if they operate on fixed-point or integer. By this I mean that if the input is 8 bpc unsigned integer pixel format for instance, 0 raw pixel channel value is mapped to 0.0 and 255 is mapped to 1.0, and the color pipeline starts with [0.0, 1.0], not [0, 255] domain. We have to agree on this mapping for all channels on all pixel formats. However, there is a "but" further below. I also propose that quantization range is NOT considered in the raw value mapping, so that we can handle quantization range in colorops explicitly, allowing us to e.g. handle sub-blacks and super-whites when necessary. (These are currently impossible to represent in the legacy color properties, because everything is converted to full range and clipped before any color operations.) > From the given enumerated curves, it's not clear how they would map to the > above. Should sRGB EOTF have a max FP16 value of 1.0, and the PQ EOTF a max FP16 > value of 125.0? 
This is > also an issue for the custom 1D LUT, as the blob will need to be converted to > FP16 in order to populate our "degamma" LUT. What should the resulting max FP16 > value be, given that we no longer have any hint as to the source content? In my opinion, all finite non-negative transfer functions should operate with [0.0, 1.0] domain and [0.0, 1.0] range, and that includes all sRGB, power 2.2, and PQ curves. If we look at BT.2100, there is no such encoding even mentioned where 125.0 would correspond to 10k cd/m². That 125.0 convention already has a built-in assumption about what the color spaces are and what the conversion is aiming to do. IOW, I would say that choice is opinionated from the start. The multiplier in BT.2100 is always 10000. Given that elements like various kinds of look-up tables inherently assume that the domain is [0.0, 1.0] (because it is a table that has a beginning and an end, and the usual convention is that the beginning is zero and the end is one), I think it is best to stick to the [0.0, 1.0] range where possible. If we go out of that range, then we have to define how a LUT would apply in a sensible way. Many TFs are intended to be defined only on [0.0, 1.0] -> [0.0, 1.0]. Some curves, like power 2.2, have a mathematical form that naturally extends outside of that range. Power 2.2 generalizes to >1.0 input values as is, but not for negative input values. If needed for negative input values, it is common to use y = -TF(-x) for x < 0 mirroring. scRGB is the prime example that intentionally uses negative channel values. We can also have negative channel values with limited quantization range, sometimes even intentionally (xvYCC chroma, or PLUGE test sub-blacks). 
scRGB colorimetry can be converted into BT.2020 primaries for example, to avoid saturation induced negative values. Limited quantization range signal could be processed as-is, meaning that the limited range is mapped to [16.0/255, 235.0/255] instead of [0.0, 1.0] or so. But then, we have a complication with transfer functions. > I think a multiplier color op solves all of these issues. Named curves and > custom 1D LUTs would by default assume a max FP16 value of 1.0, which would then > be adjusted by the multiplier. Pretty much. > For 80 nit SDR content, set it to 1, for 400 > nit SDR content, set it to 5, for HDR PQ content, set it to 125, etc. That I think is a another story. > > > >>> + > > > >>> + /* custom 4k entry 1D LUT */ > > > >>> + Color operation 52 > > > >>> + ├─ "TYPE": immutable enum {1D enumerated curve, 1D LUT, 3x3 matrix, 3x4 matrix, 3D LUT, etc.} = 1D LUT > > > >>> + ├─ "BYPASS": bool {true, false} > > > >>> + ├─ "LUT_1D_SIZE": immutable range = 4096 > > > >>> + ├─ "LUT_1D": blob > > > >>> + └─ "NEXT": immutable color operation ID = 0 > > > > > > > > ... > > > > > > > >>> +Driver Forward/Backward Compatibility > > > >>> +===================================== > > > >>> + > > > >>> +As this is uAPI drivers can't regress color pipelines that have been > > > >>> +introduced for a given HW generation. New HW generations are free to > > > >>> +abandon color pipelines advertised for previous generations. > > > >>> +Nevertheless, it can be beneficial to carry support for existing color > > > >>> +pipelines forward as those will likely already have support in DRM > > > >>> +clients. > > > >>> + > > > >>> +Introducing new colorops to a pipeline is fine, as long as they can be > > > >>> +disabled or are purely informational. DRM clients implementing support > > > >>> +for the pipeline can always skip unknown properties as long as they can > > > >>> +be confident that doing so will not cause unexpected results. 
> > > >>> + > > > >>> +If a new colorop doesn't fall into one of the above categories > > > >>> +(bypassable or informational) the modified pipeline would be unusable > > > >>> +for user space. In this case a new pipeline should be defined. > > > >> > > > >> How can user space detect an informational element? Should we just add a > > > >> BYPASS property to informational elements, make it read only and set to > > > >> true maybe? Or something more descriptive? > > > > > > > > Read-only BYPASS set to true would be fine by me, I guess. > > > > > > > > > > Don't you mean set to false? An informational element will always do > > > something, so it can't be bypassed. > > > > Yeah, this is why we need a definition. I understand "informational" to > > not change pixel values in any way. Previously I had some weird idea > > that scaling doesn't alter color, but of course it may. > > On recent hardware, the NVIDIA pre-blending pipeline includes LUTs that do > implicit fixed-point to FP16 conversions, and vice versa. Above, I claimed that the UAPI should be defined in nominal floating-point values, but I wonder, would that work? Would we need to have explicit colorops for converting from raw pixel data values into nominal floating-point in the UAPI? > For example, the "degamma" LUT towards the beginning of the pipeline implicitly > converts from fixed point to FP16, and some of the following operations expect > to operate in FP16. As such, if you have a fixed point input and don't bypass > those following operations, you *must not* bypass the LUT, even if you are > otherwise just programming it with the identity. Conversely, if you have a > floating point input, you *must* bypass the LUT. Interesting. Since the color pipeline is not(?) meant to replace pixel format definitions which already make the difference between fixed and floating point, wouldn't this little detail need to be taken care of by the driver under the hood? 
What if I want to use degamma colorop with a floating-point framebuffer? Simply not possible on this hardware? > Could informational elements and allowing the exclusion of the BYPASS property > be used to convey this information to the client? For example, we could expose > one pipeline with the LUT exposed with read-only BYPASS set to false, and > sandwich it with informational "Fixed Point" and "FP16" elements to accommodate > fixed point input. Then, expose another pipeline with the LUT missing, and an > informational "FP16" element in its place to accommodate floating point input. > > That's just an example; we also have other operations in the pipeline that do > similar implicit conversions. In these cases we don't want the operations to be > bypassed individually, so instead we would expose them as mandatory in some > pipelines and missing in others, with informational elements to help inform the > client of which to choose. Is that acceptable under the current proposal? > > Note that in this case, the information just has to do with what format the > pixels should be in, it doesn't correspond to any specific operation. So, I'm > not sure that BYPASS has any meaning for informational elements in this context. Very good questions. Do we have to expose those conversions in the UAPI to make things work for this hardware? Meaning that we cannot assume all colorops work in nominal floating-point from userspace perspective (perhaps with varying degrees of precision). > > > > I think we also need a definition of "informational". > > > > > > > > Counter-example 1: a colorop that represents a non-configurable > > > > > > Not sure what's "counter" for these examples? > > > > > > > YUV<->RGB conversion. Maybe it determines its operation from FB pixel > > > > format. It cannot be set to bypass, it cannot be configured, and it > > > > will alter color values. > > Would it be reasonable to expose this as a 3x4 matrix with a read-only blob and > no BYPASS property? 
I already brought up a similar idea at the XDC HDR Workshop > based on the principle that read-only blobs could be used to express some static > pipeline elements without the need to define a new type, but got mixed opinions. > I think this demonstrates the principle further, as clients could detect this > programmatically instead of having to special-case the informational element. If the blob depends on the pixel format (i.e. the driver automatically chooses a different blob per pixel format), then I think we would need to expose all the blobs and how they correspond to pixel formats. Otherwise ok, I guess. However, do we want or need to make a color pipeline or colorop conditional on pixel formats? For example, if you use a YUV 4:2:0 type of pixel format, then you must use this pipeline and not any other. Or floating-point type of pixel format. I did not anticipate this before; I assumed that all color pipelines and colorops are independent of the framebuffer pixel format. A specific colorop might have a property that needs to agree with the framebuffer pixel format, but I didn't expect further limitations. "Without the need to define a new type" is something I think we need to consider case by case. I have a hard time giving a general opinion. > > > > > > > > Counter-example 2: image size scaling colorop. It might not be > > > > configurable, it is controlled by the plane CRTC_* and SRC_* > > > > properties. You still need to understand what it does, so you can > > > > arrange the scaling to work correctly. (Do not want to scale an image > > > > with PQ-encoded values as Josh demonstrated in XDC.) > > > > > > > > > > IMO the position of the scaling operation is the thing that's important > > > here as the color pipeline won't define scaling properties. > > I agree that blending should ideally be done in linear space, and I remember > that from Josh's presentation at XDC, but I don't recall the same being said for > scaling. 
In fact, the NVIDIA pre-blending scaler exists in a stage of the > pipeline that is meant to be in PQ space (more on this below), and that was > found to achieve better results at HDR/SDR boundaries. Of course, this only > bolsters the argument that it would be helpful to have an informational "scaler" > element to understand at which stage scaling takes place. Both blending and scaling are fundamentally the same operation: you have two or more source colors (pixels), and you want to compute a weighted average of them following what happens in nature, that is, physics, as that is what humans are used to. Both blending and scaling will suffer from the same problems if the operation is performed on non-light-linear values. The result of the weighted average does not correspond to physics. The problem may be hard to observe with natural imagery, but Josh's example shows it very clearly. Maybe that effect is sometimes useful for some imagery in some use cases, but it is still an accidental side-effect. You might get even better results if you don't rely on accidental side-effects but design a separate operation for the exact goal you have. Mind, by scaling we mean changing image size. Not scaling color values. > > > > Counter-example 3: image sampling colorop. Averages FB originated color > > > > values to produce a color sample. Again do not want to do this with > > > > PQ-encoded values. > > > > > > > > > > Wouldn't this only happen during a scaling op? > > > > There is certainly some overlap between examples 2 and 3. IIRC SRC_X/Y > > coordinates can be fractional, which makes nearest vs. bilinear > > sampling have a difference even if there is no scaling. > > > > There is also the question of chroma siting with sub-sampled YUV. I > > don't know how that actually works, or how it theoretically should work. > > We have some operations in our pipeline that are intended to be static, i.e. 
a > static matrix that converts from RGB to LMS, and later another that converts > from LMS to ICtCp. There are even LUTs that are intended to be static, > converting from linear to PQ and vice versa. All of this is because the > pre-blending scaler and tone mapping operator are intended to operate in ICtCp > PQ space. Although the stated LUTs and matrices are intended to be static, they > are actually programmable. In offline discussions, it was indicated that it > would be helpful to actually expose the programmability, as opposed to exposing > them as non-bypassable blocks, as some compositors may have novel uses for them. Correct. Doing tone-mapping in ICtCp etc. is already policy that userspace might or might not agree with. Exposing static colorops will help usages that adhere to current prevalent standards around very specific use cases. There may be millions of devices needing exactly that processing in their usage, but it is also quite limiting in what one can do with the hardware. > Despite being programmable, the LUTs are updated in a manner that is less > efficient as compared to e.g. the non-static "degamma" LUT. Would it be helpful > if there were some way to tag operations according to their performance, > for example so that clients can prefer a high-performance one when they > intend to do an animated transition? I recall from the XDC HDR workshop > that this is also an issue with AMD's 3DLUT, where updates can be too > slow to animate. I can certainly see such information being useful, but then we need to somehow quantize the performance. What I was left puzzled about after the XDC workshop is whether it is possible to pre-load configurations in the background (slow), and then quickly switch between them? Hardware-wise I mean. Thanks, pq
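[Editor's note: the mirrored extension mentioned in this message (y = -TF(-x) for x < 0) is easy to state in code. A minimal sketch using the power-2.2 curve discussed above; function names are illustrative:]

```python
# Sketch of extending a transfer function defined on non-negative input to
# negative input via odd-symmetry mirroring, y = -TF(-x) for x < 0, as is
# commonly done for e.g. scRGB-style out-of-unit-range channel values.

def power22(x):
    """Power-2.2 curve; generalizes to x > 1.0 as-is, but needs x >= 0."""
    return x ** 2.2

def extended(tf, x):
    """Mirrored extension: odd symmetry for negative inputs."""
    return -tf(-x) if x < 0 else tf(x)

assert extended(power22, 1.0) == 1.0
assert extended(power22, -0.25) == -(0.25 ** 2.2)  # sign is preserved
```

Note this mirroring only applies to curves like power 2.2 whose mathematical form naturally extends; a sampled 1D LUT defined on [0.0, 1.0] has no such canonical extension, which is the complication raised above.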
On Thu, Oct 26, 2023 at 11:57:47AM +0300, Pekka Paalanen wrote: > On Wed, 25 Oct 2023 15:16:08 -0500 (CDT) > Alex Goins <agoins@nvidia.com> wrote: > > > Thank you Harry and all other contributors for your work on this. Responses > > inline - > > > > On Mon, 23 Oct 2023, Pekka Paalanen wrote: > > > > > On Fri, 20 Oct 2023 11:23:28 -0400 > > > Harry Wentland <harry.wentland@amd.com> wrote: > > > > > > > On 2023-10-20 10:57, Pekka Paalanen wrote: > > > > > On Fri, 20 Oct 2023 16:22:56 +0200 > > > > > Sebastian Wick <sebastian.wick@redhat.com> wrote: > > > > > > > > > >> Thanks for continuing to work on this! > > > > >> > > > > >> On Thu, Oct 19, 2023 at 05:21:22PM -0400, Harry Wentland wrote: > > > > >>> v2: > > > > >>> - Update colorop visualizations to match reality (Sebastian, Alex Hung) > > > > >>> - Updated wording (Pekka) > > > > >>> - Change BYPASS wording to make it non-mandatory (Sebastian) > > > > >>> - Drop cover-letter-like paragraph from COLOR_PIPELINE Plane Property > > > > >>> section (Pekka) > > > > >>> - Use PQ EOTF instead of its inverse in Pipeline Programming example (Melissa) > > > > >>> - Add "Driver Implementer's Guide" section (Pekka) > > > > >>> - Add "Driver Forward/Backward Compatibility" section (Sebastian, Pekka) > > > > > > > > > > ... > > > > > > > > > >>> +An example of a drm_colorop object might look like one of these:: > > > > >>> + > > > > >>> + /* 1D enumerated curve */ > > > > >>> + Color operation 42 > > > > >>> + ├─ "TYPE": immutable enum {1D enumerated curve, 1D LUT, 3x3 matrix, 3x4 matrix, 3D LUT, etc.} = 1D enumerated curve > > > > >>> + ├─ "BYPASS": bool {true, false} > > > > >>> + ├─ "CURVE_1D_TYPE": enum {sRGB EOTF, sRGB inverse EOTF, PQ EOTF, PQ inverse EOTF, …} > > > > >>> + └─ "NEXT": immutable color operation ID = 43 > > > > I know these are just examples, but I would also like to suggest the possibility > > of an "identity" CURVE_1D_TYPE. 
BYPASS = true might get different results > > compared to setting an identity in some cases depending on the hardware. See > > below for more on this, RE: implicit format conversions. > > > > Although NVIDIA hardware doesn't use a ROM for enumerated curves, it came up in > > offline discussions that it would nonetheless be helpful to expose enumerated > > curves in order to hide the vendor-specific complexities of programming > > segmented LUTs from clients. In that case, we would simply refer to the > > enumerated curve when calculating/choosing segmented LUT entries. > > That's a good idea. > > > Another thing that came up in offline discussions is that we could use multiple > > color operations to program a single operation in hardware. As I understand it, > > AMD has a ROM-defined LUT, followed by a custom 4K entry LUT, followed by an > > "HDR Multiplier". On NVIDIA we don't have these as separate hardware stages, but > > we could combine them into a singular LUT in software, such that you can combine > > e.g. segmented PQ EOTF with night light. One caveat is that you will lose > > precision from the custom LUT where it overlaps with the linear section of the > > enumerated curve, but that is unavoidable and shouldn't be an issue in most > > use-cases. > > Indeed. > > > Actually, the current examples in the proposal don't include a multiplier color > > op, which might be useful. For AMD as above, but also for NVIDIA as the > > following issue arises: > > > > As discussed further below, the NVIDIA "degamma" LUT performs an implicit fixed > > point to FP16 conversion. In that conversion, what fixed point 0xFFFFFFFF maps > > to in floating point varies depending on the source content. If it's SDR > > content, we want the max value in FP16 to be 1.0 (80 nits), subject to a > > potential boost multiplier if we want SDR content to be brighter. If it's HDR PQ > > content, we want the max value in FP16 to be 125.0 (10,000 nits). 
My assumption > > is that this is also what AMD's "HDR Multiplier" stage is used for, is that > > correct? > > It would be against the UAPI design principles to tag content as HDR or > SDR. What you can do instead is to expose a colorop with a multiplier of > 1.0 or 125.0 to match your hardware behaviour, then tell your hardware > that the input is SDR or HDR to get the expected multiplier. You will > never know what the content actually is, anyway. > > Of course, if we want to have an arbitrary multiplier colorop that is > somewhat standard, as in, exposed by many drivers to ease userspace > development, you can certainly use any combination of your hardware > features you need to realize the UAPI-prescribed mathematical operation. > > Since we are talking about floating-point in hardware, a multiplier > does not significantly affect precision. > > In order to mathematically define all colorops, I believe it is > necessary to define all colorops in terms of floating-point values (as > in math), even if they operate on fixed-point or integer. By this I > mean that if the input is 8 bpc unsigned integer pixel format for > instance, 0 raw pixel channel value is mapped to 0.0 and 255 is mapped > to 1.0, and the color pipeline starts with [0.0, 1.0], not [0, 255] > domain. We have to agree on this mapping for all channels on all pixel > formats. However, there is a "but" further below. > > I also propose that quantization range is NOT considered in the raw > value mapping, so that we can handle quantization range in colorops > explicitly, allowing us to e.g. handle sub-blacks and super-whites when > necessary. (These are currently impossible to represent in the legacy > color properties, because everything is converted to full range and > clipped before any color operations.) > > > From the given enumerated curves, it's not clear how they would map to the > > above. Should sRGB EOTF have a max FP16 value of 1.0, and the PQ EOTF a max FP16 > > value of 125.0? 
That may work, but it tends towards the "descriptive" notion of > > assuming the source content, which may not be accurate in all cases. This is > > also an issue for the custom 1D LUT, as the blob will need to be converted to > > FP16 in order to populate our "degamma" LUT. What should the resulting max FP16 > > value be, given that we no longer have any hint as to the source content? > > In my opinion, all finite non-negative transfer functions should > operate with [0.0, 1.0] domain and [0.0, 1.0] range, and that includes > all sRGB, power 2.2, and PQ curves. > > If we look at BT.2100, there is no such encoding even mentioned where > 125.0 would correspond to 10k cd/m². That 125.0 convention already has > a built-in assumption about what the color spaces are and what the conversion > is aiming to do. IOW, I would say that choice is opinionated from the > start. The multiplier in BT.2100 is always 10000. > > Given that elements like various kinds of look-up tables inherently > assume that the domain is [0.0, 1.0] (because it is a table that > has a beginning and an end, and the usual convention is that the > beginning is zero and the end is one), I think it is best to stick to > the [0.0, 1.0] range where possible. If we go out of that range, then > we have to define how a LUT would apply in a sensible way. > > Many TFs are intended to be defined only on [0.0, 1.0] -> [0.0, 1.0]. > Some curves, like power 2.2, have a mathematical form that naturally > extends outside of that range. Power 2.2 generalizes to >1.0 input > values as is, but not for negative input values. If needed for negative > input values, it is common to use y = -TF(-x) for x < 0 mirroring. > > scRGB is the prime example that intentionally uses negative channel > values. We can also have negative channel values with limited > quantization range, sometimes even intentionally (xvYCC chroma, or > PLUGE test sub-blacks). 
Out-of-unit-range values can also appear as a > side-effect of signal processing, and they should not get clipped > prematurely. This is a challenge for colorops that fundamentally cannot > handle out-of-unit-range values. > > There are various workarounds. scRGB colorimetry can be converted into > BT.2020 primaries for example, to avoid saturation induced negative > values. Limited quantization range signal could be processed as-is, > meaning that the limited range is mapped to [16.0/255, 235.0/255] > instead of [0.0, 1.0] or so. But then, we have a complication with > transfer functions. > > > I think a multiplier color op solves all of these issues. Named curves and > > custom 1D LUTs would by default assume a max FP16 value of 1.0, which would then > > be adjusted by the multiplier. > > Pretty much. > > > For 80 nit SDR content, set it to 1, for 400 > > nit SDR content, set it to 5, for HDR PQ content, set it to 125, etc. > > That I think is another story. > > > > > >>> + > > > > >>> + /* custom 4k entry 1D LUT */ > > > > >>> + Color operation 52 > > > > >>> + ├─ "TYPE": immutable enum {1D enumerated curve, 1D LUT, 3x3 matrix, 3x4 matrix, 3D LUT, etc.} = 1D LUT > > > > >>> + ├─ "BYPASS": bool {true, false} > > > > >>> + ├─ "LUT_1D_SIZE": immutable range = 4096 > > > > >>> + ├─ "LUT_1D": blob > > > > >>> + └─ "NEXT": immutable color operation ID = 0 > > > > > > > > > > ... > > > > > > > > > >>> +Driver Forward/Backward Compatibility > > > > >>> +===================================== > > > > >>> + > > > > >>> +As this is uAPI drivers can't regress color pipelines that have been > > > > >>> +introduced for a given HW generation. New HW generations are free to > > > > >>> +abandon color pipelines advertised for previous generations. > > > > >>> +Nevertheless, it can be beneficial to carry support for existing color > > > > >>> +pipelines forward as those will likely already have support in DRM > > > > >>> +clients.
> > > > >>> + > > > > >>> +Introducing new colorops to a pipeline is fine, as long as they can be > > > > >>> +disabled or are purely informational. DRM clients implementing support > > > > >>> +for the pipeline can always skip unknown properties as long as they can > > > > >>> +be confident that doing so will not cause unexpected results. > > > > >>> + > > > > >>> +If a new colorop doesn't fall into one of the above categories > > > > >>> +(bypassable or informational) the modified pipeline would be unusable > > > > >>> +for user space. In this case a new pipeline should be defined. > > > > >> > > > > >> How can user space detect an informational element? Should we just add a > > > > >> BYPASS property to informational elements, make it read only and set to > > > > >> true maybe? Or something more descriptive? > > > > > > > > > > Read-only BYPASS set to true would be fine by me, I guess. > > > > > > > > > > > > > Don't you mean set to false? An informational element will always do > > > > something, so it can't be bypassed. > > > > > > Yeah, this is why we need a definition. I understand "informational" to > > > not change pixel values in any way. Previously I had some weird idea > > > that scaling doesn't alter color, but of course it may. > > > > On recent hardware, the NVIDIA pre-blending pipeline includes LUTs that do > > implicit fixed-point to FP16 conversions, and vice versa. > > Above, I claimed that the UAPI should be defined in nominal > floating-point values, but I wonder, would that work? Would we need to > have explicit colorops for converting from raw pixel data values into > nominal floating-point in the UAPI? > > > For example, the "degamma" LUT towards the beginning of the pipeline implicitly > > converts from fixed point to FP16, and some of the following operations expect > > to operate in FP16. 
As such, if you have a fixed point input and don't bypass > > those following operations, you *must not* bypass the LUT, even if you are > > otherwise just programming it with the identity. Conversely, if you have a > > floating point input, you *must* bypass the LUT. > > Interesting. Since the color pipeline is not(?) meant to replace pixel > format definitions which already make the difference between fixed and > floating point, wouldn't this little detail need to be taken care of by > the driver under the hood? > > What if I want to use degamma colorop with a floating-point > framebuffer? Simply not possible on this hardware? > > > Could informational elements and allowing the exclusion of the BYPASS property > > be used to convey this information to the client? For example, we could expose > > one pipeline with the LUT exposed with read-only BYPASS set to false, and > > sandwich it with informational "Fixed Point" and "FP16" elements to accommodate > > fixed point input. Then, expose another pipeline with the LUT missing, and an > > informational "FP16" element in its place to accommodate floating point input. > > > > That's just an example; we also have other operations in the pipeline that do > > similar implicit conversions. In these cases we don't want the operations to be > > bypassed individually, so instead we would expose them as mandatory in some > > pipelines and missing in others, with informational elements to help inform the > > client of which to choose. Is that acceptable under the current proposal? > > > > Note that in this case, the information just has to do with what format the > > pixels should be in, it doesn't correspond to any specific operation. So, I'm > > not sure that BYPASS has any meaning for informational elements in this context. > > Very good questions. Do we have to expose those conversions in the UAPI > to make things work for this hardware? 
Meaning that we cannot assume all > colorops work in nominal floating-point from userspace perspective > (perhaps with varying degrees of precision). I had this in my original proposal I think (maybe I only thought about it, not sure). We really should figure this one out. Can we get away with normalized [0,1] fp as a user space abstraction or not? > > > > > > I think we also need a definition of "informational". > > > > > > > > > > Counter-example 1: a colorop that represents a non-configurable > > > > > > > > Not sure what's "counter" for these examples? > > > > > > > > > YUV<->RGB conversion. Maybe it determines its operation from FB pixel > > > > > format. It cannot be set to bypass, it cannot be configured, and it > > > > > will alter color values. > > Would it be reasonable to expose this as a 3x4 matrix with a read-only blob and > no BYPASS property? I already brought up a similar idea at the XDC HDR Workshop > based on the principle that read-only blobs could be used to express some static > pipeline elements without the need to define a new type, but got mixed opinions. > I think this demonstrates the principle further, as clients could detect this > programmatically instead of having to special-case the informational element. > I'm all for exposing fixed color ops but I suspect that most of those follow some standard and in those cases instead of exposing the matrix values one should prefer to expose a named matrix (e.g. BT.601, BT.709, BT.2020). As a general rule: always expose the highest level description. Going from a name to exact values is trivial, going from values to a name is much harder. > If the blob depends on the pixel format (i.e. the driver automatically > chooses a different blob per pixel format), then I think we would need > to expose all the blobs and how they correspond to pixel formats. > Otherwise ok, I guess. > > However, do we want or need to make a color pipeline or colorop > conditional on pixel formats?
For example, if you use a YUV 4:2:0 type > of pixel format, then you must use this pipeline and not any other. Or > floating-point type of pixel format. I did not anticipate this before, > I assumed that all color pipelines and colorops are independent of the > framebuffer pixel format. A specific colorop might have a property that > needs to agree with the framebuffer pixel format, but I didn't expect > further limitations. We could simply fail commits when the pipeline and pixel format don't work together. We'll probably need some kind of ingress no-op node anyway and maybe could list pixel formats there if required to make it easier to find a working configuration. > "Without the need to define a new type" is something I think we need to > consider case by case. I have a hard time giving a general opinion. > > > > > > > > > > > Counter-example 2: image size scaling colorop. It might not be > > > > > configurable, it is controlled by the plane CRTC_* and SRC_* > > > > > properties. You still need to understand what it does, so you can > > > > > arrange the scaling to work correctly. (Do not want to scale an image > > > > > with PQ-encoded values as Josh demonstrated in XDC.) > > > > > > > > > > > > > IMO the position of the scaling operation is the thing that's important > > > > here as the color pipeline won't define scaling properties. > > > > I agree that blending should ideally be done in linear space, and I remember > > that from Josh's presentation at XDC, but I don't recall the same being said for > > scaling. In fact, the NVIDIA pre-blending scaler exists in a stage of the > > pipeline that is meant to be in PQ space (more on this below), and that was > > found to achieve better results at HDR/SDR boundaries. Of course, this only > > bolsters the argument that it would be helpful to have an informational "scaler" > > element to understand at which stage scaling takes place. 
> > Both blending and scaling are fundamentally the same operation: you > have two or more source colors (pixels), and you want to compute a > weighted average of them following what happens in nature, that is, > physics, as that is what humans are used to. > > Both blending and scaling will suffer from the same problems if the > operation is performed on not light-linear values. The result of the > weighted average does not correspond to physics. > > The problem may be hard to observe with natural imagery, but Josh's > example shows it very clearly. Maybe that effect is sometimes useful > for some imagery in some use cases, but it is still an accidental > side-effect. You might get even better results if you don't rely on > accidental side-effects but design a separate operation for the exact > goal you have. > > Mind, by scaling we mean changing image size. Not scaling color values. > > > > > > Counter-example 3: image sampling colorop. Averages FB originated color > > > > > values to produce a color sample. Again do not want to do this with > > > > > PQ-encoded values. > > > > > > > > > > > > > Wouldn't this only happen during a scaling op? > > > > > > There is certainly some overlap between examples 2 and 3. IIRC SRC_X/Y > > > coordinates can be fractional, which makes nearest vs. bilinear > > > sampling have a difference even if there is no scaling. > > > > > > There is also the question of chroma siting with sub-sampled YUV. I > > > don't know how that actually works, or how it theoretically should work. > > > > We have some operations in our pipeline that are intended to be static, i.e. a > > static matrix that converts from RGB to LMS, and later another that converts > > from LMS to ICtCp. There are even LUTs that are intended to be static, > > converting from linear to PQ and vice versa. All of this is because the > > pre-blending scaler and tone mapping operator are intended to operate in ICtCp > > PQ space. 
Although the stated LUTs and matrices are intended to be static, they > > are actually programmable. In offline discussions, it was indicated that it > > would be helpful to actually expose the programmability, as opposed to exposing > > them as non-bypassable blocks, as some compositors may have novel uses for them. > > Correct. Doing tone-mapping in ICtCp etc. are already policy that > userspace might or might not agree with. > > Exposing static colorops will help usages that adhere to current > prevalent standards around very specific use cases. There may be > millions of devices needing exactly that processing in their usage, but > it is also quite limiting in what one can do with the hardware. > > > Despite being programmable, the LUTs are updated in a manner that is less > > efficient as compared to e.g. the non-static "degamma" LUT. Would it be helpful > > if there was some way to tag operations according to their performance, > > for example so that clients can prefer a high performance one when they > > intend to do an animated transition? I recall from the XDC HDR workshop > > that this is also an issue with AMD's 3DLUT, where updates can be too > > slow to animate. > > I can certainly see such information being useful, but then we need to > somehow quantize the performance. > > What I was left puzzled about after the XDC workshop is that is it > possible to pre-load configurations in the background (slow), and then > quickly switch between them? Hardware-wise I mean. We could define that pipelines with a lower ID are to be preferred over higher IDs. The issue is that if programming a pipeline becomes too slow to be useful it probably should just not be made available to user space. The prepare-commit idea for blob properties would help to make the pipelines usable again, but until then it's probably a good idea to just not expose those pipelines. > > > Thanks, > pq
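To make the value-mapping convention under discussion concrete, the raw-value mapping and an arbitrary multiplier colorop could be modeled as below. This is only an illustrative sketch; the function names are made up and nothing here is proposed UAPI:

```python
def normalize_raw(raw: int, bpc: int = 8) -> float:
    """Map a raw unsigned integer code value to the nominal [0.0, 1.0]
    domain: 0 -> 0.0 and 2^bpc - 1 -> 1.0. Quantization range is
    deliberately NOT applied here; limited-range handling would be an
    explicit colorop later in the pipeline."""
    return raw / float((1 << bpc) - 1)

def multiplier_colorop(value: float, factor: float) -> float:
    """An arbitrary multiplier colorop: out = in * factor."""
    return value * factor

# A hardware convention where FP16 125.0 corresponds to 10,000 cd/m^2
# could be realized by a 125.0 multiplier after decoding PQ, while
# SDR-style content keeps factor = 1.0:
nominal = normalize_raw(255)                 # 8 bpc peak -> 1.0
scaled = multiplier_colorop(nominal, 125.0)  # -> 125.0
```

The point of keeping the multiplier as its own colorop is exactly the one made above: the transfer function stays [0.0, 1.0] -> [0.0, 1.0], and the scaling convention is chosen independently by userspace.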
On Thu, 26 Oct 2023, Sebastian Wick wrote: > On Thu, Oct 26, 2023 at 11:57:47AM +0300, Pekka Paalanen wrote: > > On Wed, 25 Oct 2023 15:16:08 -0500 (CDT) > > Alex Goins <agoins@nvidia.com> wrote: > > > > > Thank you Harry and all other contributors for your work on this. Responses > > > inline - > > > > > > On Mon, 23 Oct 2023, Pekka Paalanen wrote: > > > > > > > On Fri, 20 Oct 2023 11:23:28 -0400 > > > > Harry Wentland <harry.wentland@amd.com> wrote: > > > > > > > > > On 2023-10-20 10:57, Pekka Paalanen wrote: > > > > > > On Fri, 20 Oct 2023 16:22:56 +0200 > > > > > > Sebastian Wick <sebastian.wick@redhat.com> wrote: > > > > > > > > > > > >> Thanks for continuing to work on this! > > > > > >> > > > > > >> On Thu, Oct 19, 2023 at 05:21:22PM -0400, Harry Wentland wrote: > > > > > >>> v2: > > > > > >>> - Update colorop visualizations to match reality (Sebastian, Alex Hung) > > > > > >>> - Updated wording (Pekka) > > > > > >>> - Change BYPASS wording to make it non-mandatory (Sebastian) > > > > > >>> - Drop cover-letter-like paragraph from COLOR_PIPELINE Plane Property > > > > > >>> section (Pekka) > > > > > >>> - Use PQ EOTF instead of its inverse in Pipeline Programming example (Melissa) > > > > > >>> - Add "Driver Implementer's Guide" section (Pekka) > > > > > >>> - Add "Driver Forward/Backward Compatibility" section (Sebastian, Pekka) > > > > > > > > > > > > ... 
> > > > > > > > > > > >>> +An example of a drm_colorop object might look like one of these:: > > > > > >>> + > > > > > >>> + /* 1D enumerated curve */ > > > > > >>> + Color operation 42 > > > > > >>> + ├─ "TYPE": immutable enum {1D enumerated curve, 1D LUT, 3x3 matrix, 3x4 matrix, 3D LUT, etc.} = 1D enumerated curve > > > > > >>> + ├─ "BYPASS": bool {true, false} > > > > > >>> + ├─ "CURVE_1D_TYPE": enum {sRGB EOTF, sRGB inverse EOTF, PQ EOTF, PQ inverse EOTF, …} > > > > > >>> + └─ "NEXT": immutable color operation ID = 43 > > > > > > I know these are just examples, but I would also like to suggest the possibility > > > of an "identity" CURVE_1D_TYPE. BYPASS = true might get different results > > > compared to setting an identity in some cases depending on the hardware. See > > > below for more on this, RE: implicit format conversions. > > > > > > Although NVIDIA hardware doesn't use a ROM for enumerated curves, it came up in > > > offline discussions that it would nonetheless be helpful to expose enumerated > > > curves in order to hide the vendor-specific complexities of programming > > > segmented LUTs from clients. In that case, we would simply refer to the > > > enumerated curve when calculating/choosing segmented LUT entries. > > > > That's a good idea. > > > > > Another thing that came up in offline discussions is that we could use multiple > > > color operations to program a single operation in hardware. As I understand it, > > > AMD has a ROM-defined LUT, followed by a custom 4K entry LUT, followed by an > > > "HDR Multiplier". On NVIDIA we don't have these as separate hardware stages, but > > > we could combine them into a singular LUT in software, such that you can combine > > > e.g. segmented PQ EOTF with night light. One caveat is that you will lose > > > precision from the custom LUT where it overlaps with the linear section of the > > > enumerated curve, but that is unavoidable and shouldn't be an issue in most > > > use-cases. > > > > Indeed. 
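Combining an enumerated curve with a custom 1D LUT into a single hardware LUT, as suggested above, amounts to sampling the composition of the two functions. A rough sketch, using a pure power-2.2 curve as a stand-in for a real enumerated/segmented curve and linear interpolation into the custom LUT (names and sizes are illustrative only):

```python
def compose_curve_and_lut(curve, custom_lut, hw_size):
    """Build one hardware LUT of hw_size entries that applies `curve`
    first, then `custom_lut` (floats sampled uniformly on [0, 1],
    looked up with linear interpolation). Precision is lost where the
    custom LUT is sparse relative to the curve, as noted above."""
    last = len(custom_lut) - 1

    def lut_lookup(x):
        x = min(max(x, 0.0), 1.0)  # clamp to the LUT's [0, 1] domain
        pos = x * last
        lo = int(pos)
        hi = min(lo + 1, last)
        frac = pos - lo
        return custom_lut[lo] * (1.0 - frac) + custom_lut[hi] * frac

    return [lut_lookup(curve(i / (hw_size - 1))) for i in range(hw_size)]

# Power 2.2 standing in for an enumerated curve, identity custom LUT:
identity = [i / 4095 for i in range(4096)]
hw_lut = compose_curve_and_lut(lambda x: x ** 2.2, identity, 1024)
```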
> > > > > Actually, the current examples in the proposal don't include a multiplier color > > > op, which might be useful. For AMD as above, but also for NVIDIA as the > > > following issue arises: > > > > > > As discussed further below, the NVIDIA "degamma" LUT performs an implicit fixed > > > point to FP16 conversion. In that conversion, what fixed point 0xFFFFFFFF maps > > > to in floating point varies depending on the source content. If it's SDR > > > content, we want the max value in FP16 to be 1.0 (80 nits), subject to a > > > potential boost multiplier if we want SDR content to be brighter. If it's HDR PQ > > > content, we want the max value in FP16 to be 125.0 (10,000 nits). My assumption > > > is that this is also what AMD's "HDR Multiplier" stage is used for, is that > > > correct? > > > > It would be against the UAPI design principles to tag content as HDR or > > SDR. What you can do instead is to expose a colorop with a multiplier of > > 1.0 or 125.0 to match your hardware behaviour, then tell your hardware > > that the input is SDR or HDR to get the expected multiplier. You will > > never know what the content actually is, anyway. Right, I didn't mean to suggest that we should tag content as HDR or SDR in the UAPI, just relating to the end result in the pipe, ultimately it would be determined by the multiplier color op. > > > > Of course, if we want to have an arbitrary multiplier colorop that is > > somewhat standard, as in, exposed by many drivers to ease userspace > > development, you can certainly use any combination of your hardware > > features you need to realize the UAPI prescribed mathematical operation. > > > > Since we are talking about floating-point in hardware, a multiplier > > does not significantly affect precision. > > > > In order to mathematically define all colorops, I believe it is > > necessary to define all colorops in terms of floating-point values (as > > in math), even if they operate on fixed-point or integer.
By this I > > mean that if the input is 8 bpc unsigned integer pixel format for > > instance, 0 raw pixel channel value is mapped to 0.0 and 255 is mapped > > to 1.0, and the color pipeline starts with [0.0, 1.0], not [0, 255] > > domain. We have to agree on this mapping for all channels on all pixel > > formats. However, there is a "but" further below. I think this makes sense insofar as how we interact with the UAPI, and that's basically how fixed point works for us anyway. However, relating to your "but", it doesn't avoid the issue with hardware expectations about pixel formats since it doesn't change the underlying pixel format. > > > > I also propose that quantization range is NOT considered in the raw > > value mapping, so that we can handle quantization range in colorops > > explicitly, allowing us to e.g. handle sub-blacks and super-whites when > > necessary. (These are currently impossible to represent in the legacy > > color properties, because everything is converted to full range and > > clipped before any color operations.) > > > > > From the given enumerated curves, it's not clear how they would map to the > > > above. Should sRGB EOTF have a max FP16 value of 1.0, and the PQ EOTF a max FP16 > > > value of 125.0? That may work, but it tends towards the "descriptive" notion of > > > assuming the source content, which may not be accurate in all cases. This is > > > also an issue for the custom 1D LUT, as the blob will need to be converted to > > > FP16 in order to populate our "degamma" LUT. What should the resulting max FP16 > > > value be, given that we no longer have any hint as to the source content? > > > > In my opinion, all finite non-negative transfer functions should > > operate with [0.0, 1.0] domain and [0.0, 1.0] range, and that includes > > all sRGB, power 2.2, and PQ curves. Right, I think so too, otherwise you are making assumptions about the source content. 
For example, it's possible to do HDR with a basic gamma curve, so you can't really assume that gamma should always go up to 1.0, but PQ up to 125.0. If you did that, it would necessitate adding an "HDR Gamma" curve, which is converging back on a "descriptive" UAPI. By leaving the final range up to the subsequent multiplier, the client gets to choose independently from the TF, which seems more in line with the goals of this proposal. > > > > If we look at BT.2100, there is no such encoding even mentioned where > > 125.0 would correspond to 10k cd/m². That 125.0 convention already has > > a built-in assumption what the color spaces are and what the conversion > > is aiming to do. IOW, I would say that choice is opinionated from the > > start. The multiplier in BT.2100 is always 10000. Be that as it may, the convention of FP16 125.0 corresponding to 10k nits is baked in our hardware, so it's unavoidable at least for NVIDIA pipelines. > > > > Given that elements like various kinds of look-up tables inherently > > assume that the domain is [0.0, 1.0] (because it is a table that > > has a beginning and an end, and the usual convention is that the > > beginning is zero and the end is one), I think it is best to stick to > > the [0.0, 1.0] range where possible. If we go out of that range, then > > we have to define how a LUT would apply in a sensible way. In my last reply I mentioned a static (but actually programmable) LUT that is typically used to convert FP16 linear pixels to fixed point PQ before handing them to the scaler and tone mapping operator. You're actually right that it indexes in the fixed point [0.0, 1.0] range for the reasons you describe, but because the input pixels are expected to be FP16 in the [0.0, 125.0] range, it applies a non-programmable 1/125.0 normalization factor first. In this case, you could think of the LUT as indexing on [0.0, 125.0], but as you point out there would need to be some way to describe that.
Maybe we actually need a fractional multiplier / divider color op. NVIDIA pipes that include this LUT would need to include a mandatory 1/125.0 factor immediately prior to the LUT, then LUT can continue assuming a range of [0.0, 1.0]. Assuming you are using the hardware in a conventional way, specifying a multiplier of 1.0 after the "degamma" LUT would then map to the 80-nit PQ range after the static (but actually programmable) PQ LUT, whereas specifying a multiplier of 125.0 would map to the 10,000-nit PQ range, which is what we want. I guess it's kind of messy, but the effect would be that color ops other than multipliers/dividers would still be in the [0.0, 1.0] domain, and any multiplier that exceeds that range would have to be normalized by a divider before any other color op. > > > > Many TFs are intended to be defined only on [0.0, 1.0] -> [0.0, 1.0]. > > Some curves, like power 2.2, have a mathematical form that naturally > > extends outside of that range. Power 2.2 generalizes to >1.0 input > > values as is, but not for negative input values. If needed for negative > > input values, it is common to use y = -TF(-x) for x < 0 mirroring. > > > > scRGB is the prime example that intentionally uses negative channel > > values. We can also have negative channel values with limited > > quantization range, sometimes even intentionally (xvYCC chroma, or > > PLUGE test sub-blacks). Out-of-unit-range values can also appear as a > > side-effect of signal processing, and they should not get clipped > > prematurely. This is a challenge for colorops that fundamentally cannot > > handle out-of-unit-range values. > > > > There are various workarounds. scRGB colorimetry can be converted into > > BT.2020 primaries for example, to avoid saturation induced negative > > values. Limited quantization range signal could be processed as-is, > > meaning that the limited range is mapped to [16.0/255, 235.0/255] > > instead of [0.0, 1.0] or so. 
But then, we have a complication with > > transfer functions. > > > > > I think a multiplier color op solves all of these issues. Named curves and > > > custom 1D LUTs would by default assume a max FP16 value of 1.0, which would then > > > be adjusted by the multiplier. > > > > Pretty much. > > > > > For 80 nit SDR content, set it to 1, for 400 > > > nit SDR content, set it to 5, for HDR PQ content, set it to 125, etc. > > > > That I think is another story. > > > > > > > >>> + > > > > > >>> + /* custom 4k entry 1D LUT */ > > > > > >>> + Color operation 52 > > > > > >>> + ├─ "TYPE": immutable enum {1D enumerated curve, 1D LUT, 3x3 matrix, 3x4 matrix, 3D LUT, etc.} = 1D LUT > > > > > >>> + ├─ "BYPASS": bool {true, false} > > > > > >>> + ├─ "LUT_1D_SIZE": immutable range = 4096 > > > > > >>> + ├─ "LUT_1D": blob > > > > > >>> + └─ "NEXT": immutable color operation ID = 0 > > > > > > > > > > > > ... > > > > > > > > > > > >>> +Driver Forward/Backward Compatibility > > > > > >>> +===================================== > > > > > >>> + > > > > > >>> +As this is uAPI drivers can't regress color pipelines that have been > > > > > >>> +introduced for a given HW generation. New HW generations are free to > > > > > >>> +abandon color pipelines advertised for previous generations. > > > > > >>> +Nevertheless, it can be beneficial to carry support for existing color > > > > > >>> +pipelines forward as those will likely already have support in DRM > > > > > >>> +clients. > > > > > >>> + > > > > > >>> +Introducing new colorops to a pipeline is fine, as long as they can be > > > > > >>> +disabled or are purely informational. DRM clients implementing support > > > > > >>> +for the pipeline can always skip unknown properties as long as they can > > > > > >>> +be confident that doing so will not cause unexpected results.
> > > > > >>> + > > > > > >>> +If a new colorop doesn't fall into one of the above categories > > > > > >>> +(bypassable or informational) the modified pipeline would be unusable > > > > > >>> +for user space. In this case a new pipeline should be defined. > > > > > >> > > > > > >> How can user space detect an informational element? Should we just add a > > > > > >> BYPASS property to informational elements, make it read only and set to > > > > > >> true maybe? Or something more descriptive? > > > > > > > > > > > > Read-only BYPASS set to true would be fine by me, I guess. > > > > > > > > > > > > > > > > Don't you mean set to false? An informational element will always do > > > > > something, so it can't be bypassed. > > > > > > > > Yeah, this is why we need a definition. I understand "informational" to > > > > not change pixel values in any way. Previously I had some weird idea > > > > that scaling doesn't alter color, but of course it may. > > > > > > On recent hardware, the NVIDIA pre-blending pipeline includes LUTs that do > > > implicit fixed-point to FP16 conversions, and vice versa. > > > > Above, I claimed that the UAPI should be defined in nominal > > floating-point values, but I wonder, would that work? Would we need to > > have explicit colorops for converting from raw pixel data values into > > nominal floating-point in the UAPI? Yeah, I think something like that is needed, or another solution as discussed below. Even if we define the UAPI in terms of floating point, the actual underlying pixel format needs to match the expectations of each stage as it flows through the pipe. > > > > > For example, the "degamma" LUT towards the beginning of the pipeline implicitly > > > converts from fixed point to FP16, and some of the following operations expect > > > to operate in FP16. 
As such, if you have a fixed point input and don't bypass > > > those following operations, you *must not* bypass the LUT, even if you are > > > otherwise just programming it with the identity. Conversely, if you have a > > > floating point input, you *must* bypass the LUT. > > > > Interesting. Since the color pipeline is not(?) meant to replace pixel > > format definitions which already make the difference between fixed and > > floating point, wouldn't this little detail need to be taken care of by > > the driver under the hood? We could take care of it under the hood in the case where the pixel format is fixed point but the "degamma" LUT is bypassed, simply by programming it with the identity to allow for the conversion to take place. But when the pixel format is FP16 and the "degamma" LUT is *not* bypassed, we would need to either ignore the LUT (bad) or fail the atomic commit. That's why we need some way to communicate the restriction to the client, otherwise they are left guessing why the atomic commit failed. > > > > What if I want to use degamma colorop with a floating-point > > framebuffer? Simply not possible on this hardware? Right, it's not possible. The "degamma" LUT always does an implicit conversion from fixed point to FP16, so if the pixel format is already FP16 it isn't usable. However, the aforementioned static (actually programmable) LUT that follows the "degamma" LUT expects FP16 pixels, so you could still use that to do some kind of transformation. That's actually a good example of a novel use that justifies compositors being able to program it. > > > > > Could informational elements and allowing the exclusion of the BYPASS property > > > be used to convey this information to the client? For example, we could expose > > > one pipeline with the LUT exposed with read-only BYPASS set to false, and > > > sandwich it with informational "Fixed Point" and "FP16" elements to accommodate > > > fixed point input. 
Then, expose another pipeline with the LUT missing, and an > > > informational "FP16" element in its place to accommodate floating point input. > > > > > > That's just an example; we also have other operations in the pipeline that do > > > similar implicit conversions. In these cases we don't want the operations to be > > > bypassed individually, so instead we would expose them as mandatory in some > > > pipelines and missing in others, with informational elements to help inform the > > > client of which to choose. Is that acceptable under the current proposal? > > > > > > Note that in this case, the information just has to do with what format the > > > pixels should be in, it doesn't correspond to any specific operation. So, I'm > > > not sure that BYPASS has any meaning for informational elements in this context. > > > > Very good questions. Do we have to expose those conversions in the UAPI > > to make things work for this hardware? Meaning that we cannot assume all > > colorops work in nominal floating-point from userspace perspective > > (perhaps with varying degrees of precision). > > I had this in my original proposal I think (maybe I only thought about > it, not sure). > > We really should figure this one out. Can we get away with normalized > [0,1] fp as a user space abstraction or not? I think the conversion needs to be exposed at least just the one time at the beginning alongside the "degamma" LUT, since the choice is influenced by an outside factor (the input pixel format). There are subsequent intermediate conversions as well, but that's only an issue if we allow the relevant color ops to be bypassed individually. If we expose a multitude of pipes where the relevant ops are either missing or mandatory in unison, we can avoid mismatched pixel formats while maintaining the illusion of a pipe that operates entirely in floating point. Or, pipes could just have explicit associated input pixel format(s).
The above technique of exposing multiple pipes instead of bypassing color ops individually would still work, and clients would just have to choose a pipe that matches the input pixel format. That way, the actual color ops themselves could still be defined in terms of normalized [0.0, 1.0] floating point (multipliers/dividers excepted), and clients can continue thinking in terms of that after making the initial selection. > > > > > > > > > I think we also need a definition of "informational". > > > > > > > > > > > > Counter-example 1: a colorop that represents a non-configurable > > > > > > > > > > Not sure what's "counter" for these examples? > > > > > > > > > > > YUV<->RGB conversion. Maybe it determines its operation from FB pixel > > > > > > format. It cannot be set to bypass, it cannot be configured, and it > > > > > > will alter color values. > > > > > > Would it be reasonable to expose this as a 3x4 matrix with a read-only blob and > > > no BYPASS property? I already brought up a similar idea at the XDC HDR Workshop > > > based on the principle that read-only blobs could be used to express some static > > > pipeline elements without the need to define a new type, but got mixed opinions. > > > I think this demonstrates the principle further, as clients could detect this > > > programmatically instead of having to special-case the informational element. > > > > I'm all for exposing fixed color ops but I suspect that most of those > follow some standard and in those cases instead of exposing the matrix > values one should prefer to expose a named matrix (e.g. BT.601, BT.709, > BT.2020). > > As a general rule: always expose the highest level description. Going > from a name to exact values is trivial, going from values to a name is > much harder. Good point. It would need to be a conversion between any two defined color spaces, e.g. BT.709-to-BT.2020, hence why it's much harder to go backwards. > > If the blob depends on the pixel format (i.e. 
the driver automatically > > chooses a different blob per pixel format), then I think we would need > > to expose all the blobs and how they correspond to pixel formats. > > Otherwise ok, I guess. > > > > However, do we want or need to make a color pipeline or colorop > > conditional on pixel formats? For example, if you use a YUV 4:2:0 type > > of pixel format, then you must use this pipeline and not any other. Or > > floating-point type of pixel format. I did not anticipate this before, > > I assumed that all color pipelines and colorops are independent of the > > framebuffer pixel format. A specific colorop might have a property that > > needs to agree with the framebuffer pixel format, but I didn't expect > > further limitations. > > We could simply fail commits when the pipeline and pixel format don't > work together. We'll probably need some kind of ingress no-op node > anyway and maybe could list pixel formats there if required to make it > easier to find a working configuration. Yeah, we could, but having to figure that out through trial and error would be unfortunate. Per above, it might be easiest to just tag pipelines with a pixel format instead of trying to include the pixel format conversion as a color op. > > "Without the need to define a new type" is something I think we need to > > consider case by case. I have a hard time giving a general opinion. > > > > > > > > > > > > > > Counter-example 2: image size scaling colorop. It might not be > > > > > > configurable, it is controlled by the plane CRTC_* and SRC_* > > > > > > properties. You still need to understand what it does, so you can > > > > > > arrange the scaling to work correctly. (Do not want to scale an image > > > > > > with PQ-encoded values as Josh demonstrated in XDC.) > > > > > > > > > > > > > > > > IMO the position of the scaling operation is the thing that's important > > > > > here as the color pipeline won't define scaling properties. 
> > > > > > I agree that blending should ideally be done in linear space, and I remember > > > that from Josh's presentation at XDC, but I don't recall the same being said for > > > scaling. In fact, the NVIDIA pre-blending scaler exists in a stage of the > > > pipeline that is meant to be in PQ space (more on this below), and that was > > > found to achieve better results at HDR/SDR boundaries. Of course, this only > > > bolsters the argument that it would be helpful to have an informational "scaler" > > > element to understand at which stage scaling takes place. > > > > Both blending and scaling are fundamentally the same operation: you > > have two or more source colors (pixels), and you want to compute a > > weighted average of them following what happens in nature, that is, > > physics, as that is what humans are used to. > > > > Both blending and scaling will suffer from the same problems if the > > operation is performed on not light-linear values. The result of the > > weighted average does not correspond to physics. > > > > The problem may be hard to observe with natural imagery, but Josh's > > example shows it very clearly. Maybe that effect is sometimes useful > > for some imagery in some use cases, but it is still an accidental > > side-effect. You might get even better results if you don't rely on > > accidental side-effects but design a separate operation for the exact > > goal you have. > > > > Mind, by scaling we mean changing image size. Not scaling color values. > > Fair enough, but it might not always be a choice given the hardware. > > > > > > Counter-example 3: image sampling colorop. Averages FB originated color > > > > > > values to produce a color sample. Again do not want to do this with > > > > > > PQ-encoded values. > > > > > > > > > > > > > > > > Wouldn't this only happen during a scaling op? > > > > > > > > There is certainly some overlap between examples 2 and 3. 
IIRC SRC_X/Y > > > > coordinates can be fractional, which makes nearest vs. bilinear > > > > sampling have a difference even if there is no scaling. > > > > > > > > There is also the question of chroma siting with sub-sampled YUV. I > > > > don't know how that actually works, or how it theoretically should work. > > > > > > We have some operations in our pipeline that are intended to be static, i.e. a > > > static matrix that converts from RGB to LMS, and later another that converts > > > from LMS to ICtCp. There are even LUTs that are intended to be static, > > > converting from linear to PQ and vice versa. All of this is because the > > > pre-blending scaler and tone mapping operator are intended to operate in ICtCp > > > PQ space. Although the stated LUTs and matrices are intended to be static, they > > > are actually programmable. In offline discussions, it was indicated that it > > > would be helpful to actually expose the programmability, as opposed to exposing > > > them as non-bypassable blocks, as some compositors may have novel uses for them. > > > > Correct. Doing tone-mapping in ICtCp etc. are already policy that > > userspace might or might not agree with. > > > > Exposing static colorops will help usages that adhere to current > > prevalent standards around very specific use cases. There may be > > millions of devices needing exactly that processing in their usage, but > > it is also quite limiting in what one can do with the hardware. > > > > > Despite being programmable, the LUTs are updated in a manner that is less > > > efficient as compared to e.g. the non-static "degamma" LUT. Would it be helpful > > > if there was some way to tag operations according to their performance, > > > for example so that clients can prefer a high performance one when they > > > intend to do an animated transition? I recall from the XDC HDR workshop > > > that this is also an issue with AMD's 3DLUT, where updates can be too > > > slow to animate. 
> > > > I can certainly see such information being useful, but then we need to > > somehow quantize the performance. Right, which wouldn't even necessarily be universal; it could depend on the given host, GPU, etc. It could just be a relative performance indication, to give an order of preference. That wouldn't tell you if it can or can't be animated, but when choosing between two LUTs to animate you could prefer the higher performance one. > > > > What I was left puzzled about after the XDC workshop is that is it > > possible to pre-load configurations in the background (slow), and then > > quickly switch between them? Hardware-wise I mean. This works fine for our "fast" LUTs: you just point them to a surface in video memory and they flip to it. You could keep multiple surfaces around and flip between them without having to reprogram them in software. We can easily do that with enumerated curves, populating them when the driver initializes instead of waiting for the client to request them. You can even point multiple hardware LUTs to the same video memory surface, if they need the same curve. > > We could define that pipelines with a lower ID are to be preferred over > higher IDs. Sure, but this isn't just an issue with a pipeline as a whole, but the individual elements within it and how to use them in a given context. > > The issue is that if programming a pipeline becomes too slow to be > useful it probably should just not be made available to user space. It's not that programming the pipeline is overall too slow. The LUTs we have that are relatively slow to program are meant to be set infrequently, or even just once, to allow the scaler and tone mapping operator to operate in fixed point PQ space. You might still want the tone mapper, so you would choose a pipeline that includes them, but when it comes to e.g. animating a night light, you would want to choose a different LUT for that purpose. 
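The pre-load-and-flip scheme described above can be sketched as a toy model (all names here are purely illustrative, not driver code): populating a surface with a curve is the slow step, while retargeting a hardware LUT to an already-populated surface is just a pointer update.

```python
# Toy model of pre-loaded LUT "surfaces": populating a surface is the
# expensive step; flipping the hardware LUT to an already-populated
# surface is a cheap pointer update. Names are hypothetical.

def populate_surface(curve, size=16):
    """Evaluate a 1D curve into a table (the slow, one-time step)."""
    return [curve(i / (size - 1)) for i in range(size)]

class HwLut:
    def __init__(self):
        self.active = None          # pointer to the current surface

    def flip_to(self, surface):
        """Cheap per-frame operation: retarget, don't reprogram."""
        self.active = surface

# Pre-load two curves once, e.g. when the driver initializes.
identity = populate_surface(lambda x: x)
gamma22  = populate_surface(lambda x: x ** 2.2)

lut = HwLut()
lut.flip_to(identity)
lut.flip_to(gamma22)    # switch without re-populating anything

# Two hardware LUTs can even share one surface if they need the same curve.
lut2 = HwLut()
lut2.flip_to(gamma22)
```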
> > The prepare-commit idea for blob properties would help to make the > pipelines usable again, but until then it's probably a good idea to just > not expose those pipelines. The prepare-commit idea actually wouldn't work for these LUTs, because they are programmed using methods instead of pointing them to a surface. I'm not sure how slow it actually is; I would need to benchmark it. I think not exposing them at all would be overkill, since it would mean you can't use the preblending scaler or tonemapper, and animation isn't necessary for that. The AMD 3DLUT is another example of a LUT that is slow to update, and it would obviously be a major loss if that wasn't exposed. There just needs to be some way for clients to know if they are going to kill performance by trying to change it every frame. Thanks, Alex > > > > > > Thanks, > > pq > > >
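The "relative performance indication" idea discussed above could be modeled like this (the perf tags, colorop names, and selection helper are all hypothetical, not part of any proposed UAPI): a client animating per-frame would simply prefer the highest-ranked candidate of the type it needs.

```python
# Sketch of a relative (ordinal, not absolute) performance tag on
# colorops: higher means cheaper to reprogram per frame. The values
# have no universal meaning; they only give an order of preference
# within one pipeline on one device. All names are hypothetical.

colorops = [
    {"name": "slow_static_lut", "type": "1D LUT", "perf": 0},
    {"name": "degamma_lut",     "type": "1D LUT", "perf": 2},
    {"name": "3dlut",           "type": "3D LUT", "perf": 1},
]

def pick_for_animation(ops, wanted_type):
    """Prefer the cheapest-to-update colorop of the wanted type."""
    candidates = [op for op in ops if op["type"] == wanted_type]
    return max(candidates, key=lambda op: op["perf"])

best = pick_for_animation(colorops, "1D LUT")
```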
On 10/26/23 21:25, Alex Goins wrote: > On Thu, 26 Oct 2023, Sebastian Wick wrote: >> On Thu, Oct 26, 2023 at 11:57:47AM +0300, Pekka Paalanen wrote: >>> On Wed, 25 Oct 2023 15:16:08 -0500 (CDT) >>> Alex Goins <agoins@nvidia.com> wrote: >>> >>>> Despite being programmable, the LUTs are updated in a manner that is less >>>> efficient as compared to e.g. the non-static "degamma" LUT. Would it be helpful >>>> if there was some way to tag operations according to their performance, >>>> for example so that clients can prefer a high performance one when they >>>> intend to do an animated transition? I recall from the XDC HDR workshop >>>> that this is also an issue with AMD's 3DLUT, where updates can be too >>>> slow to animate. >>> >>> I can certainly see such information being useful, but then we need to >>> somehow quantize the performance. > > Right, which wouldn't even necessarily be universal, could depend on the given > host, GPU, etc. It could just be a relative performance indication, to give an > order of preference. That wouldn't tell you if it can or can't be animated, but > when choosing between two LUTs to animate you could prefer the higher > performance one. > >>> >>> What I was left puzzled about after the XDC workshop is that is it >>> possible to pre-load configurations in the background (slow), and then >>> quickly switch between them? Hardware-wise I mean. > > This works fine for our "fast" LUTs, you just point them to a surface in video > memory and they flip to it. You could keep multiple surfaces around and flip > between them without having to reprogram them in software. We can easily do that > with enumerated curves, populating them when the driver initializes instead of > waiting for the client to request them. You can even point multiple hardware > LUTs to the same video memory surface, if they need the same curve. > >> >> We could define that pipelines with a lower ID are to be preferred over >> higher IDs. 
> > Sure, but this isn't just an issue with a pipeline as a whole, but the > individual elements within it and how to use them in a given context. > >> >> The issue is that if programming a pipeline becomes too slow to be >> useful it probably should just not be made available to user space. > > It's not that programming the pipeline is overall too slow. The LUTs we have > that are relatively slow to program are meant to be set infrequently, or even > just once, to allow the scaler and tone mapping operator to operate in fixed > point PQ space. You might still want the tone mapper, so you would choose a > pipeline that includes them, but when it comes to e.g. animating a night light, > you would want to choose a different LUT for that purpose. > >> >> The prepare-commit idea for blob properties would help to make the >> pipelines usable again, but until then it's probably a good idea to just >> not expose those pipelines. > > The prepare-commit idea actually wouldn't work for these LUTs, because they are > programmed using methods instead of pointing them to a surface. I'm actually not > sure how slow it actually is, would need to benchmark it. I think not exposing > them at all would be overkill, since it would mean you can't use the preblending > scaler or tonemapper, and animation isn't necessary for that. > > The AMD 3DLUT is another example of a LUT that is slow to update, and it would > obviously be a major loss if that wasn't exposed. There just needs to be some > way for clients to know if they are going to kill performance by trying to > change it every frame. Might a first step be to require the ALLOW_MODESET flag to be set when changing the values for a colorop which is too slow to be updated per refresh cycle? This would tell the compositor: You can use this colorop, but you can't change its values on the fly.
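The ALLOW_MODESET gating suggested above can be sketched as a toy atomic-check model (the state layout, flag handling, and return codes are illustrative only, not actual DRM code): changing the values of a colorop tagged as slow is only accepted when the flag is set, while leaving it untouched is always fine.

```python
# Toy model of gating slow colorop updates behind ALLOW_MODESET.
# A commit that changes the data of a "slow" colorop without the flag
# is rejected; unchanged state or fast updates pass. Illustrative only.

ALLOW_MODESET = 1 << 0

def check_commit(old_state, new_state, flags):
    for name, new_val in new_state.items():
        changed = old_state[name]["data"] != new_val["data"]
        if changed and new_val["slow"] and not (flags & ALLOW_MODESET):
            return -1   # reject: too slow to update per refresh cycle
    return 0

old = {"3dlut": {"data": "blob_a", "slow": True}}
new = {"3dlut": {"data": "blob_b", "slow": True}}
```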
On Fri, Oct 27, 2023 at 10:59:25AM +0200, Michel Dänzer wrote: > On 10/26/23 21:25, Alex Goins wrote: > > On Thu, 26 Oct 2023, Sebastian Wick wrote: > >> On Thu, Oct 26, 2023 at 11:57:47AM +0300, Pekka Paalanen wrote: > >>> On Wed, 25 Oct 2023 15:16:08 -0500 (CDT) > >>> Alex Goins <agoins@nvidia.com> wrote: > >>> > >>>> Despite being programmable, the LUTs are updated in a manner that is less > >>>> efficient as compared to e.g. the non-static "degamma" LUT. Would it be helpful > >>>> if there was some way to tag operations according to their performance, > >>>> for example so that clients can prefer a high performance one when they > >>>> intend to do an animated transition? I recall from the XDC HDR workshop > >>>> that this is also an issue with AMD's 3DLUT, where updates can be too > >>>> slow to animate. > >>> > >>> I can certainly see such information being useful, but then we need to > >>> somehow quantize the performance. > > > > Right, which wouldn't even necessarily be universal, could depend on the given > > host, GPU, etc. It could just be a relative performance indication, to give an > > order of preference. That wouldn't tell you if it can or can't be animated, but > > when choosing between two LUTs to animate you could prefer the higher > > performance one. > > > >>> > >>> What I was left puzzled about after the XDC workshop is that is it > >>> possible to pre-load configurations in the background (slow), and then > >>> quickly switch between them? Hardware-wise I mean. > > > > This works fine for our "fast" LUTs, you just point them to a surface in video > > memory and they flip to it. You could keep multiple surfaces around and flip > > between them without having to reprogram them in software. We can easily do that > > with enumerated curves, populating them when the driver initializes instead of > > waiting for the client to request them. You can even point multiple hardware > > LUTs to the same video memory surface, if they need the same curve. 
> > > >> > >> We could define that pipelines with a lower ID are to be preferred over > >> higher IDs. > > > > Sure, but this isn't just an issue with a pipeline as a whole, but the > > individual elements within it and how to use them in a given context. > > > >> > >> The issue is that if programming a pipeline becomes too slow to be > >> useful it probably should just not be made available to user space. > > > > It's not that programming the pipeline is overall too slow. The LUTs we have > > that are relatively slow to program are meant to be set infrequently, or even > > just once, to allow the scaler and tone mapping operator to operate in fixed > > point PQ space. You might still want the tone mapper, so you would choose a > > pipeline that includes them, but when it comes to e.g. animating a night light, > > you would want to choose a different LUT for that purpose. > > > >> > >> The prepare-commit idea for blob properties would help to make the > >> pipelines usable again, but until then it's probably a good idea to just > >> not expose those pipelines. > > > > The prepare-commit idea actually wouldn't work for these LUTs, because they are > > programmed using methods instead of pointing them to a surface. I'm actually not > > sure how slow it actually is, would need to benchmark it. I think not exposing > > them at all would be overkill, since it would mean you can't use the preblending > > scaler or tonemapper, and animation isn't necessary for that. > > > > The AMD 3DLUT is another example of a LUT that is slow to update, and it would > > obviously be a major loss if that wasn't exposed. There just needs to be some > > way for clients to know if they are going to kill performance by trying to > > change it every frame. > > Might a first step be to require the ALLOW_MODESET flag to be set when changing the values for a colorop which is too slow to be updated per refresh cycle? 
> > This would tell the compositor: You can use this colorop, but you can't change its values on the fly. I argued before that changing any color op to passthrough should never require ALLOW_MODESET, and while this is really hard to guarantee from a driver perspective, I still believe it's better to not expose any feature that requires ALLOW_MODESET or takes too long to program to be useful for per-frame changes. When user space has a way to figure out whether going back to a specific state (in this case setting everything to bypass) is possible without ALLOW_MODESET, we can revisit this decision, but until then, let's keep things simple and only expose things that work reliably without ALLOW_MODESET and are fast enough for per-frame changes. Harry, Pekka: Should we document this? It obviously restricts what can be exposed, but exposing things that can't be used by user space isn't useful. > > -- > Earthling Michel Dänzer | https://redhat.com > Libre software enthusiast | Mesa and Xwayland developer >
On Fri, 27 Oct 2023 12:01:32 +0200 Sebastian Wick <sebastian.wick@redhat.com> wrote: > On Fri, Oct 27, 2023 at 10:59:25AM +0200, Michel Dänzer wrote: > > On 10/26/23 21:25, Alex Goins wrote: > > > On Thu, 26 Oct 2023, Sebastian Wick wrote: > > >> On Thu, Oct 26, 2023 at 11:57:47AM +0300, Pekka Paalanen wrote: > > >>> On Wed, 25 Oct 2023 15:16:08 -0500 (CDT) > > >>> Alex Goins <agoins@nvidia.com> wrote: > > >>> > > >>>> Despite being programmable, the LUTs are updated in a manner that is less > > >>>> efficient as compared to e.g. the non-static "degamma" LUT. Would it be helpful > > >>>> if there was some way to tag operations according to their performance, > > >>>> for example so that clients can prefer a high performance one when they > > >>>> intend to do an animated transition? I recall from the XDC HDR workshop > > >>>> that this is also an issue with AMD's 3DLUT, where updates can be too > > >>>> slow to animate. > > >>> > > >>> I can certainly see such information being useful, but then we need to > > >>> somehow quantize the performance. > > > > > > Right, which wouldn't even necessarily be universal, could depend on the given > > > host, GPU, etc. It could just be a relative performance indication, to give an > > > order of preference. That wouldn't tell you if it can or can't be animated, but > > > when choosing between two LUTs to animate you could prefer the higher > > > performance one. > > > > > >>> > > >>> What I was left puzzled about after the XDC workshop is that is it > > >>> possible to pre-load configurations in the background (slow), and then > > >>> quickly switch between them? Hardware-wise I mean. > > > > > > This works fine for our "fast" LUTs, you just point them to a surface in video > > > memory and they flip to it. You could keep multiple surfaces around and flip > > > between them without having to reprogram them in software. 
We can easily do that > > > with enumerated curves, populating them when the driver initializes instead of > > > waiting for the client to request them. You can even point multiple hardware > > > LUTs to the same video memory surface, if they need the same curve. > > > > > >> > > >> We could define that pipelines with a lower ID are to be preferred over > > >> higher IDs. > > > > > > Sure, but this isn't just an issue with a pipeline as a whole, but the > > > individual elements within it and how to use them in a given context. > > > > > >> > > >> The issue is that if programming a pipeline becomes too slow to be > > >> useful it probably should just not be made available to user space. > > > > > > It's not that programming the pipeline is overall too slow. The LUTs we have > > > that are relatively slow to program are meant to be set infrequently, or even > > > just once, to allow the scaler and tone mapping operator to operate in fixed > > > point PQ space. You might still want the tone mapper, so you would choose a > > > pipeline that includes them, but when it comes to e.g. animating a night light, > > > you would want to choose a different LUT for that purpose. > > > > > >> > > >> The prepare-commit idea for blob properties would help to make the > > >> pipelines usable again, but until then it's probably a good idea to just > > >> not expose those pipelines. > > > > > > The prepare-commit idea actually wouldn't work for these LUTs, because they are > > > programmed using methods instead of pointing them to a surface. I'm actually not > > > sure how slow it actually is, would need to benchmark it. I think not exposing > > > them at all would be overkill, since it would mean you can't use the preblending > > > scaler or tonemapper, and animation isn't necessary for that. > > > > > > The AMD 3DLUT is another example of a LUT that is slow to update, and it would > > > obviously be a major loss if that wasn't exposed. 
There just needs to be some > > > way for clients to know if they are going to kill performance by trying to > > > change it every frame. > > > > Might a first step be to require the ALLOW_MODESET flag to be set when changing the values for a colorop which is too slow to be updated per refresh cycle? > > > > This would tell the compositor: You can use this colorop, but you can't change its values on the fly. > > I argued before that changing any color op to passthrough should never > require ALLOW_MODESET and while this is really hard to guarantee from a > driver perspective I still believe that it's better to not expose any > feature requiring ALLOW_MODESET or taking too long to program to be > useful for per-frame changes. > > When user space has ways to figure out if going back to a specific state > (in this case setting everything to bypass) without ALLOW_MODESET we can > revisit this decision, but until then, let's keep things simple and only > expose things that work reliably without ALLOW_MODESET and fast enough > to work for per-frame changes. > > Harry, Pekka: Should we document this? It obviously restricts what can > be exposed but exposing things that can't be used by user space isn't > useful. In an ideal world... but in the real world, I don't know. Would it help if there was a list collected of all the things in various hardware that are known to be too heavy to reprogram every refresh? Maybe that would allow a more educated decision? I bet that also depends on the refresh rate. I would probably be fine with some sort of update cost classification on colorops, and the kernel keeping track of blobs: if userspace sets the same blob on the same colorop that is already there (by blob ID, no need to compare contents), then it's a no-op change. Anyway, I really like reading Alex Goins' reply; it seems we are very much on the same page here. :-) Thanks, pq
Just want to loop back to before we branched off deeper into the programming performance talk. On 10/26/2023 3:25 PM, Alex Goins wrote: > On Thu, 26 Oct 2023, Sebastian Wick wrote: > >> On Thu, Oct 26, 2023 at 11:57:47AM +0300, Pekka Paalanen wrote: >>> On Wed, 25 Oct 2023 15:16:08 -0500 (CDT) >>> Alex Goins <agoins@nvidia.com> wrote: >>> >>>> Thank you Harry and all other contributors for your work on this. Responses >>>> inline - >>>> >>>> On Mon, 23 Oct 2023, Pekka Paalanen wrote: >>>>> >>>>> On Fri, 20 Oct 2023 11:23:28 -0400 >>>>> Harry Wentland <harry.wentland@amd.com> wrote: >>>>> >>>>>> On 2023-10-20 10:57, Pekka Paalanen wrote: >>>>>>> On Fri, 20 Oct 2023 16:22:56 +0200 >>>>>>> Sebastian Wick <sebastian.wick@redhat.com> wrote: >>>>>>> >>>>>>>> Thanks for continuing to work on this! >>>>>>>> >>>>>>>> On Thu, Oct 19, 2023 at 05:21:22PM -0400, Harry Wentland wrote: >>>>>>>>> v2: >>>>>>>>> - Update colorop visualizations to match reality (Sebastian, Alex Hung) >>>>>>>>> - Updated wording (Pekka) >>>>>>>>> - Change BYPASS wording to make it non-mandatory (Sebastian) >>>>>>>>> - Drop cover-letter-like paragraph from COLOR_PIPELINE Plane Property >>>>>>>>> section (Pekka) >>>>>>>>> - Use PQ EOTF instead of its inverse in Pipeline Programming example (Melissa) >>>>>>>>> - Add "Driver Implementer's Guide" section (Pekka) >>>>>>>>> - Add "Driver Forward/Backward Compatibility" section (Sebastian, Pekka) >>>>>>> >>>>>>> ... 
>>>>>>> >>>>>>>>> +An example of a drm_colorop object might look like one of these:: >>>>>>>>> + >>>>>>>>> + /* 1D enumerated curve */ >>>>>>>>> + Color operation 42 >>>>>>>>> + ├─ "TYPE": immutable enum {1D enumerated curve, 1D LUT, 3x3 matrix, 3x4 matrix, 3D LUT, etc.} = 1D enumerated curve >>>>>>>>> + ├─ "BYPASS": bool {true, false} >>>>>>>>> + ├─ "CURVE_1D_TYPE": enum {sRGB EOTF, sRGB inverse EOTF, PQ EOTF, PQ inverse EOTF, …} >>>>>>>>> + └─ "NEXT": immutable color operation ID = 43 >>>> >>>> I know these are just examples, but I would also like to suggest the possibility >>>> of an "identity" CURVE_1D_TYPE. BYPASS = true might get different results >>>> compared to setting an identity in some cases depending on the hardware. See >>>> below for more on this, RE: implicit format conversions. >>>> >>>> Although NVIDIA hardware doesn't use a ROM for enumerated curves, it came up in >>>> offline discussions that it would nonetheless be helpful to expose enumerated >>>> curves in order to hide the vendor-specific complexities of programming >>>> segmented LUTs from clients. In that case, we would simply refer to the >>>> enumerated curve when calculating/choosing segmented LUT entries. >>> >>> That's a good idea. >>> >>>> Another thing that came up in offline discussions is that we could use multiple >>>> color operations to program a single operation in hardware. As I understand it, >>>> AMD has a ROM-defined LUT, followed by a custom 4K entry LUT, followed by an >>>> "HDR Multiplier". On NVIDIA we don't have these as separate hardware stages, but >>>> we could combine them into a singular LUT in software, such that you can combine >>>> e.g. segmented PQ EOTF with night light. One caveat is that you will lose >>>> precision from the custom LUT where it overlaps with the linear section of the >>>> enumerated curve, but that is unavoidable and shouldn't be an issue in most >>>> use-cases. >>> >>> Indeed. 
>>> >>>> Actually, the current examples in the proposal don't include a multiplier color >>>> op, which might be useful. For AMD as above, but also for NVIDIA as the >>>> following issue arises: >>>> >>>> As discussed further below, the NVIDIA "degamma" LUT performs an implicit fixed If possible, let's declare this as two blocks. One that informatively declares the conversion is present, and another for the de-gamma. This will help with block-reuse between vendors. >>>> point to FP16 conversion. In that conversion, what fixed point 0xFFFFFFFF maps >>>> to in floating point varies depending on the source content. If it's SDR >>>> content, we want the max value in FP16 to be 1.0 (80 nits), subject to a >>>> potential boost multiplier if we want SDR content to be brighter. If it's HDR PQ >>>> content, we want the max value in FP16 to be 125.0 (10,000 nits). My assumption >>>> is that this is also what AMD's "HDR Multiplier" stage is used for, is that >>>> correct? >>> >>> It would be against the UAPI design principles to tag content as HDR or >>> SDR. What you can do instead is to expose a colorop with a multiplier of >>> 1.0 or 125.0 to match your hardware behaviour, then tell your hardware >>> that the input is SDR or HDR to get the expected multiplier. You will >>> never know what the content actually is, anyway. > > Right, I didn't mean to suggest that we should tag content as HDR or SDR in the > UAPI, just relating to the end result in the pipe, ultimately it would be > determined by the multiplier color op. > A multiplier could work, but we should give OEMs the option to either make it "informative" and fixed by the hardware, or fully configurable. With the Qualcomm pipeline, how we absorb FP16 pixel buffers, as well as how we convert them to fixed point data, actually depends on the desired de-gamma and gamma processing. 
For example: if a source pixel buffer is scRGB encoded FP16 content, we would expect input pixel content to be up to 7.5, with the IGC output reaching 125 as in the NVIDIA case. Likewise, gamma 2.2 encoded FP16 content would be 0-1 in and 0-1 out. So in the Qualcomm case the expectations are fixed depending on the use case. It sounds to me like we would need to be able to declare three things here: 1. Value range expectations *into* the de-gamma block. A multiplier wouldn't work here because it would be more of a clipping operation. I guess we would have to add an explicit clamping block as well. 2. What the value range expectations are at the *output* of the de-gamma processing block. Also covered by using another multiplier block. 3. Value range expectations *into* a gamma processing block. This should be covered by declaring a multiplier post-csc, but only assuming CSC output is normalized in the desired value range. A clamping block would be preferable because it describes what happens when it isn't. All this is doable, but it seems like it would require the definition of multiple color pipelines to expose the different limitations for color block configuration combinations. Additionally, would it be easy for user space to find the right pipeline? >>> >>> Of course, if we want to have an arbitrary multiplier colorop that is >>> somewhat standard, as in, exposed by many drivers to ease userspace >>> development, you can certainly use any combination of your hardware >>> features you need to realize the UAPI prescribed mathematical operation. >>> >>> Since we are talking about floating-point in hardware, a multiplier >>> does not significantly affect precision. >>> >>> In order to mathematically define all colorops, I believe it is >>> necessary to define all colorops in terms of floating-point values (as >>> in math), even if they operate on fixed-point or integer. 
By this I >>> mean that if the input is 8 bpc unsigned integer pixel format for >>> instance, 0 raw pixel channel value is mapped to 0.0 and 255 is mapped >>> to 1.0, and the color pipeline starts with [0.0, 1.0], not [0, 255] >>> domain. We have to agree on this mapping for all channels on all pixel >>> formats. However, there is a "but" further below. > > I think this makes sense insofar as how we interact with the UAPI, and that's > basically how fixed point works for us anyway. However, relating to your "but", > it doesn't avoid the issue with hardware expectations about pixel formats since > it doesn't change the underlying pixel format. > >>> >>> I also propose that quantization range is NOT considered in the raw >>> value mapping, so that we can handle quantization range in colorops >>> explicitly, allowing us to e.g. handle sub-blacks and super-whites when >>> necessary. (These are currently impossible to represent in the legacy >>> color properties, because everything is converted to full range and >>> clipped before any color operations.) >>> >>>> From the given enumerated curves, it's not clear how they would map to the >>>> above. Should sRGB EOTF have a max FP16 value of 1.0, and the PQ EOTF a max FP16 >>>> value of 125.0? That may work, but it tends towards the "descriptive" notion of >>>> assuming the source content, which may not be accurate in all cases. This is >>>> also an issue for the custom 1D LUT, as the blob will need to be converted to >>>> FP16 in order to populate our "degamma" LUT. What should the resulting max FP16 >>>> value be, given that we no longer have any hint as to the source content? >>> >>> In my opinion, all finite non-negative transfer functions should >>> operate with [0.0, 1.0] domain and [0.0, 1.0] range, and that includes >>> all sRGB, power 2.2, and PQ curves. > > Right, I think so too, otherwise you are making assumptions about the source > content. 
For example, it's possible to do HDR with a basic gamma curve, so you > can't really assume that gamma should always go up to 1.0, but PQ up to 125.0. > If you did that, it would necessitate adding an "HDR Gamma" curve, which is > converging back on a "descriptive" UAPI. By leaving the final range up to the > subsequent multiplier, the client gets to choose independently from the TF, > which seems more in line with the goals of this proposal. > >>> >>> If we look at BT.2100, there is no such encoding even mentioned where >>> 125.0 would correspond to 10k cd/m². That 125.0 convention already has >>> a built-in assumption about what the color spaces are and what the conversion >>> is aiming to do. IOW, I would say that choice is opinionated from the >>> start. The multiplier in BT.2100 is always 10000. > > Be that as it may, the convention of FP16 125.0 corresponding to 10k nits is > baked into our hardware, so it's unavoidable at least for NVIDIA pipelines. > >>> >>> Given that elements like various kinds of look-up tables inherently >>> assume that the domain is [0.0, 1.0] (because it is a table that >>> has a beginning and an end, and the usual convention is that the >>> beginning is zero and the end is one), I think it is best to stick to >>> the [0.0, 1.0] range where possible. If we go out of that range, then >>> we have to define how a LUT would apply in a sensible way. > > In my last reply I mentioned a static (but actually programmable) LUT that is > typically used to convert FP16 linear pixels to fixed point PQ before handing > them to the scaler and tone mapping operator. You're actually right that it > indexes in the fixed point [0.0, 1.0] range for the reasons you describe, but > because the input pixels are expected to be FP16 in the [0.0, 125.0] range, it > applies a non-programmable 1/125.0 normalization factor first.
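The fixed 1/125.0 normalization described above can be sketched as follows (illustrative Python, not driver code; the LUT contents and nearest-neighbour sampling are assumptions for the sake of the example):

```python
def lut_sample(lut, x):
    """Nearest-neighbour lookup in a 1D LUT whose index domain is [0.0, 1.0]."""
    x = min(1.0, max(0.0, x))  # clamp into the LUT's index domain
    return lut[round(x * (len(lut) - 1))]

def static_pq_lut_stage(fp16_pixel, lut, divider=125.0):
    # The non-programmable 1/125.0 factor maps FP16 [0.0, 125.0] pixels
    # into the LUT's [0.0, 1.0] index domain before sampling, as
    # described for the NVIDIA static PQ LUT.
    return lut_sample(lut, fp16_pixel / divider)
```

An explicit divider colorop in the UAPI would model exactly this pre-LUT factor, letting every other colorop keep the [0.0, 1.0] domain.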
> > In this case, you could think of the LUT as indexing on [0.0, 125.0], but as you > point out there would need to be some way to describe that. Maybe we actually > need a fractional multiplier / divider color op. NVIDIA pipes that include this > LUT would need to include a mandatory 1/125.0 factor immediately prior to the > LUT, then the LUT can continue assuming a range of [0.0, 1.0]. > > Assuming you are using the hardware in a conventional way, specifying a > multiplier of 1.0 after the "degamma" LUT would then map to the 80-nit PQ range > after the static (but actually programmable) PQ LUT, whereas specifying a > multiplier of 125.0 would map to the 10,000-nit PQ range, which is what we want. > I guess it's kind of messy, but the effect would be that color ops other than > multipliers/dividers would still be in the [0.0, 1.0] domain, and any multiplier > that exceeds that range would have to be normalized by a divider before any > other color op. > Hmm. A multiplier would resolve issues when input linear FP16 data has different ideas about what 1.0 means in terms of nits (think of Apple's EDR as an example). For a client to go from their definition to the hardware's definition of 1.0 = x nits, we would need to expose what the pipeline sees as 1.0 though. So in this case the multiplier would be programmable, but the divisor is informational? It seems like the latter would have an influence on how the former is programmed. >>> >>> Many TFs are intended to be defined only on [0.0, 1.0] -> [0.0, 1.0]. >>> Some curves, like power 2.2, have a mathematical form that naturally >>> extends outside of that range. Power 2.2 generalizes to >1.0 input >>> values as is, but not for negative input values. If needed for negative >>> input values, it is common to use y = -TF(-x) for x < 0 mirroring. >>> >>> scRGB is the prime example that intentionally uses negative channel >>> values.
We can also have negative channel values with limited >>> quantization range, sometimes even intentionally (xvYCC chroma, or >>> PLUGE test sub-blacks). Out-of-unit-range values can also appear as a >>> side-effect of signal processing, and they should not get clipped >>> prematurely. This is a challenge for colorops that fundamentally cannot >>> handle out-of-unit-range values. >>> >>> There are various workarounds. scRGB colorimetry can be converted into >>> BT.2020 primaries for example, to avoid saturation induced negative >>> values. Limited quantization range signal could be processed as-is, >>> meaning that the limited range is mapped to [16.0/255, 235.0/255] >>> instead of [0.0, 1.0] or so. But then, we have a complication with >>> transfer functions. >>> >>>> I think a multiplier color op solves all of these issues. Named curves and >>>> custom 1D LUTs would by default assume a max FP16 value of 1.0, which would then >>>> be adjusted by the multiplier. >>> >>> Pretty much. >>> >>>> For 80 nit SDR content, set it to 1, for 400 >>>> nit SDR content, set it to 5, for HDR PQ content, set it to 125, etc. >>> >>> That, I think, is another story. >>> >>>>>>>>> + >>>>>>>>> + /* custom 4k entry 1D LUT */ >>>>>>>>> + Color operation 52 >>>>>>>>> + ├─ "TYPE": immutable enum {1D enumerated curve, 1D LUT, 3x3 matrix, 3x4 matrix, 3D LUT, etc.} = 1D LUT >>>>>>>>> + ├─ "BYPASS": bool {true, false} >>>>>>>>> + ├─ "LUT_1D_SIZE": immutable range = 4096 >>>>>>>>> + ├─ "LUT_1D": blob >>>>>>>>> + └─ "NEXT": immutable color operation ID = 0 >>>>>>> >>>>>>> ... >>>>>>> >>>>>>>>> +Driver Forward/Backward Compatibility >>>>>>>>> +===================================== >>>>>>>>> + >>>>>>>>> +As this is uAPI drivers can't regress color pipelines that have been >>>>>>>>> +introduced for a given HW generation. New HW generations are free to >>>>>>>>> +abandon color pipelines advertised for previous generations.
>>>>>>>>> +Nevertheless, it can be beneficial to carry support for existing color >>>>>>>>> +pipelines forward as those will likely already have support in DRM >>>>>>>>> +clients. >>>>>>>>> + >>>>>>>>> +Introducing new colorops to a pipeline is fine, as long as they can be >>>>>>>>> +disabled or are purely informational. DRM clients implementing support >>>>>>>>> +for the pipeline can always skip unknown properties as long as they can >>>>>>>>> +be confident that doing so will not cause unexpected results. >>>>>>>>> + >>>>>>>>> +If a new colorop doesn't fall into one of the above categories >>>>>>>>> +(bypassable or informational) the modified pipeline would be unusable >>>>>>>>> +for user space. In this case a new pipeline should be defined. >>>>>>>> >>>>>>>> How can user space detect an informational element? Should we just add a >>>>>>>> BYPASS property to informational elements, make it read only and set to >>>>>>>> true maybe? Or something more descriptive? >>>>>>> >>>>>>> Read-only BYPASS set to true would be fine by me, I guess. >>>>>>> >>>>>> >>>>>> Don't you mean set to false? An informational element will always do >>>>>> something, so it can't be bypassed. >>>>> >>>>> Yeah, this is why we need a definition. I understand "informational" to >>>>> not change pixel values in any way. Previously I had some weird idea >>>>> that scaling doesn't alter color, but of course it may. >>>> >>>> On recent hardware, the NVIDIA pre-blending pipeline includes LUTs that do >>>> implicit fixed-point to FP16 conversions, and vice versa. >>> >>> Above, I claimed that the UAPI should be defined in nominal >>> floating-point values, but I wonder, would that work? Would we need to >>> have explicit colorops for converting from raw pixel data values into >>> nominal floating-point in the UAPI? > > Yeah, I think something like that is needed, or another solution as discussed > below. 
Even if we define the UAPI in terms of floating point, the actual > underlying pixel format needs to match the expectations of each stage as it > flows through the pipe. > Strongly agree on this. Pixel format and block relationships definitely exist. >>> >>>> For example, the "degamma" LUT towards the beginning of the pipeline implicitly >>>> converts from fixed point to FP16, and some of the following operations expect >>>> to operate in FP16. As such, if you have a fixed point input and don't bypass >>>> those following operations, you *must not* bypass the LUT, even if you are >>>> otherwise just programming it with the identity. Conversely, if you have a >>>> floating point input, you *must* bypass the LUT. >>> >>> Interesting. Since the color pipeline is not(?) meant to replace pixel >>> format definitions which already make the difference between fixed and >>> floating point, wouldn't this little detail need to be taken care of by >>> the driver under the hood? > > We could take care of it under the hood in the case where the pixel format is > fixed point but the "degamma" LUT is bypassed, simply by programming it with the > identity to allow for the conversion to take place. But when the pixel format is > FP16 and the "degamma" LUT is *not* bypassed, we would need to either ignore the > LUT (bad) or fail the atomic commit. That's why we need some way to communicate > the restriction to the client, otherwise they are left guessing why the atomic > commit failed. > >>> >>> What if I want to use degamma colorop with a floating-point >>> framebuffer? Simply not possible on this hardware? > > Right, it's not possible. The "degamma" LUT always does an implicit conversion > from fixed point to FP16, so if the pixel format is already FP16 it isn't > usable. However, the aforementioned static (actually programmable) LUT that > follows the "degamma" LUT expects FP16 pixels, so you could still use that to do > some kind of transformation. 
That's actually a good example of a novel use that > justifies compositors being able to program it. > >>> >>>> Could informational elements and allowing the exclusion of the BYPASS property >>>> be used to convey this information to the client? For example, we could expose >>>> one pipeline with the LUT exposed with read-only BYPASS set to false, and >>>> sandwich it with informational "Fixed Point" and "FP16" elements to accommodate >>>> fixed point input. Then, expose another pipeline with the LUT missing, and an >>>> informational "FP16" element in its place to accommodate floating point input. >>>> >>>> That's just an example; we also have other operations in the pipeline that do >>>> similar implicit conversions. In these cases we don't want the operations to be >>>> bypassed individually, so instead we would expose them as mandatory in some >>>> pipelines and missing in others, with informational elements to help inform the >>>> client of which to choose. Is that acceptable under the current proposal? >>>> >>>> Note that in this case, the information just has to do with what format the >>>> pixels should be in; it doesn't correspond to any specific operation. So, I'm >>>> not sure that BYPASS has any meaning for informational elements in this context. >>> >>> Very good questions. Do we have to expose those conversions in the UAPI >>> to make things work for this hardware? Meaning that we cannot assume all >>> colorops work in nominal floating-point from a userspace perspective >>> (perhaps with varying degrees of precision). >> >> I had this in my original proposal I think (maybe I only thought about >> it, not sure). >> >> We really should figure this one out. Can we get away with normalized >> [0,1] fp as a user space abstraction or not? > > I think the conversion needs to be exposed at least just the one time at the > beginning alongside the "degamma" LUT, since the choice is influenced by an outside > factor (the input pixel format).
There are subsequent intermediate conversions > as well, but that's only an issue if we allow the relevant color ops to be > bypassed individually. If we expose a multitude of pipes where the relevant ops > are either missing or mandatory in unison, we can avoid mismatched pixel formats > while maintaining the illusion of a pipe that operates entirely in floating > point. > > Or, pipes could just have explicit associated input pixel format(s). The above > technique of exposing multiple pipes instead of bypassing color ops individually > would still work, and clients would just have to choose a pipe that matches the > input pixel format. That way, the actual color ops themselves could still be > defined in terms of normalized [0.0, 1.0] floating point (multipliers/dividers > excepted), and clients can continue thinking in terms of that after making the > initial selection. > >> >>> >>>>>>> I think we also need a definition of "informational". >>>>>>> >>>>>>> Counter-example 1: a colorop that represents a non-configurable >>>>>> >>>>>> Not sure what's "counter" for these examples? >>>>>> >>>>>>> YUV<->RGB conversion. Maybe it determines its operation from FB pixel >>>>>>> format. It cannot be set to bypass, it cannot be configured, and it >>>>>>> will alter color values. >>>> >>>> Would it be reasonable to expose this as a 3x4 matrix with a read-only blob and >>>> no BYPASS property? I already brought up a similar idea at the XDC HDR Workshop >>>> based on the principle that read-only blobs could be used to express some static >>>> pipeline elements without the need to define a new type, but got mixed opinions. >>>> I think this demonstrates the principle further, as clients could detect this >>>> programmatically instead of having to special-case the informational element.
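The pipe-selection approach described above, where each exposed pipeline carries its accepted input pixel formats and the client picks a match instead of toggling format-dependent colorops, could look roughly like this from the client's side (all names and pipeline data here are hypothetical, purely to illustrate the selection logic):

```python
# Hypothetical description of two exposed pipelines: one for fixed
# point inputs (with the implicit-conversion LUT mandatory) and one
# for FP16 inputs (with that LUT absent). Format names follow DRM
# FourCC spelling, but the pipeline data itself is made up.
PIPELINES = [
    {"id": 1, "formats": {"ARGB8888", "XRGB2101010"}, "colorops": [42, 52]},
    {"id": 2, "formats": {"ARGB16161616F"}, "colorops": [52]},
]

def pick_pipeline(fb_format, pipelines=PIPELINES):
    """Return the id of a pipeline that accepts the framebuffer format."""
    for p in pipelines:
        if fb_format in p["formats"]:
            return p["id"]
    return None  # no usable pipeline; fall back to shader composition
```

This keeps the format restriction out of per-colorop BYPASS semantics: the client makes one selection up front and can then reason about every colorop in the nominal [0.0, 1.0] domain.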
>>> >> >> I'm all for exposing fixed color ops but I suspect that most of those >> follow some standard and in those cases instead of exposing the matrix >> values one should prefer to expose a named matrix (e.g. BT.601, BT.709, >> BT.2020). >> >> As a general rule: always expose the highest level description. Going >> from a name to exact values is trivial, going from values to a name is >> much harder. > > Good point. It would need to be a conversion between any two defined color > spaces e.g. BT.709-to-BT.2020, hence why it's much harder to go backwards. > A small advantage of providing name + values (or just blob ID) is that if the compositor needs to make a GPU shader that matches the hardware they could refer to the matrix values from the driver instead of having their own copy of what the standard says the conversion should be. >>> If the blob depends on the pixel format (i.e. the driver automatically >>> chooses a different blob per pixel format), then I think we would need >>> to expose all the blobs and how they correspond to pixel formats. >>> Otherwise ok, I guess. >>> >>> However, do we want or need to make a color pipeline or colorop >>> conditional on pixel formats? For example, if you use a YUV 4:2:0 type >>> of pixel format, then you must use this pipeline and not any other. Or >>> floating-point type of pixel format. I did not anticipate this before, >>> I assumed that all color pipelines and colorops are independent of the >>> framebuffer pixel format. A specific colorop might have a property that >>> needs to agree with the framebuffer pixel format, but I didn't expect >>> further limitations. >> >> We could simply fail commits when the pipeline and pixel format don't >> work together. We'll probably need some kind of ingress no-op node >> anyway and maybe could list pixel formats there if required to make it >> easier to find a working configuration. 
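As an illustration of "going from a name to exact values is trivial": the BT.709-to-BT.2020 conversion discussed above corresponds to the following 3x3 matrix (coefficients per ITU-R BT.2087), which is also the kind of data a compositor could read back from a driver blob to keep a GPU shader bit-matched with the hardware:

```python
# BT.709 RGB -> BT.2020 RGB primary conversion (ITU-R BT.2087).
# Each row sums to 1.0, so reference white is preserved.
BT709_TO_BT2020 = [
    [0.6274, 0.3293, 0.0433],
    [0.0691, 0.9195, 0.0114],
    [0.0164, 0.0880, 0.8956],
]

def apply_3x3(m, rgb):
    """Apply a 3x3 matrix to a linear (R, G, B) triple."""
    return tuple(sum(m[r][c] * rgb[c] for c in range(3)) for r in range(3))
```

Going the other way, recovering the name "BT.709-to-BT.2020" from these nine numbers, requires matching against every known standard matrix within some tolerance, which is exactly why exposing the highest-level description (the name) is preferable.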
> > Yeah, we could, but having to figure that out through trial and error would be > unfortunate. Per above, it might be easiest to just tag pipelines with a pixel > format instead of trying to include the pixel format conversion as a color op. > I definitely think this is going to be needed. That said, this also means that compositors that don't know how to configure this pipeline might not be able to use the format. If we take the FP16 example again, there may be some sort of default programming to allow the hardware to absorb the content, but avoiding clipping of the content couldn't be guaranteed. We would end up having a functional pipeline, but the output could be less than ideal. It really will depend on how the input content is packed. >>> "Without the need to define a new type" is something I think we need to >>> consider case by case. I have a hard time giving a general opinion. >>> >>>>>>> >>>>>>> Counter-example 2: image size scaling colorop. It might not be >>>>>>> configurable, it is controlled by the plane CRTC_* and SRC_* >>>>>>> properties. You still need to understand what it does, so you can >>>>>>> arrange the scaling to work correctly. (Do not want to scale an image >>>>>>> with PQ-encoded values as Josh demonstrated in XDC.) >>>>>>> >>>>>> >>>>>> IMO the position of the scaling operation is the thing that's important >>>>>> here as the color pipeline won't define scaling properties. >>>> >>>> I agree that blending should ideally be done in linear space, and I remember >>>> that from Josh's presentation at XDC, but I don't recall the same being said for >>>> scaling. In fact, the NVIDIA pre-blending scaler exists in a stage of the >>>> pipeline that is meant to be in PQ space (more on this below), and that was >>>> found to achieve better results at HDR/SDR boundaries.
Of course, this only >>>> bolsters the argument that it would be helpful to have an informational "scaler" >>>> element to understand at which stage scaling takes place. >>> >>> Both blending and scaling are fundamentally the same operation: you >>> have two or more source colors (pixels), and you want to compute a >>> weighted average of them following what happens in nature, that is, >>> physics, as that is what humans are used to. >>> >>> Both blending and scaling will suffer from the same problems if the >>> operation is performed on not light-linear values. The result of the >>> weighted average does not correspond to physics. >>> >>> The problem may be hard to observe with natural imagery, but Josh's >>> example shows it very clearly. Maybe that effect is sometimes useful >>> for some imagery in some use cases, but it is still an accidental >>> side-effect. You might get even better results if you don't rely on >>> accidental side-effects but design a separate operation for the exact >>> goal you have. >>> >>> Mind, by scaling we mean changing image size. Not scaling color values. >>> > > Fair enough, but it might not always be a choice given the hardware. > Agreeing with Alex here. I get there is some debate over the best way to do this, but I think it is best to leave it up to the driver to declare how that is done. >>>>>>> Counter-example 3: image sampling colorop. Averages FB originated color >>>>>>> values to produce a color sample. Again do not want to do this with >>>>>>> PQ-encoded values. >>>>>>> >>>>>> >>>>>> Wouldn't this only happen during a scaling op? >>>>> >>>>> There is certainly some overlap between examples 2 and 3. IIRC SRC_X/Y >>>>> coordinates can be fractional, which makes nearest vs. bilinear >>>>> sampling have a difference even if there is no scaling. >>>>> >>>>> There is also the question of chroma siting with sub-sampled YUV. I >>>>> don't know how that actually works, or how it theoretically should work. 
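The point above about weighted averages on non-light-linear values can be shown with a toy example (gamma 2.2 as a simple stand-in encoding; illustrative only, not tied to any specific hardware):

```python
# Averaging two encoded values is not the encoding of the averaged
# light. This is the root of the blending/scaling artifacts discussed
# above (Josh's XDC demonstration).
def encode(x, g=2.2):
    return x ** (1.0 / g)

def decode(x, g=2.2):
    return x ** g

def avg_encoded(a, b):
    """What naive filtering in the encoded (non-linear) domain computes."""
    return (encode(a) + encode(b)) / 2

def avg_linear(a, b):
    """The physically motivated result: average light, then encode."""
    return encode((a + b) / 2)
```

For a 50/50 mix of black and white, the encoded-domain average lands at 0.5 while the light-linear average encodes to roughly 0.73, which is why a scaler or blender placed after a non-linear encoding produces visibly darker transitions.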
>>>> >>>> We have some operations in our pipeline that are intended to be static, i.e. a >>>> static matrix that converts from RGB to LMS, and later another that converts >>>> from LMS to ICtCp. There are even LUTs that are intended to be static, >>>> converting from linear to PQ and vice versa. All of this is because the >>>> pre-blending scaler and tone mapping operator are intended to operate in ICtCp >>>> PQ space. Although the stated LUTs and matrices are intended to be static, they >>>> are actually programmable. In offline discussions, it was indicated that it >>>> would be helpful to actually expose the programmability, as opposed to exposing >>>> them as non-bypassable blocks, as some compositors may have novel uses for them. >>> >>> Correct. Doing tone-mapping in ICtCp etc. are already policy that >>> userspace might or might not agree with. >>> >>> Exposing static colorops will help usages that adhere to current >>> prevalent standards around very specific use cases. There may be >>> millions of devices needing exactly that processing in their usage, but >>> it is also quite limiting in what one can do with the hardware. >>> >>>> Despite being programmable, the LUTs are updated in a manner that is less >>>> efficient as compared to e.g. the non-static "degamma" LUT. Would it be helpful >>>> if there was some way to tag operations according to their performance, >>>> for example so that clients can prefer a high performance one when they >>>> intend to do an animated transition? I recall from the XDC HDR workshop >>>> that this is also an issue with AMD's 3DLUT, where updates can be too >>>> slow to animate. >>> >>> I can certainly see such information being useful, but then we need to >>> somehow quantize the performance. > > Right, which wouldn't even necessarily be universal, could depend on the given > host, GPU, etc. It could just be a relative performance indication, to give an > order of preference. 
That wouldn't tell you if it can or can't be animated, but > when choosing between two LUTs to animate you could prefer the higher > performance one. > >>> >>> What I was left puzzled about after the XDC workshop is that is it >>> possible to pre-load configurations in the background (slow), and then >>> quickly switch between them? Hardware-wise I mean. > > This works fine for our "fast" LUTs, you just point them to a surface in video > memory and they flip to it. You could keep multiple surfaces around and flip > between them without having to reprogram them in software. We can easily do that > with enumerated curves, populating them when the driver initializes instead of > waiting for the client to request them. You can even point multiple hardware > LUTs to the same video memory surface, if they need the same curve. > >> >> We could define that pipelines with a lower ID are to be preferred over >> higher IDs. > > Sure, but this isn't just an issue with a pipeline as a whole, but the > individual elements within it and how to use them in a given context. > >> >> The issue is that if programming a pipeline becomes too slow to be >> useful it probably should just not be made available to user space. > > It's not that programming the pipeline is overall too slow. The LUTs we have > that are relatively slow to program are meant to be set infrequently, or even > just once, to allow the scaler and tone mapping operator to operate in fixed > point PQ space. You might still want the tone mapper, so you would choose a > pipeline that includes them, but when it comes to e.g. animating a night light, > you would want to choose a different LUT for that purpose. > >> >> The prepare-commit idea for blob properties would help to make the >> pipelines usable again, but until then it's probably a good idea to just >> not expose those pipelines. 
> > The prepare-commit idea actually wouldn't work for these LUTs, because they are > programmed using methods instead of pointing them to a surface. I'm actually not > sure how slow it actually is, would need to benchmark it. I think not exposing > them at all would be overkill, since it would mean you can't use the preblending > scaler or tonemapper, and animation isn't necessary for that. > > The AMD 3DLUT is another example of a LUT that is slow to update, and it would > obviously be a major loss if that wasn't exposed. There just needs to be some > way for clients to know if they are going to kill performance by trying to > change it every frame. > > Thanks, > Alex > To clarify, what are we defining as slow to update here? Something we aren't able to update within a frame (let's say at a low frame rate such as 30 fps for discussion's sake)? A block that requires a programming sequence of disable + program + enable to update? Defining performance seems like it can get murky if we start to consider frame concurrent updates among multiple color blocks as well. Thanks, Christopher >> >>> >>> >>> Thanks, >>> pq >> >>
On 2023-10-25 16:16, Alex Goins wrote: > Thank you Harry and all other contributors for your work on this. Responses > inline - > Thanks for your comments on this. Apologies for the late response. I was focussing on the simpler responses to my patch set first and left yours for last as it's the most interesting. > On Mon, 23 Oct 2023, Pekka Paalanen wrote: > >> On Fri, 20 Oct 2023 11:23:28 -0400 >> Harry Wentland <harry.wentland@amd.com> wrote: >> >>> On 2023-10-20 10:57, Pekka Paalanen wrote: >>>> On Fri, 20 Oct 2023 16:22:56 +0200 >>>> Sebastian Wick <sebastian.wick@redhat.com> wrote: >>>> >>>>> Thanks for continuing to work on this! >>>>> >>>>> On Thu, Oct 19, 2023 at 05:21:22PM -0400, Harry Wentland wrote: >>>>>> v2: >>>>>> - Update colorop visualizations to match reality (Sebastian, Alex Hung) >>>>>> - Updated wording (Pekka) >>>>>> - Change BYPASS wording to make it non-mandatory (Sebastian) >>>>>> - Drop cover-letter-like paragraph from COLOR_PIPELINE Plane Property >>>>>> section (Pekka) >>>>>> - Use PQ EOTF instead of its inverse in Pipeline Programming example (Melissa) >>>>>> - Add "Driver Implementer's Guide" section (Pekka) >>>>>> - Add "Driver Forward/Backward Compatibility" section (Sebastian, Pekka) >>>> >>>> ... >>>> >>>>>> +An example of a drm_colorop object might look like one of these:: >>>>>> + >>>>>> + /* 1D enumerated curve */ >>>>>> + Color operation 42 >>>>>> + ├─ "TYPE": immutable enum {1D enumerated curve, 1D LUT, 3x3 matrix, 3x4 matrix, 3D LUT, etc.} = 1D enumerated curve >>>>>> + ├─ "BYPASS": bool {true, false} >>>>>> + ├─ "CURVE_1D_TYPE": enum {sRGB EOTF, sRGB inverse EOTF, PQ EOTF, PQ inverse EOTF, …} >>>>>> + └─ "NEXT": immutable color operation ID = 43 > > I know these are just examples, but I would also like to suggest the possibility > of an "identity" CURVE_1D_TYPE. BYPASS = true might get different results > compared to setting an identity in some cases depending on the hardware.
See > below for more on this, RE: implicit format conversions. > > Although NVIDIA hardware doesn't use a ROM for enumerated curves, it came up in > offline discussions that it would nonetheless be helpful to expose enumerated > curves in order to hide the vendor-specific complexities of programming > segmented LUTs from clients. In that case, we would simply refer to the > enumerated curve when calculating/choosing segmented LUT entries. > > Another thing that came up in offline discussions is that we could use multiple > color operations to program a single operation in hardware. As I understand it, > AMD has a ROM-defined LUT, followed by a custom 4K entry LUT, followed by an > "HDR Multiplier". On NVIDIA we don't have these as separate hardware stages, but > we could combine them into a singular LUT in software, such that you can combine > e.g. segmented PQ EOTF with night light. One caveat is that you will lose > precision from the custom LUT where it overlaps with the linear section of the > enumerated curve, but that is unavoidable and shouldn't be an issue in most > use-cases. > FWIW, for the most part we don't have ROMs followed by custom LUTs. We have either a ROM-based HW block or a segmented programmable LUT. In the case of the former we will only expose named transfer functions. In the case of the latter we expose a named TF, followed by a custom LUT, and merge them into one segmented LUT. > Actually, the current examples in the proposal don't include a multiplier color > op, which might be useful. For AMD as above, but also for NVIDIA as the > following issue arises: > The current examples are only examples. A multiplier color op would make a lot of sense. > As discussed further below, the NVIDIA "degamma" LUT performs an implicit fixed > point to FP16 conversion. In that conversion, what fixed point 0xFFFFFFFF maps > to in floating point varies depending on the source content.
If it's SDR > content, we want the max value in FP16 to be 1.0 (80 nits), subject to a > potential boost multiplier if we want SDR content to be brighter. If it's HDR PQ > content, we want the max value in FP16 to be 125.0 (10,000 nits). My assumption > is that this is also what AMD's "HDR Multiplier" stage is used for, is that > correct? > Our PQ transfer function will also map to [0.0, 125.0] without use of the HDR multiplier. The HDR multiplier is intended to be used to scale SDR brightness when the user moves the SDR brightness slider in the OS. > From the given enumerated curves, it's not clear how they would map to the > above. Should sRGB EOTF have a max FP16 value of 1.0, and the PQ EOTF a max FP16 > value of 125.0? That may work, but it tends towards the "descriptive" notion of > assuming the source content, which may not be accurate in all cases. Yes, I think we need to be clear about the output range of a named transfer function. While AMD and NVIDIA map PQ to [0.0, 125.0], I could see others mapping it to [0.0, 1.0] (and maybe scaling sRGB down to 1/125.0 or some other value). > This is > also an issue for the custom 1D LUT, as the blob will need to be converted to > FP16 in order to populate our "degamma" LUT. What should the resulting max FP16 > value be, given that we no longer have any hint as to the source content? > I consider input data to be in UNORM and convert that to [0.0, 1.0]. Transfer functions (such as PQ) might then scale that beyond the [0.0, 1.0] range. > I think a multiplier color op solves all of these issues. Named curves and > custom 1D LUTs would by default assume a max FP16 value of 1.0, which would then > be adjusted by the multiplier. For 80 nit SDR content, set it to 1, for 400 > nit SDR content, set it to 5, for HDR PQ content, set it to 125, etc. > The custom ROMs won't allow adjustment on AMD HW, so it would then need to be a fixed multiplier.
I would be in favor of defining the named PQ curve as DRM_COLOROP_1D_CURVE_PQ_125_EOTF for the [0.0, 125.0] TF, or as DRM_COLOROP_1D_CURVE_PQ_1_EOTF for HW that maps it to [0.0, 1.0] >>>>>> + >>>>>> + /* custom 4k entry 1D LUT */ >>>>>> + Color operation 52 >>>>>> + ├─ "TYPE": immutable enum {1D enumerated curve, 1D LUT, 3x3 matrix, 3x4 matrix, 3D LUT, etc.} = 1D LUT >>>>>> + ├─ "BYPASS": bool {true, false} >>>>>> + ├─ "LUT_1D_SIZE": immutable range = 4096 >>>>>> + ├─ "LUT_1D": blob >>>>>> + └─ "NEXT": immutable color operation ID = 0 >>>> >>>> ... >>>> >>>>>> +Driver Forward/Backward Compatibility >>>>>> +===================================== >>>>>> + >>>>>> +As this is uAPI drivers can't regress color pipelines that have been >>>>>> +introduced for a given HW generation. New HW generations are free to >>>>>> +abandon color pipelines advertised for previous generations. >>>>>> +Nevertheless, it can be beneficial to carry support for existing color >>>>>> +pipelines forward as those will likely already have support in DRM >>>>>> +clients. >>>>>> + >>>>>> +Introducing new colorops to a pipeline is fine, as long as they can be >>>>>> +disabled or are purely informational. DRM clients implementing support >>>>>> +for the pipeline can always skip unknown properties as long as they can >>>>>> +be confident that doing so will not cause unexpected results. >>>>>> + >>>>>> +If a new colorop doesn't fall into one of the above categories >>>>>> +(bypassable or informational) the modified pipeline would be unusable >>>>>> +for user space. In this case a new pipeline should be defined. >>>>> >>>>> How can user space detect an informational element? Should we just add a >>>>> BYPASS property to informational elements, make it read only and set to >>>>> true maybe? Or something more descriptive? >>>> >>>> Read-only BYPASS set to true would be fine by me, I guess. >>>> >>> >>> Don't you mean set to false? 
An informational element will always do >>> something, so it can't be bypassed. >> >> Yeah, this is why we need a definition. I understand "informational" to >> not change pixel values in any way. Previously I had some weird idea >> that scaling doesn't alter color, but of course it may. > > On recent hardware, the NVIDIA pre-blending pipeline includes LUTs that do > implicit fixed-point to FP16 conversions, and vice versa. > > For example, the "degamma" LUT towards the beginning of the pipeline implicitly > converts from fixed point to FP16, and some of the following operations expect > to operate in FP16. As such, if you have a fixed point input and don't bypass > those following operations, you *must not* bypass the LUT, even if you are > otherwise just programming it with the identity. Conversely, if you have a > floating point input, you *must* bypass the LUT. > > Could informational elements and allowing the exclusion of the BYPASS property > be used to convey this information to the client? For example, we could expose > one pipeline with the LUT exposed with read-only BYPASS set to false, and > sandwich it with informational "Fixed Point" and "FP16" elements to accommodate > fixed point input. Then, expose another pipeline with the LUT missing, and an > informational "FP16" element in its place to accommodate floating point input. > I wonder if an informational element at the beginning of the pipeline can advertise the FOURCC formats this pipeline can operate on. For AMD HW we also have certain things we can only do on RGB and not on NV12, for example. > That's just an example; we also have other operations in the pipeline that do > similar implicit conversions. In these cases we don't want the operations to be > bypassed individually, so instead we would expose them as mandatory in some > pipelines and missing in others, with informational elements to help inform the > client of which to choose. Is that acceptable under the current proposal? 
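In the RFC's own property-listing notation, the two alternative pipelines sketched above might look like the following; the operation IDs and the "Fixed Point"/"FP16" informational types are invented here purely for illustration:

```
/* pipeline accommodating fixed point input */
Color operation 60
├─ "TYPE": immutable enum = Informational: Fixed Point
└─ "NEXT": immutable color operation ID = 61
Color operation 61
├─ "TYPE": immutable enum = 1D LUT
├─ "BYPASS": immutable bool = false   /* read-only, must not be bypassed */
├─ "LUT_1D": blob
└─ "NEXT": immutable color operation ID = 62
Color operation 62
├─ "TYPE": immutable enum = Informational: FP16
└─ "NEXT": immutable color operation ID = 0

/* pipeline accommodating FP16 input: the LUT is simply absent */
Color operation 70
├─ "TYPE": immutable enum = Informational: FP16
└─ "NEXT": immutable color operation ID = 0
```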
> > Note that in this case, the information just has to do with what format the > pixels should be in, it doesn't correspond to any specific operation. So, I'm > not sure that BYPASS has any meaning for informational elements in this context. > >>>> I think we also need a definition of "informational". >>>> >>>> Counter-example 1: a colorop that represents a non-configurable >>> >>> Not sure what's "counter" for these examples? >>> >>>> YUV<->RGB conversion. Maybe it determines its operation from FB pixel >>>> format. It cannot be set to bypass, it cannot be configured, and it >>>> will alter color values. > > Would it be reasonable to expose this is a 3x4 matrix with a read-only blob and > no BYPASS property? I already brought up a similar idea at the XDC HDR Workshop > based on the principle that read-only blobs could be used to express some static > pipeline elements without the need to define a new type, but got mixed opinions. > I think this demonstrates the principle further, as clients could detect this > programmatically instead of having to special-case the informational element. > That's an option. But I think a "named matrix" type might make more sense so you don't need to create a pipeline for each read-only matrix and so userspace doesn't need to parse the read-only matrix to find out which conversion it does. >>>> >>>> Counter-example 2: image size scaling colorop. It might not be >>>> configurable, it is controlled by the plane CRTC_* and SRC_* >>>> properties. You still need to understand what it does, so you can >>>> arrange the scaling to work correctly. (Do not want to scale an image >>>> with PQ-encoded values as Josh demonstrated in XDC.) >>>> >>> >>> IMO the position of the scaling operation is the thing that's important >>> here as the color pipeline won't define scaling properties. 
> > I agree that blending should ideally be done in linear space, and I remember > that from Josh's presentation at XDC, but I don't recall the same being said for > scaling. In fact, the NVIDIA pre-blending scaler exists in a stage of the > pipeline that is meant to be in PQ space (more on this below), and that was > found to achieve better results at HDR/SDR boundaries. Of course, this only > bolsters the argument that it would be helpful to have an informational "scaler" > element to understand at which stage scaling takes place. > I think an informational scaler makes sense. It's interesting how different HW vendors made different design decisions here as no OS ever really defined which space they want scaling to be performed in. >>>> Counter-example 3: image sampling colorop. Averages FB originated color >>>> values to produce a color sample. Again do not want to do this with >>>> PQ-encoded values. >>>> >>> >>> Wouldn't this only happen during a scaling op? >> >> There is certainly some overlap between examples 2 and 3. IIRC SRC_X/Y >> coordinates can be fractional, which makes nearest vs. bilinear >> sampling have a difference even if there is no scaling. >> >> There is also the question of chroma siting with sub-sampled YUV. I >> don't know how that actually works, or how it theoretically should work. > > We have some operations in our pipeline that are intended to be static, i.e. a > static matrix that converts from RGB to LMS, and later another that converts > from LMS to ICtCp. There are even LUTs that are intended to be static, > converting from linear to PQ and vice versa. All of this is because the > pre-blending scaler and tone mapping operator are intended to operate in ICtCp > PQ space. Although the stated LUTs and matrices are intended to be static, they > are actually programmable. 
In offline discussions, it was indicated that it > would be helpful to actually expose the programmability, as opposed to exposing > them as non-bypassable blocks, as some compositors may have novel uses for them. > > Despite being programmable, the LUTs are updated in a manner that is less > efficient as compared to e.g. the non-static "degamma" LUT. Would it be helpful > if there was some way to tag operations according to their performance, > for example so that clients can prefer a high performance one when they > intend to do an animated transition? I recall from the XDC HDR workshop > that this is also an issue with AMD's 3DLUT, where updates can be too > slow to animate. > That's an interesting idea. Harry > Thanks, > Alex Goins > NVIDIA Linux Driver Team > >> Thanks, >> pq
On 2023-10-26 04:57, Pekka Paalanen wrote: > On Wed, 25 Oct 2023 15:16:08 -0500 (CDT) > Alex Goins <agoins@nvidia.com> wrote: > >> Thank you Harry and all other contributors for your work on this. Responses >> inline - >> >> On Mon, 23 Oct 2023, Pekka Paalanen wrote: >> >>> On Fri, 20 Oct 2023 11:23:28 -0400 >>> Harry Wentland <harry.wentland@amd.com> wrote: >>> >>>> On 2023-10-20 10:57, Pekka Paalanen wrote: >>>>> On Fri, 20 Oct 2023 16:22:56 +0200 >>>>> Sebastian Wick <sebastian.wick@redhat.com> wrote: >>>>> >>>>>> Thanks for continuing to work on this! >>>>>> >>>>>> On Thu, Oct 19, 2023 at 05:21:22PM -0400, Harry Wentland wrote: >>>>>>> v2: >>>>>>> - Update colorop visualizations to match reality (Sebastian, Alex Hung) >>>>>>> - Updated wording (Pekka) >>>>>>> - Change BYPASS wording to make it non-mandatory (Sebastian) >>>>>>> - Drop cover-letter-like paragraph from COLOR_PIPELINE Plane Property >>>>>>> section (Pekka) >>>>>>> - Use PQ EOTF instead of its inverse in Pipeline Programming example (Melissa) >>>>>>> - Add "Driver Implementer's Guide" section (Pekka) >>>>>>> - Add "Driver Forward/Backward Compatibility" section (Sebastian, Pekka) >>>>> >>>>> ... >>>>> >>>>>>> +An example of a drm_colorop object might look like one of these:: >>>>>>> + >>>>>>> + /* 1D enumerated curve */ >>>>>>> + Color operation 42 >>>>>>> + ├─ "TYPE": immutable enum {1D enumerated curve, 1D LUT, 3x3 matrix, 3x4 matrix, 3D LUT, etc.} = 1D enumerated curve >>>>>>> + ├─ "BYPASS": bool {true, false} >>>>>>> + ├─ "CURVE_1D_TYPE": enum {sRGB EOTF, sRGB inverse EOTF, PQ EOTF, PQ inverse EOTF, …} >>>>>>> + └─ "NEXT": immutable color operation ID = 43 >> >> I know these are just examples, but I would also like to suggest the possibility >> of an "identity" CURVE_1D_TYPE. BYPASS = true might get different results >> compared to setting an identity in some cases depending on the hardware. See >> below for more on this, RE: implicit format conversions. 
>> >> Although NVIDIA hardware doesn't use a ROM for enumerated curves, it came up in >> offline discussions that it would nonetheless be helpful to expose enumerated >> curves in order to hide the vendor-specific complexities of programming >> segmented LUTs from clients. In that case, we would simply refer to the >> enumerated curve when calculating/choosing segmented LUT entries. > > That's a good idea. > >> Another thing that came up in offline discussions is that we could use multiple >> color operations to program a single operation in hardware. As I understand it, >> AMD has a ROM-defined LUT, followed by a custom 4K entry LUT, followed by an >> "HDR Multiplier". On NVIDIA we don't have these as separate hardware stages, but >> we could combine them into a singular LUT in software, such that you can combine >> e.g. segmented PQ EOTF with night light. One caveat is that you will lose >> precision from the custom LUT where it overlaps with the linear section of the >> enumerated curve, but that is unavoidable and shouldn't be an issue in most >> use-cases. > > Indeed. > >> Actually, the current examples in the proposal don't include a multiplier color >> op, which might be useful. For AMD as above, but also for NVIDIA as the >> following issue arises: >> >> As discussed further below, the NVIDIA "degamma" LUT performs an implicit fixed >> point to FP16 conversion. In that conversion, what fixed point 0xFFFFFFFF maps >> to in floating point varies depending on the source content. If it's SDR >> content, we want the max value in FP16 to be 1.0 (80 nits), subject to a >> potential boost multiplier if we want SDR content to be brighter. If it's HDR PQ >> content, we want the max value in FP16 to be 125.0 (10,000 nits). My assumption >> is that this is also what AMD's "HDR Multiplier" stage is used for, is that >> correct? > > It would be against the UAPI design principles to tag content as HDR or > SDR. 
What you can do instead is to expose a colorop with a multiplier of > 1.0 or 125.0 to match your hardware behaviour, then tell your hardware > that the input is SDR or HDR to get the expected multiplier. You will > never know what the content actually is, anyway. > > Of course, if we want to have a arbitrary multiplier colorop that is > somewhat standard, as in, exposed by many drivers to ease userspace > development, you can certainly use any combination of your hardware > features you need to realize the UAPI prescribed mathematical operation. > > Since we are talking about floating-point in hardware, a multiplier > does not significantly affect precision. > > In order to mathematically define all colorops, I believe it is > necessary to define all colorops in terms of floating-point values (as > in math), even if they operate on fixed-point or integer. By this I > mean that if the input is 8 bpc unsigned integer pixel format for > instance, 0 raw pixel channel value is mapped to 0.0 and 255 is mapped > to 1.0, and the color pipeline starts with [0.0, 1.0], not [0, 255] > domain. We have to agree on this mapping for all channels on all pixel > formats. However, there is a "but" further below. > > I also propose that quantization range is NOT considered in the raw > value mapping, so that we can handle quantization range in colorops > explicitly, allowing us to e.g. handle sub-blacks and super-whites when > necessary. (These are currently impossible to represent in the legacy > color properties, because everything is converted to full range and > clipped before any color operations.) > I pretty much agree with anything you say up to here. :) >> From the given enumerated curves, it's not clear how they would map to the >> above. Should sRGB EOTF have a max FP16 value of 1.0, and the PQ EOTF a max FP16 >> value of 125.0? That may work, but it tends towards the "descriptive" notion of >> assuming the source content, which may not be accurate in all cases. 
This is >> also an issue for the custom 1D LUT, as the blob will need to be converted to >> FP16 in order to populate our "degamma" LUT. What should the resulting max FP16 >> value be, given that we no longer have any hint as to the source content? > > In my opinion, all finite non-negative transfer functions should > operate with [0.0, 1.0] domain and [0.0, 1.0] range, and that includes > all sRGB, power 2.2, and PQ curves. > That wouldn't work with AMD HW that encodes a PQ transfer function that has an output range of [0.0, 125.0]. I suggest making the range a part of the named TF definition. > If we look at BT.2100, there is no such encoding even mentioned where > 125.0 would correspond to 10k cd/m². That 125.0 convention already has > a built-in assumption what the color spaces are and what the conversion > is aiming to do. IOW, I would say that choice is opinionated from the > start. The multiplier in BT.2100 is always 10000. > Sure, the choice is opinionated but a certain large OS vendor has had a large influence in how HW vendors designed their color pipelines. snip >> On recent hardware, the NVIDIA pre-blending pipeline includes LUTs that do >> implicit fixed-point to FP16 conversions, and vice versa. > > Above, I claimed that the UAPI should be defined in nominal > floating-point values, but I wonder, would that work? Would we need to > have explicit colorops for converting from raw pixel data values into > nominal floating-point in the UAPI? > I think it's important that we keep a level of abstraction at the driver level. I'm not sure it would serve anyone if we defined this. snip >>>>> I think we also need a definition of "informational". >>>>> >>>>> Counter-example 1: a colorop that represents a non-configurable >>>> >>>> Not sure what's "counter" for these examples? >>>> >>>>> YUV<->RGB conversion. Maybe it determines its operation from FB pixel >>>>> format. It cannot be set to bypass, it cannot be configured, and it >>>>> will alter color values.
>> >> Would it be reasonable to expose this is a 3x4 matrix with a read-only blob and >> no BYPASS property? I already brought up a similar idea at the XDC HDR Workshop >> based on the principle that read-only blobs could be used to express some static >> pipeline elements without the need to define a new type, but got mixed opinions. >> I think this demonstrates the principle further, as clients could detect this >> programmatically instead of having to special-case the informational element. > > If the blob depends on the pixel format (i.e. the driver automatically > chooses a different blob per pixel format), then I think we would need > to expose all the blobs and how they correspond to pixel formats. > Otherwise ok, I guess. > > However, do we want or need to make a color pipeline or colorop > conditional on pixel formats? For example, if you use a YUV 4:2:0 type > of pixel format, then you must use this pipeline and not any other. Or > floating-point type of pixel format. I did not anticipate this before, > I assumed that all color pipelines and colorops are independent of the > framebuffer pixel format. A specific colorop might have a property that > needs to agree with the framebuffer pixel format, but I didn't expect > further limitations. > Yes, I think we'll want that. > "Without the need to define a new type" is something I think we need to > consider case by case. I have a hard time giving a general opinion. > >>>>> >>>>> Counter-example 2: image size scaling colorop. It might not be >>>>> configurable, it is controlled by the plane CRTC_* and SRC_* >>>>> properties. You still need to understand what it does, so you can >>>>> arrange the scaling to work correctly. (Do not want to scale an image >>>>> with PQ-encoded values as Josh demonstrated in XDC.) >>>>> >>>> >>>> IMO the position of the scaling operation is the thing that's important >>>> here as the color pipeline won't define scaling properties. 
>> >> I agree that blending should ideally be done in linear space, and I remember >> that from Josh's presentation at XDC, but I don't recall the same being said for >> scaling. In fact, the NVIDIA pre-blending scaler exists in a stage of the >> pipeline that is meant to be in PQ space (more on this below), and that was >> found to achieve better results at HDR/SDR boundaries. Of course, this only >> bolsters the argument that it would be helpful to have an informational "scaler" >> element to understand at which stage scaling takes place. > > Both blending and scaling are fundamentally the same operation: you > have two or more source colors (pixels), and you want to compute a > weighted average of them following what happens in nature, that is, > physics, as that is what humans are used to. > > Both blending and scaling will suffer from the same problems if the > operation is performed on not light-linear values. The result of the > weighted average does not correspond to physics. > > The problem may be hard to observe with natural imagery, but Josh's > example shows it very clearly. Maybe that effect is sometimes useful > for some imagery in some use cases, but it is still an accidental > side-effect. You might get even better results if you don't rely on > accidental side-effects but design a separate operation for the exact > goal you have. > Many people looked at this problem inside AMD and probably at other companies. Not all of them arrive at the same conclusion. The type of image will also greatly affect what one considers better. But it sounds like we'll need an informational scaling element at least for compositors that care. Do we need that as a first iteration of a working DRM/KMS solution, though? So far other OSes have not cared and people have (probably) not complained about it. snip >> Despite being programmable, the LUTs are updated in a manner that is less >> efficient as compared to e.g. the non-static "degamma" LUT. 
Would it be helpful >> if there was some way to tag operations according to their performance, >> for example so that clients can prefer a high performance one when they >> intend to do an animated transition? I recall from the XDC HDR workshop >> that this is also an issue with AMD's 3DLUT, where updates can be too >> slow to animate. > > I can certainly see such information being useful, but then we need to > somehow quantize the performance. > > What I was left puzzled about after the XDC workshop is that is it > possible to pre-load configurations in the background (slow), and then > quickly switch between them? Hardware-wise I mean. > On AMD HW, yes. How to fit that into the atomic API is a separate question. :D Harry > > Thanks, > pq
On 2023-10-26 13:30, Sebastian Wick wrote: > On Thu, Oct 26, 2023 at 11:57:47AM +0300, Pekka Paalanen wrote: >> On Wed, 25 Oct 2023 15:16:08 -0500 (CDT) >> Alex Goins <agoins@nvidia.com> wrote: >> >>> Thank you Harry and all other contributors for your work on this. Responses >>> inline - >>> >>> On Mon, 23 Oct 2023, Pekka Paalanen wrote: >>> >>>> On Fri, 20 Oct 2023 11:23:28 -0400 >>>> Harry Wentland <harry.wentland@amd.com> wrote: >>>> >>>>> On 2023-10-20 10:57, Pekka Paalanen wrote: >>>>>> On Fri, 20 Oct 2023 16:22:56 +0200 >>>>>> Sebastian Wick <sebastian.wick@redhat.com> wrote: >>>>>> >>>>>>> Thanks for continuing to work on this! >>>>>>> >>>>>>> On Thu, Oct 19, 2023 at 05:21:22PM -0400, Harry Wentland wrote: snip >> >>>>>> I think we also need a definition of "informational". >>>>>> >>>>>> Counter-example 1: a colorop that represents a non-configurable >>>>> >>>>> Not sure what's "counter" for these examples? >>>>> >>>>>> YUV<->RGB conversion. Maybe it determines its operation from FB pixel >>>>>> format. It cannot be set to bypass, it cannot be configured, and it >>>>>> will alter color values. >>> >>> Would it be reasonable to expose this is a 3x4 matrix with a read-only blob and >>> no BYPASS property? I already brought up a similar idea at the XDC HDR Workshop >>> based on the principle that read-only blobs could be used to express some static >>> pipeline elements without the need to define a new type, but got mixed opinions. >>> I think this demonstrates the principle further, as clients could detect this >>> programmatically instead of having to special-case the informational element. >> > > I'm all for exposing fixed color ops but I suspect that most of those > follow some standard and in those cases instead of exposing the matrix > values one should prefer to expose a named matrix (e.g. BT.601, BT.709, > BT.2020). > Agreed. > As a general rule: always expose the highest level description. 
Going > from a name to exact values is trivial, going from values to a name is > much harder. > >> If the blob depends on the pixel format (i.e. the driver automatically >> chooses a different blob per pixel format), then I think we would need >> to expose all the blobs and how they correspond to pixel formats. >> Otherwise ok, I guess. >> >> However, do we want or need to make a color pipeline or colorop >> conditional on pixel formats? For example, if you use a YUV 4:2:0 type >> of pixel format, then you must use this pipeline and not any other. Or >> floating-point type of pixel format. I did not anticipate this before, >> I assumed that all color pipelines and colorops are independent of the >> framebuffer pixel format. A specific colorop might have a property that >> needs to agree with the framebuffer pixel format, but I didn't expect >> further limitations. > > We could simply fail commits when the pipeline and pixel format don't > work together. We'll probably need some kind of ingress no-op node > anyway and maybe could list pixel formats there if required to make it > easier to find a working configuration. > The problem with failing commits is that user-space has no idea why it failed. If this means that userspace falls back to SW composition for NV12 and P010 it would avoid HW offloading in one of the most important use-cases on AMD HW for power-saving purposes. snip >>> Despite being programmable, the LUTs are updated in a manner that is less >>> efficient as compared to e.g. the non-static "degamma" LUT. Would it be helpful >>> if there was some way to tag operations according to their performance, >>> for example so that clients can prefer a high performance one when they >>> intend to do an animated transition? I recall from the XDC HDR workshop >>> that this is also an issue with AMD's 3DLUT, where updates can be too >>> slow to animate. 
>> >> I can certainly see such information being useful, but then we need to >> somehow quantize the performance. >> >> What I was left puzzled about after the XDC workshop is that is it >> possible to pre-load configurations in the background (slow), and then >> quickly switch between them? Hardware-wise I mean. > > We could define that pipelines with a lower ID are to be preferred over > higher IDs. > > The issue is that if programming a pipeline becomes too slow to be > useful it probably should just not be made available to user space. > > The prepare-commit idea for blob properties would help to make the > pipelines usable again, but until then it's probably a good idea to just > not expose those pipelines. > It's a bit of a judgment call what's too slow, though. The value of having a HW colorop might outweigh the cost of the programming time for some compositors but not for others. Harry >> >> >> Thanks, >> pq > >
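To illustrate the "named matrix" point from earlier in this thread: going from a name such as BT.709 to exact 3x4 entries (3x3 plus an offset column) is trivial, while recovering the name from twelve blob values would require fuzzy matching by userspace. The coefficients below are full-range BT.709 YCbCr-to-RGB values derived from Kr = 0.2126, Kb = 0.0722; the array and helper are illustrative sketches, not proposed uAPI:

```c
#include <assert.h>
#include <math.h>

/*
 * Full-range BT.709 YCbCr -> RGB, with Y in [0.0, 1.0] and Cb/Cr
 * centered on 0.0. The fourth column is an offset added after the
 * 3x3 part, unused for this particular conversion.
 */
static const double bt709_ycbcr_to_rgb[3][4] = {
	{ 1.0,  0.0,      1.5748,  0.0 },
	{ 1.0, -0.18733, -0.46813, 0.0 },
	{ 1.0,  1.8556,   0.0,     0.0 },
};

static void apply_3x4(const double m[3][4], const double in[3],
		      double out[3])
{
	for (int r = 0; r < 3; r++)
		out[r] = m[r][0] * in[0] + m[r][1] * in[1] +
			 m[r][2] * in[2] + m[r][3];
}
```

A read-only blob could carry exactly these twelve values, but a client would then have to compare them against known standards to learn that the colorop performs a BT.709 conversion; a named matrix states that directly.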
On 2023-10-26 15:25, Alex Goins wrote: > On Thu, 26 Oct 2023, Sebastian Wick wrote: > >> On Thu, Oct 26, 2023 at 11:57:47AM +0300, Pekka Paalanen wrote: >>> On Wed, 25 Oct 2023 15:16:08 -0500 (CDT) >>> Alex Goins <agoins@nvidia.com> wrote: >>> >>>> Thank you Harry and all other contributors for your work on this. Responses >>>> inline - >>>> >>>> On Mon, 23 Oct 2023, Pekka Paalanen wrote: >>>> >>>>> On Fri, 20 Oct 2023 11:23:28 -0400 >>>>> Harry Wentland <harry.wentland@amd.com> wrote: >>>>> >>>>>> On 2023-10-20 10:57, Pekka Paalanen wrote: >>>>>>> On Fri, 20 Oct 2023 16:22:56 +0200 >>>>>>> Sebastian Wick <sebastian.wick@redhat.com> wrote: >>>>>>> >>>>>>>> Thanks for continuing to work on this! >>>>>>>> >>>>>>>> On Thu, Oct 19, 2023 at 05:21:22PM -0400, Harry Wentland wrote: snip >>> >>> If we look at BT.2100, there is no such encoding even mentioned where >>> 125.0 would correspond to 10k cd/m². That 125.0 convention already has >>> a built-in assumption what the color spaces are and what the conversion >>> is aiming to do. IOW, I would say that choice is opinionated from the >>> start. The multiplier in BT.2100 is always 10000. > > Be that as it may, the convention of FP16 125.0 corresponding to 10k nits is > baked in our hardware, so it's unavoidable at least for NVIDIA pipelines. > Yeah, that's not just NVidia, it's basically the same for AMD. Though I think we can work without that assumption, but the PQ TF you get from AMD will map to [0.0, 125.0]. snip >> >> We could simply fail commits when the pipeline and pixel format don't >> work together. We'll probably need some kind of ingress no-op node >> anyway and maybe could list pixel formats there if required to make it >> easier to find a working configuration. > > Yeah, we could, but having to figure that out through trial and error would be > unfortunate. Per above, it might be easiest to just tag pipelines with a pixel > format instead of trying to include the pixel format conversion as a color op. 
> Agreed. We've been looking at libliftoff a bit but one of the problems is that it does a lot of atomic checks to figure out an optimal HW plane configuration and we run out of time budget before we're able to check all options. Atomic check failure is really not well suited for this stuff. >>> "Without the need to define a new type" is something I think we need to >>> consider case by case. I have a hard time giving a general opinion. >>> >>>>>>> >>>>>>> Counter-example 2: image size scaling colorop. It might not be >>>>>>> configurable, it is controlled by the plane CRTC_* and SRC_* >>>>>>> properties. You still need to understand what it does, so you can >>>>>>> arrange the scaling to work correctly. (Do not want to scale an image >>>>>>> with PQ-encoded values as Josh demonstrated in XDC.) >>>>>>> >>>>>> >>>>>> IMO the position of the scaling operation is the thing that's important >>>>>> here as the color pipeline won't define scaling properties. >>>> >>>> I agree that blending should ideally be done in linear space, and I remember >>>> that from Josh's presentation at XDC, but I don't recall the same being said for >>>> scaling. In fact, the NVIDIA pre-blending scaler exists in a stage of the >>>> pipeline that is meant to be in PQ space (more on this below), and that was >>>> found to achieve better results at HDR/SDR boundaries. Of course, this only >>>> bolsters the argument that it would be helpful to have an informational "scaler" >>>> element to understand at which stage scaling takes place. >>> >>> Both blending and scaling are fundamentally the same operation: you >>> have two or more source colors (pixels), and you want to compute a >>> weighted average of them following what happens in nature, that is, >>> physics, as that is what humans are used to. >>> >>> Both blending and scaling will suffer from the same problems if the >>> operation is performed on not light-linear values. The result of the >>> weighted average does not correspond to physics.
>>> >>> The problem may be hard to observe with natural imagery, but Josh's >>> example shows it very clearly. Maybe that effect is sometimes useful >>> for some imagery in some use cases, but it is still an accidental >>> side-effect. You might get even better results if you don't rely on >>> accidental side-effects but design a separate operation for the exact >>> goal you have. >>> >>> Mind, by scaling we mean changing image size. Not scaling color values. >>> > > Fair enough, but it might not always be a choice given the hardware. > I'm thinking of this as an informational element, not a programmable one. Some HW could define this as programmable, but I probably wouldn't on AMD HW. snip >>> >>> What I was left puzzled about after the XDC workshop is that is it >>> possible to pre-load configurations in the background (slow), and then >>> quickly switch between them? Hardware-wise I mean. > > This works fine for our "fast" LUTs, you just point them to a surface in video > memory and they flip to it. You could keep multiple surfaces around and flip > between them without having to reprogram them in software. We can easily do that > with enumerated curves, populating them when the driver initializes instead of > waiting for the client to request them. You can even point multiple hardware > LUTs to the same video memory surface, if they need the same curve. > Ultimately I think that's the best way to solve this problem, but it needs HW that can do this. snip >> >> The prepare-commit idea for blob properties would help to make the >> pipelines usable again, but until then it's probably a good idea to just >> not expose those pipelines. > > The prepare-commit idea actually wouldn't work for these LUTs, because they are > programmed using methods instead of pointing them to a surface. I'm actually not > sure how slow it actually is, would need to benchmark it.
I think not exposing > them at all would be overkill, since it would mean you can't use the preblending > scaler or tonemapper, and animation isn't necessary for that. > I tend to agree. Maybe a "Heavy Operation" flag that tells userspace they can use it but it might come at a significant cost. Harry > The AMD 3DLUT is another example of a LUT that is slow to update, and it would > obviously be a major loss if that wasn't exposed. There just needs to be some > way for clients to know if they are going to kill performance by trying to > change it every frame. > > Thanks, > Alex > >> >>> >>> >>> Thanks, >>> pq >> >>
On 2023-11-04 19:01, Christopher Braga wrote: > Just want to loop back to before we branched off deeper into the programming performance talk > > On 10/26/2023 3:25 PM, Alex Goins wrote: >> On Thu, 26 Oct 2023, Sebastian Wick wrote: >> >>> On Thu, Oct 26, 2023 at 11:57:47AM +0300, Pekka Paalanen wrote: >>>> On Wed, 25 Oct 2023 15:16:08 -0500 (CDT) >>>> Alex Goins <agoins@nvidia.com> wrote: >>>> >>>>> Thank you Harry and all other contributors for your work on this. Responses >>>>> inline - >>>>> >>>>> On Mon, 23 Oct 2023, Pekka Paalanen wrote: >>>>> >>>>>> On Fri, 20 Oct 2023 11:23:28 -0400 >>>>>> Harry Wentland <harry.wentland@amd.com> wrote: >>>>>> >>>>>>> On 2023-10-20 10:57, Pekka Paalanen wrote: >>>>>>>> On Fri, 20 Oct 2023 16:22:56 +0200 >>>>>>>> Sebastian Wick <sebastian.wick@redhat.com> wrote: >>>>>>>> >>>>>>>>> Thanks for continuing to work on this! >>>>>>>>> >>>>>>>>> On Thu, Oct 19, 2023 at 05:21:22PM -0400, Harry Wentland wrote: snip >>>>> Actually, the current examples in the proposal don't include a multiplier color >>>>> op, which might be useful. For AMD as above, but also for NVIDIA as the >>>>> following issue arises: >>>>> >>>>> As discussed further below, the NVIDIA "degamma" LUT performs an implicit fixed > > If possible, let's declare this as two blocks. One that informatively declares the conversion is present, and another for the de-gamma. This will help with block-reuse between vendors. > >>>>> point to FP16 conversion. In that conversion, what fixed point 0xFFFFFFFF maps >>>>> to in floating point varies depending on the source content. If it's SDR >>>>> content, we want the max value in FP16 to be 1.0 (80 nits), subject to a >>>>> potential boost multiplier if we want SDR content to be brighter. If it's HDR PQ >>>>> content, we want the max value in FP16 to be 125.0 (10,000 nits). My assumption >>>>> is that this is also what AMD's "HDR Multiplier" stage is used for, is that >>>>> correct? 
>>>> >>>> It would be against the UAPI design principles to tag content as HDR or >>>> SDR. What you can do instead is to expose a colorop with a multiplier of >>>> 1.0 or 125.0 to match your hardware behaviour, then tell your hardware >>>> that the input is SDR or HDR to get the expected multiplier. You will >>>> never know what the content actually is, anyway. >> >> Right, I didn't mean to suggest that we should tag content as HDR or SDR in the >> UAPI, just relating to the end result in the pipe, ultimately it would be >> determined by the multiplier color op. >> > > A multiplier could work, but we should give OEMs the option to either make it "informative" and fixed by the hardware, or fully configurable. With the Qualcomm pipeline, how we absorb FP16 pixel buffers, as well as how we convert them to fixed-point data, actually depends on the desired de-gamma and gamma processing. So, for example: > > If a source pixel buffer is scRGB-encoded FP16 content we would expect input pixel content to be up to 7.5, with the IGC output reaching 125 as in the NVIDIA case. Likewise, gamma 2.2 encoded FP16 content would be 0-1 in and 0-1 out. > > So in the Qualcomm case the expectations are fixed depending on the use case. > > It is sounding to me like we would need to be able to declare three things here: > 1. Value range expectations *into* the de-gamma block. A multiplier wouldn't work here because it would be more of a clipping operation. I guess we would have to add an explicit clamping block as well. > 2. What the value range expectations are at the *output* of the de-gamma processing block. Also covered by using another multiplier block. > 3. Value range expectations *into* a gamma processing block. This should be covered by declaring a multiplier post-csc, but only assuming CSC output is normalized in the desired value range. A clamping block would be preferable because it describes what happens when it isn't.
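Christopher's distinction above between a multiplier block and a clamping block can be sketched with a toy model (illustrative Python only; the block names and values are hypothetical, not part of any proposed uAPI):

```python
def multiplier_block(pixels, factor):
    # A multiplier scales every value; out-of-range input stays out of range.
    return [v * factor for v in pixels]

def clamp_block(pixels, lo, hi):
    # A clamp pins values to [lo, hi], so it *defines* what happens to
    # out-of-range input, which a multiplier alone does not.
    return [min(max(v, lo), hi) for v in pixels]

# scRGB-encoded FP16 content with values up to 7.5, as in the example above
src = [0.0, 0.5, 1.0, 7.5]
normalized = multiplier_block(src, 1.0 / 7.5)

# A stray over-range value passes straight through the multiplier...
overshoot = multiplier_block([9.0], 1.0 / 7.5)
# ...while a clamp block gives it a defined result:
clamped = clamp_block(overshoot, 0.0, 1.0)
```

This is why declaring only a multiplier leaves point 1 underspecified: the multiplier rescales the nominal range but says nothing about values outside it.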
> What about adding informational input and output range properties to colorops? I think Intel's PWL definitions had something like that, but I'd have to take a look at that again. While I'm not in favor of defining segmented LUTs at the uAPI level, the input/output ranges seem to be of value. > All this is do-able, but it seems like it would require the definition of multiple color pipelines to expose the different limitations for color block configuration combinations. Additionally, would it be easy for user space to find the right pipeline? > I'm also a little concerned that some of these proposals mean we'd have to expose an inordinate number of color pipelines, and color pipeline selection becomes difficult and error-prone. snip >>>> Given that elements like various kinds of look-up tables inherently >>>> assume that the domain is [0.0, 1.0] (because it is a table that >>>> has a beginning and an end, and the usual convention is that the >>>> beginning is zero and the end is one), I think it is best to stick to >>>> the [0.0, 1.0] range where possible. If we go out of that range, then >>>> we have to define how a LUT would apply in a sensible way. >> >> In my last reply I mentioned a static (but actually programmable) LUT that is >> typically used to convert FP16 linear pixels to fixed point PQ before handing >> them to the scaler and tone mapping operator. You're actually right that it >> indexes in the fixed point [0.0, 1.0] range for the reasons you describe, but >> because the input pixels are expected to be FP16 in the [0.0, 125.0] range, it >> applies a non-programmable 1/125.0 normalization factor first. >> >> In this case, you could think of the LUT as indexing on [0.0, 125.0], but as you >> point out there would need to be some way to describe that. Maybe we actually >> need a fractional multiplier / divider color op.
NVIDIA pipes that include this >> LUT would need to include a mandatory 1/125.0 factor immediately prior to the >> LUT, then the LUT can continue assuming a range of [0.0, 1.0]. >> >> Assuming you are using the hardware in a conventional way, specifying a >> multiplier of 1.0 after the "degamma" LUT would then map to the 80-nit PQ range >> after the static (but actually programmable) PQ LUT, whereas specifying a >> multiplier of 125.0 would map to the 10,000-nit PQ range, which is what we want. >> I guess it's kind of messy, but the effect would be that color ops other than >> multipliers/dividers would still be in the [0.0, 1.0] domain, and any multiplier >> that exceeds that range would have to be normalized by a divider before any >> other color op. >> > > Hmm. A multiplier would resolve issues when input linear FP16 data has different ideas about what 1.0 means in terms of nit values (think of Apple's EDR as an example). For a client to go from their definition to the hardware's definition of 1.0 = x nits, we would need to expose what the pipeline sees as 1.0, though. So in this case the multiplier would be programmable, but the divisor is informational? It seems like the latter would have an influence on how the former is programmed. > A programmable multiplier would either need to be backed by a HW block to perform the operation or require a driver to scale the LUT or matrix values of an adjacent LUT or matrix block. snip >>>>>> >>>>>> Yeah, this is why we need a definition. I understand "informational" to >>>>>> not change pixel values in any way. Previously I had some weird idea >>>>>> that scaling doesn't alter color, but of course it may. >>>>> >>>>> On recent hardware, the NVIDIA pre-blending pipeline includes LUTs that do >>>>> implicit fixed-point to FP16 conversions, and vice versa. >>>> >>>> Above, I claimed that the UAPI should be defined in nominal >>>> floating-point values, but I wonder, would that work?
Would we need to >>>> have explicit colorops for converting from raw pixel data values into >>>> nominal floating-point in the UAPI? >> >> Yeah, I think something like that is needed, or another solution as discussed >> below. Even if we define the UAPI in terms of floating point, the actual >> underlying pixel format needs to match the expectations of each stage as it >> flows through the pipe. >> > > Strongly agree on this. Pixel format and block relationships definitely exist. > Interesting to see this isn't just an AMD thing. :) snip >>>> >>>> Both blending and scaling are fundamentally the same operation: you >>>> have two or more source colors (pixels), and you want to compute a >>>> weighted average of them following what happens in nature, that is, >>>> physics, as that is what humans are used to. >>>> >>>> Both blending and scaling will suffer from the same problems if the >>>> operation is performed on not light-linear values. The result of the >>>> weighted average does not correspond to physics. >>>> >>>> The problem may be hard to observe with natural imagery, but Josh's >>>> example shows it very clearly. Maybe that effect is sometimes useful >>>> for some imagery in some use cases, but it is still an accidental >>>> side-effect. You might get even better results if you don't rely on >>>> accidental side-effects but design a separate operation for the exact >>>> goal you have. >>>> >>>> Mind, by scaling we mean changing image size. Not scaling color values. >>>> >> >> Fair enough, but it might not always be a choice given the hardware. >> > > Agreeing with Alex here. I get there is some debate over the best way to do this, but I think it is best to leave it up to the driver to declare how that is done. Same. snip >>>> >>>> What I was left puzzled about after the XDC workshop is that is it >>>> possible to pre-load configurations in the background (slow), and then >>>> quickly switch between them? Hardware-wise I mean. 
>> >> This works fine for our "fast" LUTs, you just point them to a surface in video >> memory and they flip to it. You could keep multiple surfaces around and flip >> between them without having to reprogram them in software. We can easily do that >> with enumerated curves, populating them when the driver initializes instead of >> waiting for the client to request them. You can even point multiple hardware >> LUTs to the same video memory surface, if they need the same curve. >> >>> >>> We could define that pipelines with a lower ID are to be preferred over >>> higher IDs. >> >> Sure, but this isn't just an issue with a pipeline as a whole, but the >> individual elements within it and how to use them in a given context. >> >>> >>> The issue is that if programming a pipeline becomes too slow to be >>> useful it probably should just not be made available to user space. >> >> It's not that programming the pipeline is overall too slow. The LUTs we have >> that are relatively slow to program are meant to be set infrequently, or even >> just once, to allow the scaler and tone mapping operator to operate in fixed >> point PQ space. You might still want the tone mapper, so you would choose a >> pipeline that includes them, but when it comes to e.g. animating a night light, >> you would want to choose a different LUT for that purpose. >> >>> >>> The prepare-commit idea for blob properties would help to make the >>> pipelines usable again, but until then it's probably a good idea to just >>> not expose those pipelines. >> >> The prepare-commit idea actually wouldn't work for these LUTs, because they are >> programmed using methods instead of pointing them to a surface. I'm actually not >> sure how slow it actually is, would need to benchmark it. I think not exposing >> them at all would be overkill, since it would mean you can't use the preblending >> scaler or tonemapper, and animation isn't necessary for that. 
>> >> The AMD 3DLUT is another example of a LUT that is slow to update, and it would >> obviously be a major loss if that wasn't exposed. There just needs to be some >> way for clients to know if they are going to kill performance by trying to >> change it every frame. >> >> Thanks, >> Alex >> > > To clarify, what are we defining as slow to update here? Something we aren't able to update within a frame (let's say at a low frame rate such as 30 fps for discussion's sake)? A block that requires a programming sequence of disable + program + enable to update? Defining performance seems like it can get murky if we start to consider frame concurrent updates among multiple color blocks as well. > I think any definition for slow would need to be imprecise on some level. In the AMD 3DLUT case we can take around 8 ms. Some compositors need the programming time to be well under 1 ms, even for low frame rates. Those compositors might want to know if an operation might be undesirable if they care about latency. I'm not sure we could reliably indicate more. Harry > Thanks, > Christopher > >>> >>>> >>>> >>>> Thanks, >>>> pq >>> >>>
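The implicit range handling that came up earlier in this thread — an FP16 input in [0.0, 125.0] normalized by a fixed 1/125.0 factor before a LUT that indexes on [0.0, 1.0] — can be modeled in a few lines (a simplified sketch of the behaviour Alex describes, not actual driver code):

```python
def lut_lookup(lut, x):
    # Conventional 1D LUT: indexes over [0.0, 1.0], nearest-entry lookup.
    x = min(max(x, 0.0), 1.0)
    return lut[round(x * (len(lut) - 1))]

def fp16_pq_lut(pixel, lut):
    # Fixed, non-programmable 1/125.0 normalization in front of the LUT:
    # FP16 125.0 (10,000 nits for PQ content) maps to LUT index 1.0,
    # FP16 1.0 (80 nits, SDR max) maps to index 1/125.
    return lut_lookup(lut, pixel / 125.0)

# 4096-entry identity LUT, matching the "custom 4k entry 1D LUT" example
identity = [i / 4095 for i in range(4096)]

hdr_peak = fp16_pq_lut(125.0, identity)  # full HDR PQ range
sdr_peak = fp16_pq_lut(1.0, identity)    # SDR content lands near 1/125
```

The point of the sketch: without a multiplier/divider colorop to express that fixed factor, userspace cannot tell whether a [0.0, 1.0]-indexed LUT actually spans 80 or 10,000 nits of input.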
On Tue, Nov 07, 2023 at 11:52:11AM -0500, Harry Wentland wrote: > > > On 2023-10-26 13:30, Sebastian Wick wrote: > > On Thu, Oct 26, 2023 at 11:57:47AM +0300, Pekka Paalanen wrote: > >> On Wed, 25 Oct 2023 15:16:08 -0500 (CDT) > >> Alex Goins <agoins@nvidia.com> wrote: > >> > >>> Thank you Harry and all other contributors for your work on this. Responses > >>> inline - > >>> > >>> On Mon, 23 Oct 2023, Pekka Paalanen wrote: > >>> > >>>> On Fri, 20 Oct 2023 11:23:28 -0400 > >>>> Harry Wentland <harry.wentland@amd.com> wrote: > >>>> > >>>>> On 2023-10-20 10:57, Pekka Paalanen wrote: > >>>>>> On Fri, 20 Oct 2023 16:22:56 +0200 > >>>>>> Sebastian Wick <sebastian.wick@redhat.com> wrote: > >>>>>> > >>>>>>> Thanks for continuing to work on this! > >>>>>>> > >>>>>>> On Thu, Oct 19, 2023 at 05:21:22PM -0400, Harry Wentland wrote: > > snip > > >> > >>>>>> I think we also need a definition of "informational". > >>>>>> > >>>>>> Counter-example 1: a colorop that represents a non-configurable > >>>>> > >>>>> Not sure what's "counter" for these examples? > >>>>> > >>>>>> YUV<->RGB conversion. Maybe it determines its operation from FB pixel > >>>>>> format. It cannot be set to bypass, it cannot be configured, and it > >>>>>> will alter color values. > >>> > >>> Would it be reasonable to expose this as a 3x4 matrix with a read-only blob and > >>> no BYPASS property? I already brought up a similar idea at the XDC HDR Workshop > >>> based on the principle that read-only blobs could be used to express some static > >>> pipeline elements without the need to define a new type, but got mixed opinions. > >>> I think this demonstrates the principle further, as clients could detect this > >>> programmatically instead of having to special-case the informational element. > >> > > > > I'm all for exposing fixed color ops but I suspect that most of those > > follow some standard and in those cases instead of exposing the matrix > > values one should prefer to expose a named matrix (e.g.
BT.601, BT.709, > > BT.2020). > > > > Agreed. > > > As a general rule: always expose the highest level description. Going > > from a name to exact values is trivial, going from values to a name is > > much harder. > > > >> If the blob depends on the pixel format (i.e. the driver automatically > >> chooses a different blob per pixel format), then I think we would need > >> to expose all the blobs and how they correspond to pixel formats. > >> Otherwise ok, I guess. > >> > >> However, do we want or need to make a color pipeline or colorop > >> conditional on pixel formats? For example, if you use a YUV 4:2:0 type > >> of pixel format, then you must use this pipeline and not any other. Or > >> floating-point type of pixel format. I did not anticipate this before, > >> I assumed that all color pipelines and colorops are independent of the > >> framebuffer pixel format. A specific colorop might have a property that > >> needs to agree with the framebuffer pixel format, but I didn't expect > >> further limitations. > > > > We could simply fail commits when the pipeline and pixel format don't > > work together. We'll probably need some kind of ingress no-op node > > anyway and maybe could list pixel formats there if required to make it > > easier to find a working configuration. > > > > The problem with failing commits is that user-space has no idea why it > failed. If this means that userspace falls back to SW composition for > NV12 and P010 it would avoid HW offloading in one of the most important > use-cases on AMD HW for power-saving purposes. Exposing which pixel formats work with a pipeline should be uncontroversial, and so should be an informative scaler op. Both can be added without a problem at a later time, so let's not make any of that mandatory for the first version. One step after the other. > > snip > > >>> Despite being programmable, the LUTs are updated in a manner that is less > >>> efficient as compared to e.g. the non-static "degamma" LUT. 
Would it be helpful > >>> if there was some way to tag operations according to their performance, > >>> for example so that clients can prefer a high performance one when they > >>> intend to do an animated transition? I recall from the XDC HDR workshop > >>> that this is also an issue with AMD's 3DLUT, where updates can be too > >>> slow to animate. > >> > >> I can certainly see such information being useful, but then we need to > >> somehow quantize the performance. > >> > >> What I was left puzzled about after the XDC workshop is that is it > >> possible to pre-load configurations in the background (slow), and then > >> quickly switch between them? Hardware-wise I mean. > > > > We could define that pipelines with a lower ID are to be preferred over > > higher IDs. > > > > The issue is that if programming a pipeline becomes too slow to be > > useful it probably should just not be made available to user space. > > > > The prepare-commit idea for blob properties would help to make the > > pipelines usable again, but until then it's probably a good idea to just > > not expose those pipelines. > > > > It's a bit of a judgment call what's too slow, though. The value of having > a HW colorop might outweigh the cost of the programming time for some > compositors but not for others. > > Harry > > >> > >> > >> Thanks, > >> pq > > > > >
> -----Original Message----- > From: Harry Wentland <harry.wentland@amd.com> > Sent: Friday, October 20, 2023 2:51 AM > To: dri-devel@lists.freedesktop.org > Cc: wayland-devel@lists.freedesktop.org; Harry Wentland > <harry.wentland@amd.com>; Ville Syrjala <ville.syrjala@linux.intel.com>; Pekka > Paalanen <pekka.paalanen@collabora.com>; Simon Ser <contact@emersion.fr>; > Melissa Wen <mwen@igalia.com>; Jonas Ådahl <jadahl@redhat.com>; Sebastian > Wick <sebastian.wick@redhat.com>; Shashank Sharma > <shashank.sharma@amd.com>; Alexander Goins <agoins@nvidia.com>; Joshua > Ashton <joshua@froggi.es>; Michel Dänzer <mdaenzer@redhat.com>; Aleix Pol > <aleixpol@kde.org>; Xaver Hugl <xaver.hugl@gmail.com>; Victoria Brekenfeld > <victoria@system76.com>; Sima <daniel@ffwll.ch>; Shankar, Uma > <uma.shankar@intel.com>; Naseer Ahmed <quic_naseer@quicinc.com>; > Christopher Braga <quic_cbraga@quicinc.com>; Abhinav Kumar > <quic_abhinavk@quicinc.com>; Arthur Grillo <arthurgrillo@riseup.net>; Hector > Martin <marcan@marcan.st>; Liviu Dudau <Liviu.Dudau@arm.com>; Sasha > McIntosh <sashamcintosh@google.com> > Subject: [RFC PATCH v2 06/17] drm/doc/rfc: Describe why prescriptive color > pipeline is needed > > v2: > - Update colorop visualizations to match reality (Sebastian, Alex Hung) > - Updated wording (Pekka) > - Change BYPASS wording to make it non-mandatory (Sebastian) > - Drop cover-letter-like paragraph from COLOR_PIPELINE Plane Property > section (Pekka) > - Use PQ EOTF instead of its inverse in Pipeline Programming example (Melissa) > - Add "Driver Implementer's Guide" section (Pekka) > - Add "Driver Forward/Backward Compatibility" section (Sebastian, Pekka) > > Signed-off-by: Harry Wentland <harry.wentland@amd.com> > Cc: Ville Syrjala <ville.syrjala@linux.intel.com> > Cc: Pekka Paalanen <pekka.paalanen@collabora.com> > Cc: Simon Ser <contact@emersion.fr> > Cc: Harry Wentland <harry.wentland@amd.com> > Cc: Melissa Wen <mwen@igalia.com> > Cc: Jonas Ådahl <jadahl@redhat.com> > 
Cc: Sebastian Wick <sebastian.wick@redhat.com> > Cc: Shashank Sharma <shashank.sharma@amd.com> > Cc: Alexander Goins <agoins@nvidia.com> > Cc: Joshua Ashton <joshua@froggi.es> > Cc: Michel Dänzer <mdaenzer@redhat.com> > Cc: Aleix Pol <aleixpol@kde.org> > Cc: Xaver Hugl <xaver.hugl@gmail.com> > Cc: Victoria Brekenfeld <victoria@system76.com> > Cc: Sima <daniel@ffwll.ch> > Cc: Uma Shankar <uma.shankar@intel.com> > Cc: Naseer Ahmed <quic_naseer@quicinc.com> > Cc: Christopher Braga <quic_cbraga@quicinc.com> > Cc: Abhinav Kumar <quic_abhinavk@quicinc.com> > Cc: Arthur Grillo <arthurgrillo@riseup.net> > Cc: Hector Martin <marcan@marcan.st> > Cc: Liviu Dudau <Liviu.Dudau@arm.com> > Cc: Sasha McIntosh <sashamcintosh@google.com> > --- > Documentation/gpu/rfc/color_pipeline.rst | 347 +++++++++++++++++++++++ > 1 file changed, 347 insertions(+) > create mode 100644 Documentation/gpu/rfc/color_pipeline.rst > > diff --git a/Documentation/gpu/rfc/color_pipeline.rst > b/Documentation/gpu/rfc/color_pipeline.rst > new file mode 100644 > index 000000000000..af5f2ea29116 > --- /dev/null > +++ b/Documentation/gpu/rfc/color_pipeline.rst > @@ -0,0 +1,347 @@ > +======================== > +Linux Color Pipeline API > +======================== > + > +What problem are we solving? > +============================ > + > +We would like to support pre-, and post-blending complex color > +transformations in display controller hardware in order to allow for > +HW-supported HDR use-cases, as well as to provide support to > +color-managed applications, such as video or image editors. > + > +It is possible to support an HDR output on HW supporting the Colorspace > +and HDR Metadata drm_connector properties, but that requires the > +compositor or application to render and compose the content into one > +final buffer intended for display. Doing so is costly. > + > +Most modern display HW offers various 1D LUTs, 3D LUTs, matrices, and > +other operations to support color transformations. 
These operations are > +often implemented in fixed-function HW and therefore much more power > +efficient than performing similar operations via shaders or CPU. > + > +We would like to make use of this HW functionality to support complex > +color transformations with no, or minimal CPU or shader load. > + > + > +How are other OSes solving this problem? > +======================================== > + > +The most widely supported use-cases regard HDR content, whether video > +or gaming. > + > +Most OSes will specify the source content format (color gamut, encoding > +transfer function, and other metadata, such as max and average light levels) to a > driver. > +Drivers will then program their fixed-function HW accordingly to map > +from a source content buffer's space to a display's space. > + > +When fixed-function HW is not available the compositor will assemble a > +shader to ask the GPU to perform the transformation from the source > +content format to the display's format. > + > +A compositor's mapping function and a driver's mapping function are > +usually entirely separate concepts. On OSes where a HW vendor has no > +insight into closed-source compositor code such a vendor will tune > +their color management code to visually match the compositor's. On > +other OSes, where both mapping functions are open to an implementer they will > ensure both mappings match. > + > +This results in mapping algorithm lock-in, meaning that no-one alone > +can experiment with or introduce new mapping algorithms and achieve > +consistent results regardless of which implementation path is taken. > + > +Why is Linux different? > +======================= > + > +Unlike other OSes, where there is one compositor for one or more > +drivers, on Linux we have a many-to-many relationship. Many compositors; > many drivers. > +In addition each compositor vendor or community has their own view of > +how color management should be done. This is what makes Linux so beautiful. 
> + > +This means that a HW vendor can no longer tune their driver to one > +compositor, as tuning it to one could make it look fairly different > +from another compositor's color mapping. > + > +We need a better solution. > + > + > +Descriptive API > +=============== > + > +An API that describes the source and destination colorspaces is a > +descriptive API. It describes the input and output color spaces but > +does not describe how precisely they should be mapped. Such a mapping > +includes many minute design decisions that can greatly affect the look of the final result. > + > +It is not feasible to describe such a mapping with enough detail to > +ensure the same result from each implementation. In fact, these > +mappings are a very active research area. > + > + > +Prescriptive API > +================ > + > +A prescriptive API does not describe the source and destination > +colorspaces. It instead prescribes a recipe for how to manipulate pixel > +values to arrive at the desired outcome. > + > +This recipe is generally an ordered list of straight-forward > +operations, with clear mathematical definitions, such as 1D LUTs, 3D > +LUTs, matrices, or other operations that can be described in a precise manner. > + > + > +The Color Pipeline API > +====================== > + > +HW color management pipelines can significantly differ between HW > +vendors in terms of availability, ordering, and capabilities of HW > +blocks. This makes a common definition of color management blocks and > +their ordering nigh impossible. Instead we are defining an API that > +allows user space to discover the HW capabilities in a generic manner, > +agnostic of specific drivers and hardware. > + > + > +drm_colorop Object & IOCTLs > +=========================== > + > +To support the definition of color pipelines we define the DRM core > +object type drm_colorop. Individual drm_colorop objects will be chained > +via the NEXT property of a drm_colorop to constitute a color pipeline.
> +Each drm_colorop object is unique, i.e., even if multiple color > +pipelines have the same operation they won't share the same drm_colorop > +object to describe that operation. > + > +Note that drivers are not expected to map drm_colorop objects > +statically to specific HW blocks. The mapping of drm_colorop objects is > +entirely a driver-internal detail and can be as dynamic or static as a > +driver needs it to be. See more in the Driver Implementation Guide section > below. > + > +Just like other DRM objects the drm_colorop objects are discovered via > +IOCTLs: > + > +DRM_IOCTL_MODE_GETCOLOROPRESOURCES: This IOCTL is used to retrieve > the > +number of all drm_colorop objects. > + > +DRM_IOCTL_MODE_GETCOLOROP: This IOCTL is used to read one drm_colorop. > +It includes the ID for the colorop object, as well as the plane_id of > +the associated plane. All other values should be registered as > +properties. > + > +Each drm_colorop has three core properties: > + > +TYPE: The type of transformation, such as > +* enumerated curve > +* custom (uniform) 1D LUT > +* 3x3 matrix > +* 3x4 matrix > +* 3D LUT > +* etc. > + > +Depending on the type of transformation other properties will describe > +more details. > + > +BYPASS: A boolean property that can be used to easily put a block into > +bypass mode. While setting other properties might fail atomic check, > +setting the BYPASS property to true should never fail. The BYPASS > +property is not mandatory for a colorop, as long as the entire pipeline > +can get bypassed by setting the COLOR_PIPELINE on a plane to '0'. > + > +NEXT: The ID of the next drm_colorop in a color pipeline, or 0 if this > +drm_colorop is the last in the chain. 
> + > +An example of a drm_colorop object might look like one of these:: > + > + /* 1D enumerated curve */ > + Color operation 42 > + ├─ "TYPE": immutable enum {1D enumerated curve, 1D LUT, 3x3 matrix, 3x4 matrix, 3D LUT, etc.} = 1D enumerated curve > + ├─ "BYPASS": bool {true, false} > + ├─ "CURVE_1D_TYPE": enum {sRGB EOTF, sRGB inverse EOTF, PQ EOTF, PQ inverse EOTF, …}

Having a fixed-function enum for some targeted input/output may not be scalable for all use cases. There are multiple colorspaces and transfer functions possible, so it will not be possible to cover all of these with any enum definition. Also, this will depend on the capabilities of the respective hardware from various vendors.

> + └─ "NEXT": immutable color operation ID = 43 > + > + /* custom 4k entry 1D LUT */ > + Color operation 52 > + ├─ "TYPE": immutable enum {1D enumerated curve, 1D LUT, 3x3 matrix, 3x4 matrix, 3D LUT, etc.} = 1D LUT > + ├─ "BYPASS": bool {true, false} > + ├─ "LUT_1D_SIZE": immutable range = 4096

For the size and capability of an individual LUT block, it would be good to add this as a blob, as defined in the blob approach we were planning earlier. So we could just take that part of the series to make this capability detection generic. Refer below:

https://patchwork.freedesktop.org/patch/554855/?series=123023&rev=1

Basically, use this structure for LUT capability and arrangement:

    struct drm_color_lut_range {
        /* DRM_MODE_LUT_* */
        __u32 flags;
        /* number of points on the curve */
        __u16 count;
        /* input/output bits per component */
        __u8 input_bpc, output_bpc;
        /* input start/end values */
        __s32 start, end;
        /* output min/max values */
        __s32 min, max;
    };

If the intention is to have just one segment with 4096 entries, it can easily be described there. Additionally, this can also cater to any kind of LUT arrangement: PWL, segmented, or logarithmic.
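To illustrate Uma's proposal, the single 4096-entry segment mentioned above might be filled in roughly as follows (a Python model of struct drm_color_lut_range for illustration only; the bpc and range values are made-up placeholders, not taken from the referenced series):

```python
from dataclasses import dataclass

@dataclass
class ColorLutRange:
    """Python stand-in for the proposed struct drm_color_lut_range."""
    flags: int       # DRM_MODE_LUT_* flags
    count: int       # number of points on the curve
    input_bpc: int   # input bits per component
    output_bpc: int  # output bits per component
    start: int       # input start value
    end: int         # input end value
    min: int         # output min value
    max: int         # output max value

# One uniform segment covering the whole input range with 4096 points;
# the 16-bpc width and full-range endpoints here are illustrative guesses.
uniform_4k = ColorLutRange(flags=0, count=4096,
                           input_bpc=16, output_bpc=16,
                           start=0, end=(1 << 16) - 1,
                           min=0, max=(1 << 16) - 1)
```

A PWL or segmented LUT would simply be a list of such segments with differing counts and ranges.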
> + ├─ "LUT_1D": blob > + └─ "NEXT": immutable color operation ID = 0 > + > + /* 17^3 3D LUT */ > + Color operation 72 > + ├─ "TYPE": immutable enum {1D enumerated curve, 1D LUT, 3x3 matrix, 3x4 > matrix, 3D LUT, etc.} = 3D LUT > + ├─ "BYPASS": bool {true, false} > + ├─ "LUT_3D_SIZE": immutable range = 17 > + ├─ "LUT_3D": blob > + └─ "NEXT": immutable color operation ID = 73 > + > + > +COLOR_PIPELINE Plane Property > +============================= > + > +Color Pipelines are created by a driver and advertised via a new > +COLOR_PIPELINE enum property on each plane. Values of the property > +always include '0', which is the default and means all color processing > +is disabled. Additional values will be the object IDs of the first > +drm_colorop in a pipeline. A driver can create and advertise none, one, > +or more possible color pipelines. A DRM client will select a color > +pipeline by setting the COLOR PIPELINE to the respective value. > + > +In the case where drivers have custom support for pre-blending color > +processing those drivers shall reject atomic commits that are trying to > +use both the custom color properties, as well as the COLOR_PIPELINE > +property. > + > +An example of a COLOR_PIPELINE property on a plane might look like this:: > + > + Plane 10 > + ├─ "type": immutable enum {Overlay, Primary, Cursor} = Primary > + ├─ … > + └─ "color_pipeline": enum {0, 42, 52} = 0 > + > + > +Color Pipeline Discovery > +======================== > + > +A DRM client wanting color management on a drm_plane will: > + > +1. Read all drm_colorop objects > +2. Get the COLOR_PIPELINE property of the plane 3. iterate all > +COLOR_PIPELINE enum values 4. 
for each enum value walk the color > +pipeline (via the NEXT pointers) > + and see if the available color operations are suitable for the > + desired color management operations > + > +An example of chained properties to define an AMD pre-blending color > +pipeline might look like this:: > + > + Plane 10 > + ├─ "TYPE" (immutable) = Primary > + └─ "COLOR_PIPELINE": enum {0, 44} = 0 > + > + Color operation 44 > + ├─ "TYPE" (immutable) = 1D enumerated curve > + ├─ "BYPASS": bool > + ├─ "CURVE_1D_TYPE": enum {sRGB EOTF, PQ EOTF} = sRGB EOTF > + └─ "NEXT" (immutable) = 45 > + > + Color operation 45 > + ├─ "TYPE" (immutable) = 3x4 Matrix > + ├─ "BYPASS": bool > + ├─ "MATRIX_3_4": blob > + └─ "NEXT" (immutable) = 46 > + > + Color operation 46 > + ├─ "TYPE" (immutable) = 1D enumerated curve > + ├─ "BYPASS": bool > + ├─ "CURVE_1D_TYPE": enum {sRGB Inverse EOTF, PQ Inverse EOTF} = sRGB > EOTF > + └─ "NEXT" (immutable) = 47 > + > + Color operation 47 > + ├─ "TYPE" (immutable) = 1D LUT > + ├─ "LUT_1D_SIZE": immutable range = 4096 > + ├─ "LUT_1D_DATA": blob > + └─ "NEXT" (immutable) = 48 > + > + Color operation 48 > + ├─ "TYPE" (immutable) = 3D LUT > + ├─ "LUT_3D_SIZE" (immutable) = 17 > + ├─ "LUT_3D_DATA": blob > + └─ "NEXT" (immutable) = 49 > + > + Color operation 49 > + ├─ "TYPE" (immutable) = 1D enumerated curve > + ├─ "BYPASS": bool > + ├─ "CURVE_1D_TYPE": enum {sRGB EOTF, PQ EOTF} = sRGB EOTF > + └─ "NEXT" (immutable) = 0 > + > + > +Color Pipeline Programming > +========================== > + > +Once a DRM client has found a suitable pipeline it will: > + > +1. Set the COLOR_PIPELINE enum value to the one pointing at the first > + drm_colorop object of the desired pipeline 2. Set the properties for > +all drm_colorop objects in the pipeline to the > + desired values, setting BYPASS to true for unused drm_colorop blocks, > + and false for enabled drm_colorop blocks 3. 
Perform > +atomic_check/commit as desired > + > +To configure the pipeline for an HDR10 PQ plane and blending in linear > +space, a compositor might perform an atomic commit with the following > +property values:: > + > + Plane 10 > + └─ "COLOR_PIPELINE" = 42 > + > + Color operation 42 (input CSC) > + └─ "BYPASS" = true > + > + Color operation 44 (DeGamma) > + └─ "BYPASS" = true > + > + Color operation 45 (gamut remap) > + └─ "BYPASS" = true > + > + Color operation 46 (shaper LUT RAM) > + └─ "BYPASS" = true > + > + Color operation 47 (3D LUT RAM) > + └─ "LUT_3D_DATA" = Gamut mapping + tone mapping + night mode > + > + Color operation 48 (blend gamma) > + └─ "CURVE_1D_TYPE" = PQ EOTF > + > + > +Driver Implementer's Guide > +========================== > + > +What does this all mean for driver implementations? As noted above the > +colorops can map to HW directly but don't need to do so. Here are some > +suggestions on how to think about creating your color pipelines: > + > +- Try to expose pipelines that use already defined colorops, even if > + your hardware pipeline is split differently. This allows existing > + userspace to immediately take advantage of the hardware. > + > +- Additionally, try to expose your actual hardware blocks as colorops. > + Define new colorop types where you believe it can offer significant > + benefits if userspace learns to program them. > + > +- Avoid defining new colorops for compound operations with very narrow > + scope. If you have a hardware block for a special operation that > + cannot be split further, you can expose that as a new colorop type. > + However, try to not define colorops for "use cases", especially if > + they require you to combine multiple hardware blocks. > + > +- Design new colorops as prescriptive, not descriptive; by the > + mathematical formula, not by the assumed input and output. > + > +A defined colorop type must be deterministic. 
Its operation can depend > +only on its properties and input and nothing else, allowed error > +tolerance notwithstanding. > + > + > +Driver Forward/Backward Compatibility > +===================================== > + > +As this is uAPI, drivers can't regress color pipelines that have been > +introduced for a given HW generation. New HW generations are free to > +abandon color pipelines advertised for previous generations. > +Nevertheless, it can be beneficial to carry support for existing color > +pipelines forward as those will likely already have support in DRM > +clients. > + > +Introducing new colorops to a pipeline is fine, as long as they can be > +disabled or are purely informational. DRM clients implementing support > +for the pipeline can always skip unknown properties as long as they can > +be confident that doing so will not cause unexpected results. > + > +If a new colorop doesn't fall into one of the above categories > +(bypassable or informational), the modified pipeline would be unusable > +for user space. In this case a new pipeline should be defined. > + > +References > +========== > + > +1. > +https://lore.kernel.org/dri-devel/QMers3awXvNCQlyhWdTtsPwkp5ie9bze_hD5n > +AccFW7a_RXlWjYB7MoUW_8CKLT2bSQwIXVi5H6VULYIxCdgvryZoAoJnC5lZgyK1 QWn488= > +@emersion.fr/ > \ No newline at end of file > -- > 2.42.0 Thanks again for this nice documentation and capturing all the details clearly. Regards, Uma Shankar
On 11/8/23 12:18, Shankar, Uma wrote:
>
>
>> -----Original Message-----
>> From: Harry Wentland <harry.wentland@amd.com>
>> Sent: Friday, October 20, 2023 2:51 AM
>> Subject: [RFC PATCH v2 06/17] drm/doc/rfc: Describe why prescriptive color
>> pipeline is needed
>>
>> [quoted patch text trimmed]
>> + >> +NEXT: The ID of the next drm_colorop in a color pipeline, or 0 if this >> +drm_colorop is the last in the chain. >> + >> +An example of a drm_colorop object might look like one of these:: >> + >> + /* 1D enumerated curve */ >> + Color operation 42 >> + ├─ "TYPE": immutable enum {1D enumerated curve, 1D LUT, 3x3 matrix, 3x4 >> matrix, 3D LUT, etc.} = 1D enumerated curve >> + ├─ "BYPASS": bool {true, false} >> + ├─ "CURVE_1D_TYPE": enum {sRGB EOTF, sRGB inverse EOTF, PQ EOTF, PQ >> inverse EOTF, …} > > Having the fixed function enum for some targeted input/output may not be scalable > for all usecases. There are multiple colorspaces and transfer functions possible, > so it will not be possible to cover all these by any enum definitions. Also, this will > depend on the capabilities of respective hardware from various vendors. The reason this exists is that certain HW vendors, such as AMD, have transfer functions implemented in HW. It is important to take advantage of these for both precision and power reasons. Additionally, not every vendor implements bucketed/segmented LUTs the same way, so it's not feasible to expose that in a way that's particularly useful or not vendor-specific. Thus we decided to have a regular 1D LUT modulated onto a known curve. This is the only real cross-vendor solution here that allows HW curve implementations to be taken advantage of and also works with bucketing/segmented LUTs (including vendors we are not aware of yet). This also means that vendors that only support HW curves at some stages without an actual LUT are also serviced. You are right that there *might* be some use case not covered by this right now, and that it would need kernel churn to implement new curves, but unfortunately that's the compromise that we have (so far) decided on in order to ensure everyone can have good, precise, power-efficient support.
It is always possible for us to extend the uAPI at a later date for other curves, or other properties that might expose a generic segmented LUT interface (such as what you have proposed for a while) for vendors that can support it. (With the whole color pipeline thing, we can essentially do 'versioning' with that, if we wanted a new 1D LUT type.) Thanks! - Joshie
On 2023-11-08 07:18, Shankar, Uma wrote:
>
>
>> -----Original Message-----
>> From: Harry Wentland <harry.wentland@amd.com>
>> Sent: Friday, October 20, 2023 2:51 AM
>> Subject: [RFC PATCH v2 06/17] drm/doc/rfc: Describe why prescriptive color
>> pipeline is needed
>>
>> [quoted patch text trimmed]
>> + >> +NEXT: The ID of the next drm_colorop in a color pipeline, or 0 if this >> +drm_colorop is the last in the chain. >> + >> +An example of a drm_colorop object might look like one of these:: >> + >> + /* 1D enumerated curve */ >> + Color operation 42 >> + ├─ "TYPE": immutable enum {1D enumerated curve, 1D LUT, 3x3 matrix, 3x4 >> matrix, 3D LUT, etc.} = 1D enumerated curve >> + ├─ "BYPASS": bool {true, false} >> + ├─ "CURVE_1D_TYPE": enum {sRGB EOTF, sRGB inverse EOTF, PQ EOTF, PQ >> inverse EOTF, …} > > Having the fixed function enum for some targeted input/output may not be scalable > for all usecases. There are multiple colorspaces and transfer functions possible, > so it will not be possible to cover all these by any enum definitions. Also, this will > depend on the capabilities of respective hardware from various vendors. > Agreed, and this is only an example of one TYPE of colorop, the "1D enumerated curve". There is a place for a "1D LUT", that's a traditional 1D LUT, or even a "PWL" type, if someone wants to define that. The beauty with the DRM object and properties approach is that this is extensible without breaking existing implementations in the kernel or userspace. >> + └─ "NEXT": immutable color operation ID = 43 >> + >> + /* custom 4k entry 1D LUT */ >> + Color operation 52 >> + ├─ "TYPE": immutable enum {1D enumerated curve, 1D LUT, 3x3 matrix, 3x4 >> matrix, 3D LUT, etc.} = 1D LUT >> + ├─ "BYPASS": bool {true, false} >> + ├─ "LUT_1D_SIZE": immutable range = 4096 > > For the size and capability of individual LUT block, it would be good to add this > as a blob as defined in the blob approach we were planning earlier. So just taking > that part of the series to have this capability detection generic. 
Refer below: > https://patchwork.freedesktop.org/patch/554855/?series=123023&rev=1 > > Basically, use this structure for lut capability and arrangement: > struct drm_color_lut_range { > /* DRM_MODE_LUT_* */ > __u32 flags; > /* number of points on the curve */ > __u16 count; > /* input/output bits per component */ > __u8 input_bpc, output_bpc; > /* input start/end values */ > __s32 start, end; > /* output min/max values */ > __s32 min, max; > }; > > If the intention is to have just 1 segment with 4096, it can be easily described there. > Additionally, this can also cater to any kind of lut arrangement, PWL, segmented or logarithmic. > Thanks for sharing this again. We've had some discussion about this and it looks like we definitely want something to describe the range of the domain of the LUT as well as its output values, maybe also things like clamping. Your struct seems to cover all of that. >> + ├─ "LUT_1D": blob >> + └─ "NEXT": immutable color operation ID = 0 >> + >> + /* 17^3 3D LUT */ >> + Color operation 72 >> + ├─ "TYPE": immutable enum {1D enumerated curve, 1D LUT, 3x3 matrix, 3x4 >> matrix, 3D LUT, etc.} = 3D LUT >> + ├─ "BYPASS": bool {true, false} >> + ├─ "LUT_3D_SIZE": immutable range = 17 >> + ├─ "LUT_3D": blob >> + └─ "NEXT": immutable color operation ID = 73 >> + >> + >> +COLOR_PIPELINE Plane Property >> +============================= >> + >> +Color Pipelines are created by a driver and advertised via a new >> +COLOR_PIPELINE enum property on each plane. Values of the property >> +always include '0', which is the default and means all color processing >> +is disabled. Additional values will be the object IDs of the first >> +drm_colorop in a pipeline. A driver can create and advertise none, one, >> +or more possible color pipelines. A DRM client will select a color >> +pipeline by setting the COLOR PIPELINE to the respective value.
>> [quoted patch text trimmed]
>> + >> +- Additionally, try to expose your actual hardware blocks as colorops. >> + Define new colorop types where you believe it can offer significant >> + benefits if userspace learns to program them. >> + >> +- Avoid defining new colorops for compound operations with very narrow >> + scope. If you have a hardware block for a special operation that >> + cannot be split further, you can expose that as a new colorop type. >> + However, try to not define colorops for "use cases", especially if >> + they require you to combine multiple hardware blocks. >> + >> +- Design new colorops as prescriptive, not descriptive; by the >> + mathematical formula, not by the assumed input and output. >> + >> +A defined colorop type must be deterministic. Its operation can depend >> +only on its properties and input and nothing else, allowed error >> +tolerance notwithstanding. >> + >> + >> +Driver Forward/Backward Compatibility >> +===================================== >> + >> +As this is uAPI drivers can't regress color pipelines that have been >> +introduced for a given HW generation. New HW generations are free to >> +abandon color pipelines advertised for previous generations. >> +Nevertheless, it can be beneficial to carry support for existing color >> +pipelines forward as those will likely already have support in DRM >> +clients. >> + >> +Introducing new colorops to a pipeline is fine, as long as they can be >> +disabled or are purely informational. DRM clients implementing support >> +for the pipeline can always skip unknown properties as long as they can >> +be confident that doing so will not cause unexpected results. >> + >> +If a new colorop doesn't fall into one of the above categories >> +(bypassable or informational) the modified pipeline would be unusable >> +for user space. In this case a new pipeline should be defined. > > Thanks again for this nice documentation and capturing all the details clearly. > Thanks for your feedback. 
Harry

> Regards,
> Uma Shankar
>
>> +
>> +References
>> +==========
>> +
>> +1. https://lore.kernel.org/dri-devel/QMers3awXvNCQlyhWdTtsPwkp5ie9bze_hD5nAccFW7a_RXlWjYB7MoUW_8CKLT2bSQwIXVi5H6VULYIxCdgvryZoAoJnC5lZgyK1QWn488=@emersion.fr/
>> \ No newline at end of file
>> --
>> 2.42.0
>
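The discovery walk that the document describes (read all drm_colorop objects, then follow the NEXT pointers of each advertised pipeline head) can be sketched in a few lines. This is a model only: the colorops below are hypothetical in-memory dicts standing in for objects a real client would fetch via the proposed DRM_IOCTL_MODE_GETCOLOROP, and the object IDs mirror the AMD example pipeline from the patch.

```python
# Illustrative model of Color Pipeline discovery from the RFC.
# Hypothetical data: a real client would fetch these objects from the
# kernel; here they are plain dicts keyed by object ID.
COLOROPS = {
    44: {"TYPE": "1D enumerated curve", "NEXT": 45},
    45: {"TYPE": "3x4 Matrix", "NEXT": 46},
    46: {"TYPE": "1D enumerated curve", "NEXT": 47},
    47: {"TYPE": "1D LUT", "NEXT": 48},
    48: {"TYPE": "3D LUT", "NEXT": 49},
    49: {"TYPE": "1D enumerated curve", "NEXT": 0},
}

def walk_pipeline(colorops, head):
    """Follow NEXT pointers from the pipeline head until NEXT == 0."""
    ops, cur = [], head
    while cur != 0:
        op = colorops[cur]
        ops.append((cur, op["TYPE"]))
        cur = op["NEXT"]
    return ops

def pipeline_suitable(colorops, head, wanted_types):
    """A client deems a pipeline usable if every colorop type it needs occurs in it."""
    present = {t for _, t in walk_pipeline(colorops, head)}
    return wanted_types <= present
```

A compositor would run `pipeline_suitable` against each COLOR_PIPELINE enum value and pick the first pipeline that covers the operations it needs, bypassing the rest.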
> -----Original Message----- > From: Joshua Ashton <joshua@froggi.es> > Sent: Wednesday, November 8, 2023 7:13 PM > To: Shankar, Uma <uma.shankar@intel.com>; Harry Wentland > <harry.wentland@amd.com>; dri-devel@lists.freedesktop.org > Cc: wayland-devel@lists.freedesktop.org; Ville Syrjala > <ville.syrjala@linux.intel.com>; Pekka Paalanen > <pekka.paalanen@collabora.com>; Simon Ser <contact@emersion.fr>; Melissa > Wen <mwen@igalia.com>; Jonas Ådahl <jadahl@redhat.com>; Sebastian Wick > <sebastian.wick@redhat.com>; Shashank Sharma > <shashank.sharma@amd.com>; Alexander Goins <agoins@nvidia.com>; Michel > Dänzer <mdaenzer@redhat.com>; Aleix Pol <aleixpol@kde.org>; Xaver Hugl > <xaver.hugl@gmail.com>; Victoria Brekenfeld <victoria@system76.com>; Sima > <daniel@ffwll.ch>; Naseer Ahmed <quic_naseer@quicinc.com>; Christopher > Braga <quic_cbraga@quicinc.com>; Abhinav Kumar > <quic_abhinavk@quicinc.com>; Arthur Grillo <arthurgrillo@riseup.net>; Hector > Martin <marcan@marcan.st>; Liviu Dudau <Liviu.Dudau@arm.com>; Sasha > McIntosh <sashamcintosh@google.com> > Subject: Re: [RFC PATCH v2 06/17] drm/doc/rfc: Describe why prescriptive color > pipeline is needed > > > > On 11/8/23 12:18, Shankar, Uma wrote: > > > > > >> -----Original Message----- > >> From: Harry Wentland <harry.wentland@amd.com> > >> Sent: Friday, October 20, 2023 2:51 AM > >> To: dri-devel@lists.freedesktop.org > >> Cc: wayland-devel@lists.freedesktop.org; Harry Wentland > >> <harry.wentland@amd.com>; Ville Syrjala > >> <ville.syrjala@linux.intel.com>; Pekka Paalanen > >> <pekka.paalanen@collabora.com>; Simon Ser <contact@emersion.fr>; > >> Melissa Wen <mwen@igalia.com>; Jonas Ådahl <jadahl@redhat.com>; > >> Sebastian Wick <sebastian.wick@redhat.com>; Shashank Sharma > >> <shashank.sharma@amd.com>; Alexander Goins <agoins@nvidia.com>; > >> Joshua Ashton <joshua@froggi.es>; Michel Dänzer > >> <mdaenzer@redhat.com>; Aleix Pol <aleixpol@kde.org>; Xaver Hugl > >> <xaver.hugl@gmail.com>; Victoria Brekenfeld 
<victoria@system76.com>; > >> Sima <daniel@ffwll.ch>; Shankar, Uma <uma.shankar@intel.com>; Naseer > >> Ahmed <quic_naseer@quicinc.com>; Christopher Braga > >> <quic_cbraga@quicinc.com>; Abhinav Kumar <quic_abhinavk@quicinc.com>; > >> Arthur Grillo <arthurgrillo@riseup.net>; Hector Martin > >> <marcan@marcan.st>; Liviu Dudau <Liviu.Dudau@arm.com>; Sasha McIntosh > >> <sashamcintosh@google.com> > >> Subject: [RFC PATCH v2 06/17] drm/doc/rfc: Describe why prescriptive > >> color pipeline is needed > >> > >> v2: > >> - Update colorop visualizations to match reality (Sebastian, Alex Hung) > >> - Updated wording (Pekka) > >> - Change BYPASS wording to make it non-mandatory (Sebastian) > >> - Drop cover-letter-like paragraph from COLOR_PIPELINE Plane Property > >> section (Pekka) > >> - Use PQ EOTF instead of its inverse in Pipeline Programming example > (Melissa) > >> - Add "Driver Implementer's Guide" section (Pekka) > >> - Add "Driver Forward/Backward Compatibility" section (Sebastian, > >> Pekka) > >> > >> Signed-off-by: Harry Wentland <harry.wentland@amd.com> > >> Cc: Ville Syrjala <ville.syrjala@linux.intel.com> > >> Cc: Pekka Paalanen <pekka.paalanen@collabora.com> > >> Cc: Simon Ser <contact@emersion.fr> > >> Cc: Harry Wentland <harry.wentland@amd.com> > >> Cc: Melissa Wen <mwen@igalia.com> > >> Cc: Jonas Ådahl <jadahl@redhat.com> > >> Cc: Sebastian Wick <sebastian.wick@redhat.com> > >> Cc: Shashank Sharma <shashank.sharma@amd.com> > >> Cc: Alexander Goins <agoins@nvidia.com> > >> Cc: Joshua Ashton <joshua@froggi.es> > >> Cc: Michel Dänzer <mdaenzer@redhat.com> > >> Cc: Aleix Pol <aleixpol@kde.org> > >> Cc: Xaver Hugl <xaver.hugl@gmail.com> > >> Cc: Victoria Brekenfeld <victoria@system76.com> > >> Cc: Sima <daniel@ffwll.ch> > >> Cc: Uma Shankar <uma.shankar@intel.com> > >> Cc: Naseer Ahmed <quic_naseer@quicinc.com> > >> Cc: Christopher Braga <quic_cbraga@quicinc.com> > >> Cc: Abhinav Kumar <quic_abhinavk@quicinc.com> > >> Cc: Arthur Grillo 
<arthurgrillo@riseup.net> > >> Cc: Hector Martin <marcan@marcan.st> > >> Cc: Liviu Dudau <Liviu.Dudau@arm.com> > >> Cc: Sasha McIntosh <sashamcintosh@google.com> > >> --- > >> Documentation/gpu/rfc/color_pipeline.rst | 347 +++++++++++++++++++++++ > >> 1 file changed, 347 insertions(+) > >> create mode 100644 Documentation/gpu/rfc/color_pipeline.rst > >> > >> diff --git a/Documentation/gpu/rfc/color_pipeline.rst > >> b/Documentation/gpu/rfc/color_pipeline.rst > >> new file mode 100644 > >> index 000000000000..af5f2ea29116 > >> --- /dev/null > >> +++ b/Documentation/gpu/rfc/color_pipeline.rst > >> @@ -0,0 +1,347 @@ > >> +======================== > >> +Linux Color Pipeline API > >> +======================== > >> + > >> +What problem are we solving? > >> +============================ > >> + > >> +We would like to support pre-, and post-blending complex color > >> +transformations in display controller hardware in order to allow for > >> +HW-supported HDR use-cases, as well as to provide support to > >> +color-managed applications, such as video or image editors. > >> + > >> +It is possible to support an HDR output on HW supporting the > >> +Colorspace and HDR Metadata drm_connector properties, but that > >> +requires the compositor or application to render and compose the > >> +content into one final buffer intended for display. Doing so is costly. > >> + > >> +Most modern display HW offers various 1D LUTs, 3D LUTs, matrices, > >> +and other operations to support color transformations. These > >> +operations are often implemented in fixed-function HW and therefore > >> +much more power efficient than performing similar operations via shaders or > CPU. > >> + > >> +We would like to make use of this HW functionality to support > >> +complex color transformations with no, or minimal CPU or shader load. > >> + > >> + > >> +How are other OSes solving this problem? 
> >> +======================================== > >> + > >> +The most widely supported use-cases regard HDR content, whether > >> +video or gaming. > >> + > >> +Most OSes will specify the source content format (color gamut, > >> +encoding transfer function, and other metadata, such as max and > >> +average light levels) to a > >> driver. > >> +Drivers will then program their fixed-function HW accordingly to map > >> +from a source content buffer's space to a display's space. > >> + > >> +When fixed-function HW is not available the compositor will assemble > >> +a shader to ask the GPU to perform the transformation from the > >> +source content format to the display's format. > >> + > >> +A compositor's mapping function and a driver's mapping function are > >> +usually entirely separate concepts. On OSes where a HW vendor has no > >> +insight into closed-source compositor code such a vendor will tune > >> +their color management code to visually match the compositor's. On > >> +other OSes, where both mapping functions are open to an implementer > >> +they will > >> ensure both mappings match. > >> + > >> +This results in mapping algorithm lock-in, meaning that no-one alone > >> +can experiment with or introduce new mapping algorithms and achieve > >> +consistent results regardless of which implementation path is taken. > >> + > >> +Why is Linux different? > >> +======================= > >> + > >> +Unlike other OSes, where there is one compositor for one or more > >> +drivers, on Linux we have a many-to-many relationship. Many > >> +compositors; > >> many drivers. > >> +In addition each compositor vendor or community has their own view > >> +of how color management should be done. This is what makes Linux so > beautiful. > >> + > >> +This means that a HW vendor can now no longer tune their driver to > >> +one compositor, as tuning it to one could make it look fairly > >> +different from another compositor's color mapping. > >> + > >> +We need a better solution. 
> >> + > >> + > >> +Descriptive API > >> +=============== > >> + > >> +An API that describes the source and destination colorspaces is a > >> +descriptive API. It describes the input and output color spaces but > >> +does not describe how precisely they should be mapped. Such a > >> +mapping includes many minute design decision that can greatly affect > >> +the look of the final > >> result. > >> + > >> +It is not feasible to describe such mapping with enough detail to > >> +ensure the same result from each implementation. In fact, these > >> +mappings are a very active research area. > >> + > >> + > >> +Prescriptive API > >> +================ > >> + > >> +A prescriptive API describes not the source and destination > >> +colorspaces. It instead prescribes a recipe for how to manipulate > >> +pixel values to arrive at the desired outcome. > >> + > >> +This recipe is generally an ordered list of straight-forward > >> +operations, with clear mathematical definitions, such as 1D LUTs, 3D > >> +LUTs, matrices, or other operations that can be described in a precise manner. > >> + > >> + > >> +The Color Pipeline API > >> +====================== > >> + > >> +HW color management pipelines can significantly differ between HW > >> +vendors in terms of availability, ordering, and capabilities of HW > >> +blocks. This makes a common definition of color management blocks > >> +and their ordering nigh impossible. Instead we are defining an API > >> +that allows user space to discover the HW capabilities in a generic > >> +manner, agnostic of specific drivers and hardware. > >> + > >> + > >> +drm_colorop Object & IOCTLs > >> +=========================== > >> + > >> +To support the definition of color pipelines we define the DRM core > >> +object type drm_colorop. Individual drm_colorop objects will be > >> +chained via the NEXT property of a drm_colorop to constitute a color > pipeline. 
> >> +Each drm_colorop object is unique, i.e., even if multiple color > >> +pipelines have the same operation they won't share the same > >> +drm_colorop object to describe that operation. > >> + > >> +Note that drivers are not expected to map drm_colorop objects > >> +statically to specific HW blocks. The mapping of drm_colorop objects > >> +is entirely a driver-internal detail and can be as dynamic or static > >> +as a driver needs it to be. See more in the Driver Implementation > >> +Guide section > >> below. > >> + > >> +Just like other DRM objects the drm_colorop objects are discovered > >> +via > >> +IOCTLs: > >> + > >> +DRM_IOCTL_MODE_GETCOLOROPRESOURCES: This IOCTL is used to > retrieve > >> the > >> +number of all drm_colorop objects. > >> + > >> +DRM_IOCTL_MODE_GETCOLOROP: This IOCTL is used to read one > drm_colorop. > >> +It includes the ID for the colorop object, as well as the plane_id > >> +of the associated plane. All other values should be registered as > >> +properties. > >> + > >> +Each drm_colorop has three core properties: > >> + > >> +TYPE: The type of transformation, such as > >> +* enumerated curve > >> +* custom (uniform) 1D LUT > >> +* 3x3 matrix > >> +* 3x4 matrix > >> +* 3D LUT > >> +* etc. > >> + > >> +Depending on the type of transformation other properties will > >> +describe more details. > >> + > >> +BYPASS: A boolean property that can be used to easily put a block > >> +into bypass mode. While setting other properties might fail atomic > >> +check, setting the BYPASS property to true should never fail. The > >> +BYPASS property is not mandatory for a colorop, as long as the > >> +entire pipeline can get bypassed by setting the COLOR_PIPELINE on a plane > to '0'. > >> + > >> +NEXT: The ID of the next drm_colorop in a color pipeline, or 0 if > >> +this drm_colorop is the last in the chain. 
>> +
>> +An example of a drm_colorop object might look like one of these::
>> +
>> + /* 1D enumerated curve */
>> + Color operation 42
>> + ├─ "TYPE": immutable enum {1D enumerated curve, 1D LUT, 3x3 matrix, 3x4 matrix, 3D LUT, etc.} = 1D enumerated curve
>> + ├─ "BYPASS": bool {true, false}
>> + ├─ "CURVE_1D_TYPE": enum {sRGB EOTF, sRGB inverse EOTF, PQ EOTF, PQ inverse EOTF, …}
> >
> > Having the fixed function enum for some targeted input/output may not
> > be scalable for all usecases. There are multiple colorspaces and
> > transfer functions possible, so it will not be possible to cover all
> > these by any enum definitions. Also, this will depend on the capabilities of
> > respective hardware from various vendors.
>
> The reason this exists is such that certain HW vendors such as AMD have transfer
> functions implemented in HW. It is important to take advantage of these for both
> precision and power reasons.

The issue we see here is that it will be too use-case- and vendor-specific. There
will be BT601, BT709, BT2020, sRGB, HDR EOTFs and many more. Not to forget, we
will need linearization and non-linearization enums for each of these, as well
as a CTM indication to convert between colorspaces.

Also, if the underlying hardware block is programmable, it's not limited to
being used only for colorspace management but can be used for other color
enhancements as well by a capable client. Hence, we feel that this is bordering
on being descriptive, with too many possible combinations (not easy to
generalize).

So, if the hardware is programmable, let's expose its capability through a blob
and be generic. For any fixed-function hardware where the LUT etc. is stored in
ROM and just a control/enable bit is provided to the driver, we can define a
pipeline with a vendor-specific color block. This can be identified with a flag
(better ways can be discussed).
For example, on some Intel platforms we had a fixed function to convert
colorspaces directly with a bit setting. These kinds of things should be vendor
specific and not be part of a generic userspace implementation. For reference:

  001b YUV601 to RGB601   YUV BT.601 to RGB BT.601 conversion.
  010b YUV709 to RGB709   YUV BT.709 to RGB BT.709 conversion.
  011b YUV2020 to RGB2020 YUV BT.2020 to RGB BT.2020 conversion.
  100b RGB709 to RGB2020  RGB BT.709 to RGB BT.2020 conversion.

> Additionally, not every vendor implements bucketed/segmented LUTs the same
> way, so it's not feasible to expose that in a way that's particularly useful
> or not vendor-specific.

If the underlying hardware is programmable, the structure which we propose to
advertise the capability of the block to userspace will be sufficient to compute
the LUT coefficients. The caps can be:

1. Number of segments in the LUT
2. Precision of the LUT
3. Starting and ending point of each segment
4. Number of samples in each segment
5. Any other flag which could be useful in this computation

This way we can compute LUTs generically and send them to the driver. This will
be scalable for all colorspaces, configurations and vendors.

> Thus we decided to have a regular 1D LUT modulated onto a known curve.
> This is the only real cross-vendor solution here that allows HW curve
> implementations to be taken advantage of and also works with
> bucketing/segmented LUTs.
> (Including vendors we are not aware of yet).
>
> This also means that vendors that only support HW curves at some stages without
> an actual LUT are also serviced.

Any fixed-function vendor implementation should be supported, but with a
vendor-specific color block. Trying to come up with enums which align with some
underlying hardware may not be scalable.
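The fixed-function conversions listed above are, mathematically, just matrix colorops. As an illustration of why a generic 3x4 matrix block can subsume a "YUV601 to RGB601" bit, here is the standard full-range BT.601 YCbCr-to-RGB mapping; the function name is mine, and a real implementation must additionally handle limited range and chroma siting.

```python
# Sketch: the "YUV601 to RGB601" fixed function expressed as the math a
# generic 3x4 matrix colorop could carry. Coefficients are the standard
# full-range BT.601 YCbCr -> RGB values.
def ycbcr601_to_rgb(y, cb, cr):
    """Full-range BT.601 YCbCr (0..1, chroma centered at 0.5) to RGB."""
    cb -= 0.5  # recenter chroma around zero
    cr -= 0.5
    r = y + 1.402 * cr
    g = y - 0.344136 * cb - 0.714136 * cr
    b = y + 1.772 * cb
    return (r, g, b)
```

A driver exposing only the fixed-function bit could still advertise a matrix colorop restricted to these coefficient sets, rather than a vendor-specific opaque block.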
> You are right that there *might* be some use case not covered by this right now,
> and that it would need kernel churn to implement new curves, but unfortunately
> that's the compromise that we (so far) have decided on in order to ensure
> everyone can have good, precise, power-efficient support.

Yes, we are aligned on this. But we believe programmable hardware should be able
to expose its caps. Fixed-function hardware should be non-generic and vendor
specific.

> It is always possible for us to extend the uAPI at a later date for other curves, or
> other properties that might expose a generic segmented LUT interface (such as
> what you have proposed for a while) for vendors that can support it.
> (With the whole color pipeline thing, we can essentially do 'versioning'
> with that, if we wanted a new 1D LUT type.)

Most hardware vendors have programmable LUTs (including AMD), so it would be
good to have this as a default generic compositor implementation. And yes, any
new color block with a type can be added to the existing APIs as the need arises
without breaking compatibility.

Regards,
Uma Shankar

>
> Thanks!
> - Joshie
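One reason the enumerated-curve approach is workable is that curves such as the PQ EOTF are fully specified mathematically (SMPTE ST 2084), which is exactly what makes an enumerated colorop type deterministic in the sense the RFC requires. A sketch of that curve, with constants taken from the ST 2084 definition:

```python
# SMPTE ST 2084 (PQ) EOTF: maps a nonlinear code value N in [0, 1] to
# absolute luminance in cd/m^2 (nits). Constants per the ST 2084 spec.
M1 = 2610 / 16384
M2 = 2523 / 4096 * 128
C1 = 3424 / 4096
C2 = 2413 / 4096 * 32
C3 = 2392 / 4096 * 32

def pq_eotf(n: float) -> float:
    """Decode a PQ-encoded value to luminance in nits (0..10000)."""
    n_p = n ** (1 / M2)
    num = max(n_p - C1, 0.0)
    den = C2 - C3 * n_p
    return 10000.0 * (num / den) ** (1 / M1)
```

Two independent implementations of this colorop type can only differ within an allowed error tolerance, which is the determinism property the "Driver Implementer's Guide" section demands.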
> -----Original Message----- > From: Harry Wentland <harry.wentland@amd.com> > Sent: Wednesday, November 8, 2023 8:08 PM > To: Shankar, Uma <uma.shankar@intel.com>; dri-devel@lists.freedesktop.org > Cc: wayland-devel@lists.freedesktop.org; Ville Syrjala > <ville.syrjala@linux.intel.com>; Pekka Paalanen > <pekka.paalanen@collabora.com>; Simon Ser <contact@emersion.fr>; Melissa > Wen <mwen@igalia.com>; Jonas Ådahl <jadahl@redhat.com>; Sebastian Wick > <sebastian.wick@redhat.com>; Shashank Sharma > <shashank.sharma@amd.com>; Alexander Goins <agoins@nvidia.com>; Joshua > Ashton <joshua@froggi.es>; Michel Dänzer <mdaenzer@redhat.com>; Aleix Pol > <aleixpol@kde.org>; Xaver Hugl <xaver.hugl@gmail.com>; Victoria Brekenfeld > <victoria@system76.com>; Sima <daniel@ffwll.ch>; Naseer Ahmed > <quic_naseer@quicinc.com>; Christopher Braga <quic_cbraga@quicinc.com>; > Abhinav Kumar <quic_abhinavk@quicinc.com>; Arthur Grillo > <arthurgrillo@riseup.net>; Hector Martin <marcan@marcan.st>; Liviu Dudau > <Liviu.Dudau@arm.com>; Sasha McIntosh <sashamcintosh@google.com> > Subject: Re: [RFC PATCH v2 06/17] drm/doc/rfc: Describe why prescriptive color > pipeline is needed > > > > On 2023-11-08 07:18, Shankar, Uma wrote: > > > > > >> -----Original Message----- > >> From: Harry Wentland <harry.wentland@amd.com> > >> Sent: Friday, October 20, 2023 2:51 AM > >> To: dri-devel@lists.freedesktop.org > >> Cc: wayland-devel@lists.freedesktop.org; Harry Wentland > >> <harry.wentland@amd.com>; Ville Syrjala > >> <ville.syrjala@linux.intel.com>; Pekka Paalanen > >> <pekka.paalanen@collabora.com>; Simon Ser <contact@emersion.fr>; > >> Melissa Wen <mwen@igalia.com>; Jonas Ådahl <jadahl@redhat.com>; > >> Sebastian Wick <sebastian.wick@redhat.com>; Shashank Sharma > >> <shashank.sharma@amd.com>; Alexander Goins <agoins@nvidia.com>; > >> Joshua Ashton <joshua@froggi.es>; Michel Dänzer > >> <mdaenzer@redhat.com>; Aleix Pol <aleixpol@kde.org>; Xaver Hugl > >> <xaver.hugl@gmail.com>; Victoria Brekenfeld 
<victoria@system76.com>; > >> Sima <daniel@ffwll.ch>; Shankar, Uma <uma.shankar@intel.com>; Naseer > >> Ahmed <quic_naseer@quicinc.com>; Christopher Braga > >> <quic_cbraga@quicinc.com>; Abhinav Kumar <quic_abhinavk@quicinc.com>; > >> Arthur Grillo <arthurgrillo@riseup.net>; Hector Martin > >> <marcan@marcan.st>; Liviu Dudau <Liviu.Dudau@arm.com>; Sasha McIntosh > >> <sashamcintosh@google.com> > >> Subject: [RFC PATCH v2 06/17] drm/doc/rfc: Describe why prescriptive > >> color pipeline is needed > >> > >> v2: > >> - Update colorop visualizations to match reality (Sebastian, Alex > >> Hung) > >> - Updated wording (Pekka) > >> - Change BYPASS wording to make it non-mandatory (Sebastian) > >> - Drop cover-letter-like paragraph from COLOR_PIPELINE Plane Property > >> section (Pekka) > >> - Use PQ EOTF instead of its inverse in Pipeline Programming example > >> (Melissa) > >> - Add "Driver Implementer's Guide" section (Pekka) > >> - Add "Driver Forward/Backward Compatibility" section (Sebastian, > >> Pekka) > >> > >> Signed-off-by: Harry Wentland <harry.wentland@amd.com> > >> Cc: Ville Syrjala <ville.syrjala@linux.intel.com> > >> Cc: Pekka Paalanen <pekka.paalanen@collabora.com> > >> Cc: Simon Ser <contact@emersion.fr> > >> Cc: Harry Wentland <harry.wentland@amd.com> > >> Cc: Melissa Wen <mwen@igalia.com> > >> Cc: Jonas Ådahl <jadahl@redhat.com> > >> Cc: Sebastian Wick <sebastian.wick@redhat.com> > >> Cc: Shashank Sharma <shashank.sharma@amd.com> > >> Cc: Alexander Goins <agoins@nvidia.com> > >> Cc: Joshua Ashton <joshua@froggi.es> > >> Cc: Michel Dänzer <mdaenzer@redhat.com> > >> Cc: Aleix Pol <aleixpol@kde.org> > >> Cc: Xaver Hugl <xaver.hugl@gmail.com> > >> Cc: Victoria Brekenfeld <victoria@system76.com> > >> Cc: Sima <daniel@ffwll.ch> > >> Cc: Uma Shankar <uma.shankar@intel.com> > >> Cc: Naseer Ahmed <quic_naseer@quicinc.com> > >> Cc: Christopher Braga <quic_cbraga@quicinc.com> > >> Cc: Abhinav Kumar <quic_abhinavk@quicinc.com> > >> Cc: Arthur Grillo 
<arthurgrillo@riseup.net> > >> Cc: Hector Martin <marcan@marcan.st> > >> Cc: Liviu Dudau <Liviu.Dudau@arm.com> > >> Cc: Sasha McIntosh <sashamcintosh@google.com> > >> --- > >> Documentation/gpu/rfc/color_pipeline.rst | 347 > >> +++++++++++++++++++++++ > >> 1 file changed, 347 insertions(+) > >> create mode 100644 Documentation/gpu/rfc/color_pipeline.rst > >> > >> diff --git a/Documentation/gpu/rfc/color_pipeline.rst > >> b/Documentation/gpu/rfc/color_pipeline.rst > >> new file mode 100644 > >> index 000000000000..af5f2ea29116 > >> --- /dev/null > >> +++ b/Documentation/gpu/rfc/color_pipeline.rst > >> @@ -0,0 +1,347 @@ > >> +======================== > >> +Linux Color Pipeline API > >> +======================== > >> + > >> +What problem are we solving? > >> +============================ > >> + > >> +We would like to support pre-, and post-blending complex color > >> +transformations in display controller hardware in order to allow for > >> +HW-supported HDR use-cases, as well as to provide support to > >> +color-managed applications, such as video or image editors. > >> + > >> +It is possible to support an HDR output on HW supporting the > >> +Colorspace and HDR Metadata drm_connector properties, but that > >> +requires the compositor or application to render and compose the > >> +content into one final buffer intended for display. Doing so is costly. > >> + > >> +Most modern display HW offers various 1D LUTs, 3D LUTs, matrices, > >> +and other operations to support color transformations. These > >> +operations are often implemented in fixed-function HW and therefore > >> +much more power efficient than performing similar operations via shaders or > CPU. > >> + > >> +We would like to make use of this HW functionality to support > >> +complex color transformations with no, or minimal CPU or shader load. > >> + > >> + > >> +How are other OSes solving this problem? 
> >> +======================================== > >> + > >> +The most widely supported use-cases regard HDR content, whether > >> +video or gaming. > >> + > >> +Most OSes will specify the source content format (color gamut, > >> +encoding transfer function, and other metadata, such as max and > >> +average light levels) to a > >> driver. > >> +Drivers will then program their fixed-function HW accordingly to map > >> +from a source content buffer's space to a display's space. > >> + > >> +When fixed-function HW is not available the compositor will assemble > >> +a shader to ask the GPU to perform the transformation from the > >> +source content format to the display's format. > >> + > >> +A compositor's mapping function and a driver's mapping function are > >> +usually entirely separate concepts. On OSes where a HW vendor has no > >> +insight into closed-source compositor code such a vendor will tune > >> +their color management code to visually match the compositor's. On > >> +other OSes, where both mapping functions are open to an implementer > >> +they will > >> ensure both mappings match. > >> + > >> +This results in mapping algorithm lock-in, meaning that no-one alone > >> +can experiment with or introduce new mapping algorithms and achieve > >> +consistent results regardless of which implementation path is taken. > >> + > >> +Why is Linux different? > >> +======================= > >> + > >> +Unlike other OSes, where there is one compositor for one or more > >> +drivers, on Linux we have a many-to-many relationship. Many > >> +compositors; > >> many drivers. > >> +In addition each compositor vendor or community has their own view > >> +of how color management should be done. This is what makes Linux so > beautiful. > >> + > >> +This means that a HW vendor can now no longer tune their driver to > >> +one compositor, as tuning it to one could make it look fairly > >> +different from another compositor's color mapping. > >> + > >> +We need a better solution. 
> >> + > >> + > >> +Descriptive API > >> +=============== > >> + > >> +An API that describes the source and destination colorspaces is a > >> +descriptive API. It describes the input and output color spaces but > >> +does not describe how precisely they should be mapped. Such a > >> +mapping includes many minute design decision that can greatly affect > >> +the look of the final > >> result. > >> + > >> +It is not feasible to describe such mapping with enough detail to > >> +ensure the same result from each implementation. In fact, these > >> +mappings are a very active research area. > >> + > >> + > >> +Prescriptive API > >> +================ > >> + > >> +A prescriptive API describes not the source and destination > >> +colorspaces. It instead prescribes a recipe for how to manipulate > >> +pixel values to arrive at the desired outcome. > >> + > >> +This recipe is generally an ordered list of straight-forward > >> +operations, with clear mathematical definitions, such as 1D LUTs, 3D > >> +LUTs, matrices, or other operations that can be described in a precise manner. > >> + > >> + > >> +The Color Pipeline API > >> +====================== > >> + > >> +HW color management pipelines can significantly differ between HW > >> +vendors in terms of availability, ordering, and capabilities of HW > >> +blocks. This makes a common definition of color management blocks > >> +and their ordering nigh impossible. Instead we are defining an API > >> +that allows user space to discover the HW capabilities in a generic > >> +manner, agnostic of specific drivers and hardware. > >> + > >> + > >> +drm_colorop Object & IOCTLs > >> +=========================== > >> + > >> +To support the definition of color pipelines we define the DRM core > >> +object type drm_colorop. Individual drm_colorop objects will be > >> +chained via the NEXT property of a drm_colorop to constitute a color > pipeline. 
> >> +Each drm_colorop object is unique, i.e., even if multiple color > >> +pipelines have the same operation they won't share the same > >> +drm_colorop object to describe that operation. > >> + > >> +Note that drivers are not expected to map drm_colorop objects > >> +statically to specific HW blocks. The mapping of drm_colorop objects > >> +is entirely a driver-internal detail and can be as dynamic or static > >> +as a driver needs it to be. See more in the Driver Implementation > >> +Guide section > >> below. > >> + > >> +Just like other DRM objects the drm_colorop objects are discovered > >> +via > >> +IOCTLs: > >> + > >> +DRM_IOCTL_MODE_GETCOLOROPRESOURCES: This IOCTL is used to > retrieve > >> the > >> +number of all drm_colorop objects. > >> + > >> +DRM_IOCTL_MODE_GETCOLOROP: This IOCTL is used to read one > drm_colorop. > >> +It includes the ID for the colorop object, as well as the plane_id > >> +of the associated plane. All other values should be registered as > >> +properties. > >> + > >> +Each drm_colorop has three core properties: > >> + > >> +TYPE: The type of transformation, such as > >> +* enumerated curve > >> +* custom (uniform) 1D LUT > >> +* 3x3 matrix > >> +* 3x4 matrix > >> +* 3D LUT > >> +* etc. > >> + > >> +Depending on the type of transformation other properties will > >> +describe more details. > >> + > >> +BYPASS: A boolean property that can be used to easily put a block > >> +into bypass mode. While setting other properties might fail atomic > >> +check, setting the BYPASS property to true should never fail. The > >> +BYPASS property is not mandatory for a colorop, as long as the > >> +entire pipeline can get bypassed by setting the COLOR_PIPELINE on a plane > to '0'. > >> + > >> +NEXT: The ID of the next drm_colorop in a color pipeline, or 0 if > >> +this drm_colorop is the last in the chain. 
> >> + > >> +An example of a drm_colorop object might look like one of these:: > >> + > >> + /* 1D enumerated curve */ > >> + Color operation 42 > >> + ├─ "TYPE": immutable enum {1D enumerated curve, 1D LUT, 3x3 > >> + matrix, 3x4 > >> matrix, 3D LUT, etc.} = 1D enumerated curve > >> + ├─ "BYPASS": bool {true, false} > >> + ├─ "CURVE_1D_TYPE": enum {sRGB EOTF, sRGB inverse EOTF, PQ EOTF, > >> + PQ > >> inverse EOTF, …} > > > > Having the fixed function enum for some targeted input/output may not > > be scalable for all usecases. There are multiple colorspaces and > > transfer functions possible, so it will not be possible to cover all > > these by any enum definitions. Also, this will depend on the capabilities of > respective hardware from various vendors. > > > > Agreed, and this is only an example of one TYPE of colorop, the "1D enumerated > curve". There is a place for a "1D LUT", that's a traditional 1D LUT, or even a > "PWL" type, if someone wants to define that. > > The beauty with the DRM object and properties approach is that this is extensible > without breaking existing implementations in the kernel or userspace. Yeah, the only concern with enums I had was on the possible combinations and its associated mapping on various hardware and vendors. So a generic userspace should rely on capability detection and programming, which will be scalable and useful for all possible hardware and vendors. Some custom hardware can be handled by vendor specific block and its related HAL. 
> >> + └─ "NEXT": immutable color operation ID = 43
> >> +
> >> + /* custom 4k entry 1D LUT */
> >> + Color operation 52
> >> + ├─ "TYPE": immutable enum {1D enumerated curve, 1D LUT, 3x3 matrix, 3x4 matrix, 3D LUT, etc.} = 1D LUT
> >> + ├─ "BYPASS": bool {true, false}
> >> + ├─ "LUT_1D_SIZE": immutable range = 4096
> >
> > For the size and capability of an individual LUT block, it would be good
> > to add this as a blob as defined in the blob approach we were planning
> > earlier. So just take that part of the series to make this capability
> > detection generic. Refer below:
> > https://patchwork.freedesktop.org/patch/554855/?series=123023&rev=1
> >
> > Basically, use this structure for LUT capability and arrangement:
> >
> > struct drm_color_lut_range {
> >         /* DRM_MODE_LUT_* */
> >         __u32 flags;
> >         /* number of points on the curve */
> >         __u16 count;
> >         /* input/output bits per component */
> >         __u8 input_bpc, output_bpc;
> >         /* input start/end values */
> >         __s32 start, end;
> >         /* output min/max values */
> >         __s32 min, max;
> > };
> >
> > If the intention is to have just one segment with 4096 entries, it can be
> > easily described there. Additionally, this can also cater to any kind of
> > LUT arrangement: PWL, segmented or logarithmic.
>
> Thanks for sharing this again. We've had some discussion about this and it looks
> like we definitely want something to describe the range of the domain of the LUT
> as well as its output values, maybe also things like clamping. Your struct seems
> to cover all of that.

Sure, thanks Harry.
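To make concrete how userspace would consume such a capability structure: given each segment's start/end and sample count, a client can compute the input positions at which to sample its target curve before uploading the blob. The sketch below uses only the `count`, `start` and `end` fields; the field semantics are my reading of the proposal, not settled uAPI.

```python
# Sketch: deriving LUT sample input positions from a list of segment
# descriptors modeled on the proposed struct drm_color_lut_range.
# Assumption: each segment covers [start, end) with 'count' evenly
# spaced samples -- this interpretation is illustrative, not settled uAPI.
def segment_inputs(segments):
    """Return the input values a client should sample its curve at."""
    inputs = []
    for seg in segments:
        count, start, end = seg["count"], seg["start"], seg["end"]
        step = (end - start) / count
        inputs.extend(start + i * step for i in range(count))
    return inputs

# Example PWL layout: dense near zero, sparse towards the top, as is
# common for gamma-like curves.
SEGMENTS = [
    {"count": 8, "start": 0.0, "end": 0.125},
    {"count": 4, "start": 0.125, "end": 1.0},
]
```

This is the sense in which the capability blob keeps the uAPI generic: the client evaluates whatever curve it likes at these positions, so the kernel never needs to enumerate curve/colorspace combinations.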
Regards, Uma Shankar > >> + ├─ "LUT_1D": blob > >> + └─ "NEXT": immutable color operation ID = 0 > >> + > >> + /* 17^3 3D LUT */ > >> + Color operation 72 > >> + ├─ "TYPE": immutable enum {1D enumerated curve, 1D LUT, 3x3 > >> + matrix, 3x4 > >> matrix, 3D LUT, etc.} = 3D LUT > >> + ├─ "BYPASS": bool {true, false} > >> + ├─ "LUT_3D_SIZE": immutable range = 17 > >> + ├─ "LUT_3D": blob > >> + └─ "NEXT": immutable color operation ID = 73 > >> + > >> + > >> +COLOR_PIPELINE Plane Property > >> +============================= > >> + > >> +Color Pipelines are created by a driver and advertised via a new > >> +COLOR_PIPELINE enum property on each plane. Values of the property > >> +always include '0', which is the default and means all color > >> +processing is disabled. Additional values will be the object IDs of > >> +the first drm_colorop in a pipeline. A driver can create and > >> +advertise none, one, or more possible color pipelines. A DRM client > >> +will select a color pipeline by setting the COLOR PIPELINE to the respective > value. > >> + > >> +In the case where drivers have custom support for pre-blending color > >> +processing those drivers shall reject atomic commits that are trying > >> +to use both the custom color properties, as well as the > >> +COLOR_PIPELINE property. > >> + > >> +An example of a COLOR_PIPELINE property on a plane might look like this:: > >> + > >> + Plane 10 > >> + ├─ "type": immutable enum {Overlay, Primary, Cursor} = Primary > >> + ├─ … > >> + └─ "color_pipeline": enum {0, 42, 52} = 0 > >> + > >> + > >> +Color Pipeline Discovery > >> +======================== > >> + > >> +A DRM client wanting color management on a drm_plane will: > >> + > >> +1. Read all drm_colorop objects > >> +2. Get the COLOR_PIPELINE property of the plane 3. iterate all > >> +COLOR_PIPELINE enum values 4. 
for each enum value walk the color > >> +pipeline (via the NEXT pointers) > >> + and see if the available color operations are suitable for the > >> + desired color management operations > >> + > >> +An example of chained properties to define an AMD pre-blending color > >> +pipeline might look like this:: > >> + > >> + Plane 10 > >> + ├─ "TYPE" (immutable) = Primary > >> + └─ "COLOR_PIPELINE": enum {0, 44} = 0 > >> + > >> + Color operation 44 > >> + ├─ "TYPE" (immutable) = 1D enumerated curve > >> + ├─ "BYPASS": bool > >> + ├─ "CURVE_1D_TYPE": enum {sRGB EOTF, PQ EOTF} = sRGB EOTF > >> + └─ "NEXT" (immutable) = 45 > >> + > >> + Color operation 45 > >> + ├─ "TYPE" (immutable) = 3x4 Matrix > >> + ├─ "BYPASS": bool > >> + ├─ "MATRIX_3_4": blob > >> + └─ "NEXT" (immutable) = 46 > >> + > >> + Color operation 46 > >> + ├─ "TYPE" (immutable) = 1D enumerated curve > >> + ├─ "BYPASS": bool > >> + ├─ "CURVE_1D_TYPE": enum {sRGB Inverse EOTF, PQ Inverse EOTF} = > >> + sRGB > >> EOTF > >> + └─ "NEXT" (immutable) = 47 > >> + > >> + Color operation 47 > >> + ├─ "TYPE" (immutable) = 1D LUT > >> + ├─ "LUT_1D_SIZE": immutable range = 4096 > >> + ├─ "LUT_1D_DATA": blob > >> + └─ "NEXT" (immutable) = 48 > >> + > >> + Color operation 48 > >> + ├─ "TYPE" (immutable) = 3D LUT > >> + ├─ "LUT_3D_SIZE" (immutable) = 17 > >> + ├─ "LUT_3D_DATA": blob > >> + └─ "NEXT" (immutable) = 49 > >> + > >> + Color operation 49 > >> + ├─ "TYPE" (immutable) = 1D enumerated curve > >> + ├─ "BYPASS": bool > >> + ├─ "CURVE_1D_TYPE": enum {sRGB EOTF, PQ EOTF} = sRGB EOTF > >> + └─ "NEXT" (immutable) = 0 > >> + > >> + > >> +Color Pipeline Programming > >> +========================== > >> + > >> +Once a DRM client has found a suitable pipeline it will: > >> + > >> +1. Set the COLOR_PIPELINE enum value to the one pointing at the first > >> + drm_colorop object of the desired pipeline 2. 
Set the properties > >> +for all drm_colorop objects in the pipeline to the > >> + desired values, setting BYPASS to true for unused drm_colorop blocks, > >> + and false for enabled drm_colorop blocks 3. Perform > >> +atomic_check/commit as desired > >> + > >> +To configure the pipeline for an HDR10 PQ plane and blending in > >> +linear space, a compositor might perform an atomic commit with the > >> +following property values:: > >> + > >> + Plane 10 > >> + └─ "COLOR_PIPELINE" = 42 > >> + > >> + Color operation 42 (input CSC) > >> + └─ "BYPASS" = true > >> + > >> + Color operation 44 (DeGamma) > >> + └─ "BYPASS" = true > >> + > >> + Color operation 45 (gamut remap) > >> + └─ "BYPASS" = true > >> + > >> + Color operation 46 (shaper LUT RAM) > >> + └─ "BYPASS" = true > >> + > >> + Color operation 47 (3D LUT RAM) > >> + └─ "LUT_3D_DATA" = Gamut mapping + tone mapping + night mode > >> + > >> + Color operation 48 (blend gamma) > >> + └─ "CURVE_1D_TYPE" = PQ EOTF > >> + > >> + > >> +Driver Implementer's Guide > >> +========================== > >> + > >> +What does this all mean for driver implementations? As noted above > >> +the colorops can map to HW directly but don't need to do so. Here > >> +are some suggestions on how to think about creating your color pipelines: > >> + > >> +- Try to expose pipelines that use already defined colorops, even if > >> + your hardware pipeline is split differently. This allows existing > >> + userspace to immediately take advantage of the hardware. > >> + > >> +- Additionally, try to expose your actual hardware blocks as colorops. > >> + Define new colorop types where you believe it can offer > >> +significant > >> + benefits if userspace learns to program them. > >> + > >> +- Avoid defining new colorops for compound operations with very > >> +narrow > >> + scope. If you have a hardware block for a special operation that > >> + cannot be split further, you can expose that as a new colorop type. 
> >> + However, try to not define colorops for "use cases", especially if > >> + they require you to combine multiple hardware blocks. > >> + > >> +- Design new colorops as prescriptive, not descriptive; by the > >> + mathematical formula, not by the assumed input and output. > >> + > >> +A defined colorop type must be deterministic. Its operation can > >> +depend only on its properties and input and nothing else, allowed > >> +error tolerance notwithstanding. > >> + > >> + > >> +Driver Forward/Backward Compatibility > >> +===================================== > >> + > >> +As this is uAPI drivers can't regress color pipelines that have been > >> +introduced for a given HW generation. New HW generations are free to > >> +abandon color pipelines advertised for previous generations. > >> +Nevertheless, it can be beneficial to carry support for existing > >> +color pipelines forward as those will likely already have support in > >> +DRM clients. > >> + > >> +Introducing new colorops to a pipeline is fine, as long as they can > >> +be disabled or are purely informational. DRM clients implementing > >> +support for the pipeline can always skip unknown properties as long > >> +as they can be confident that doing so will not cause unexpected results. > >> + > >> +If a new colorop doesn't fall into one of the above categories > >> +(bypassable or informational) the modified pipeline would be > >> +unusable for user space. In this case a new pipeline should be defined. > > > > Thanks again for this nice documentation and capturing all the details clearly. > > > > Thanks for your feedback. > > Harry > > > Regards, > > Uma Shankar > > > >> + > >> +References > >> +========== > >> + > >> +1. > >> +https://lore.kernel.org/dri-devel/QMers3awXvNCQlyhWdTtsPwkp5ie9bze_h > >> +D5n > >> > +AccFW7a_RXlWjYB7MoUW_8CKLT2bSQwIXVi5H6VULYIxCdgvryZoAoJnC5lZgyK1 > >> QWn488= > >> +@emersion.fr/ > >> \ No newline at end of file > >> -- > >> 2.42.0 > >
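The discovery steps described in the document above (read all colorop objects, then walk each candidate pipeline via the NEXT pointers) can be sketched in a few lines. The structures below are illustrative stand-ins, not actual uAPI or libdrm types; a real client would fetch object IDs and property values through the KMS property API. The object IDs mirror the AMD pre-blending example from the text (ops 44 through 49).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative stand-in for a drm_colorop; not actual uAPI. */
enum colorop_type { CURVE_1D, MATRIX_3X4, LUT_1D, LUT_3D };

struct colorop {
	uint32_t id;
	enum colorop_type type;
	uint32_t next;	/* value of the "NEXT" property; 0 terminates */
};

/* The AMD pre-blending example pipeline from the text: ops 44..49. */
static const struct colorop ops[] = {
	{ 44, CURVE_1D, 45 },
	{ 45, MATRIX_3X4, 46 },
	{ 46, CURVE_1D, 47 },
	{ 47, LUT_1D, 48 },
	{ 48, LUT_3D, 49 },
	{ 49, CURVE_1D, 0 },
};

static const struct colorop *find_op(uint32_t id)
{
	for (size_t i = 0; i < sizeof(ops) / sizeof(ops[0]); i++)
		if (ops[i].id == id)
			return &ops[i];
	return NULL;
}

/* Discovery steps 3/4: walk one pipeline via NEXT and check whether it
 * offers an operation the client needs (here: a 3D LUT). */
static int pipeline_has_3dlut(uint32_t first)
{
	const struct colorop *op;

	for (op = find_op(first); op; op = op->next ? find_op(op->next) : NULL)
		if (op->type == LUT_3D)
			return 1;
	return 0;
}
```

A client would run such a walk once per COLOR_PIPELINE enum value and select the first pipeline whose operations cover its needs, falling back to value 0 (all processing disabled) if none fits.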
On Thu, 9 Nov 2023 10:17:11 +0000 "Shankar, Uma" <uma.shankar@intel.com> wrote: > > -----Original Message----- > > From: Joshua Ashton <joshua@froggi.es> > > Sent: Wednesday, November 8, 2023 7:13 PM > > To: Shankar, Uma <uma.shankar@intel.com>; Harry Wentland > > <harry.wentland@amd.com>; dri-devel@lists.freedesktop.org ... > > Subject: Re: [RFC PATCH v2 06/17] drm/doc/rfc: Describe why prescriptive color > > pipeline is needed > > > > > > > > On 11/8/23 12:18, Shankar, Uma wrote: > > > > > > > > >> -----Original Message----- > > >> From: Harry Wentland <harry.wentland@amd.com> > > >> Sent: Friday, October 20, 2023 2:51 AM > > >> To: dri-devel@lists.freedesktop.org ... > > >> Subject: [RFC PATCH v2 06/17] drm/doc/rfc: Describe why prescriptive > > >> color pipeline is needed ... > > >> +An example of a drm_colorop object might look like one of these:: > > >> + > > >> + /* 1D enumerated curve */ > > >> + Color operation 42 > > >> + ├─ "TYPE": immutable enum {1D enumerated curve, 1D LUT, 3x3 > > >> + matrix, 3x4 > > >> matrix, 3D LUT, etc.} = 1D enumerated curve > > >> + ├─ "BYPASS": bool {true, false} > > >> + ├─ "CURVE_1D_TYPE": enum {sRGB EOTF, sRGB inverse EOTF, PQ EOTF, > > >> + PQ > > >> inverse EOTF, …} > > > > > > Having the fixed function enum for some targeted input/output may not > > > be scalable for all usecases. There are multiple colorspaces and > > > transfer functions possible, so it will not be possible to cover all > > > these by any enum definitions. Also, this will depend on the capabilities of > > respective hardware from various vendors. > > > > The reason this exists is such that certain HW vendors such as AMD have transfer > > functions implemented in HW. It is important to take advantage of these for both > > precision and power reasons. > > Issue we see here is that, it will be too usecase and vendor specific. > There will be BT601, BT709, BT2020, SRGB, HDR EOTF and many more. 
Not to forget > we will need linearization and non-linearization enums for each of these. I don't see that as a problem at all. It's not a combinatorial explosion like input/output combinations in a single enum would be. It's always a curve and its inverse at most. It's KMS properties, not every driver needs to implement every defined enum value but only those values it can and wants to support. Userspace also sees the supported list, it does not need trial and error. This is the only way to actually use hard-wired curves. The alternative would be for userspace to submit a LUT of some type, and the driver needs to start guessing if it matches one of the hard-wired curves the hardware supports, which is just not feasible. Hard-wired curves are an addition, not a replacement, to custom curves defined by parameters or various different LUT representations. Many of these hard-wired curves will emerge as is from common use cases. > Also > a CTM indication to convert colorspace. Did someone propose to enumerate matrices? I would not do that, unless you literally have hard-wired matrices in hardware and cannot do custom matrices. > Also, if the underlying hardware block is > programmable, it's not limited to be used only for colorspace management but > can be used for other color enhancements as well by a capable client. Yes, that's why we have other types for curves, the programmable ones. > Hence, we feel that it is bordering on being descriptive with too many possible > combinations (not easy to generalize). So, if hardware is programmable, let's > expose its capability through a blob and be generic. It's not descriptive though. It's a prescription of a mathematical function the hardware implements as fixed-function hardware. The function is a curve. There is no implication that the curve must be used with specific input or output color spaces. 
> For any fixed function hardware where a LUT etc. is stored in ROM and just a control/enable > bit is provided to the driver, we can define a pipeline with a vendor specific color block. This > can be identified with a flag (better ways can be discussed). No, there is no need for that. A curve type will do well. A vendor specific colorop needs vendor specific userspace code to program *at all*. A generic curve colorop might list some curve types the userspace does not understand, but also curve types userspace does understand. The understood curve types can still be used by userspace. > For example, on some of the Intel platforms, we had a fixed function to convert colorspaces > directly with a bit setting. These kinds of things should be vendor specific and not be part > of generic userspace implementation. Why would you forbid generic userspace from making use of them? > For reference: > 001b YUV601 to RGB601 YUV BT.601 to RGB BT.601 conversion. > 010b YUV709 to RGB709 YUV BT.709 to RGB BT.709 conversion. > 011b YUV2020 to RGB2020 YUV BT.2020 to RGB BT.2020 conversion. > 100b RGB709 to RGB2020 RGB BT.709 to RGB BT.2020 conversion. This is nothing like the curves we talked about above. Anyway, you can expose these fixed-function operations with a colorop that has an enum choosing the conversion. There is no need to make it vendor-specific at all. It's possible that only specific chips from Intel support it, but nothing stops anyone else from implementing or emulating the colorop if they can construct a hardware configuration achieving the same result. It seems there are already problems in exploding the number of pipelines to expose, so it's best to try to avoid single-use colorops and use enums in more generic colorops instead. > > > Additionally, not every vendor implements bucketed/segmented LUTs the same > > way, so it's not feasible to expose that in a way that's particularly useful or not > > vendor-specific. Joshua, I see no problem here really. 
They are just another type of LUT for a curve colorop, with a different configuration blob that can be defined in the UAPI. > If the underlying hardware is programmable, the structure which we propose to advertise > the capability of the block to userspace will be sufficient to compute the LUT coefficients. > The caps can be: > 1. Number of segments in the LUT > 2. Precision of the LUT > 3. Starting and ending point of the segment > 4. Number of samples in the segment. > 5. Any other flag which could be useful in this computation. > > This way we can compute LUTs generically and send them to the driver. This will be scalable for all > colorspaces, configurations and vendors. Drop the mention of colorspaces, and I hope so. :-) Color spaces don't quite exist in a prescriptive pipeline definition. > > Thus we decided to have a regular 1D LUT modulated onto a known curve. > > This is the only real cross-vendor solution here that allows HW curve > > implementations to be taken advantage of and also works with > > bucketing/segmented LUTs. > > (Including vendors we are not aware of yet). > > > > This also means that vendors that only support HW curves at some stages without > > an actual LUT are also serviced. > Any fixed function vendor implementation should be supported but with a vendor > specific color block. Trying to come up with enums which align with some underlying > hardware may not be scalable. I disagree with both of you. Who said there could be only one "degamma" block on a plane's pipeline? If hardware is best modelled as a fixed-function selectable curve followed by a custom curve, then expose exactly those two generic colorops. Nothing stops a pipeline from having two curve colorops in sequence with a disjoint set of supported types or features. If some hardware does not have one of the curve colorops, then just don't add the missing one in a pipeline. 
Thanks, pq > > You are right that there *might* be some use case not covered by this right now, > > and that it would need kernel churn to implement new curves, but unfortunately > > that's the compromise that we (so far) have decided on in order to ensure > > everyone can have good, precise, power-efficient support. > > Yes, we are aligned on this. But we believe programmable hardware should be able to > expose its caps. Fixed function hardware should be non-generic and vendor specific. > > > It is always possible for us to extend the uAPI at a later date for other curves, or > > other properties that might expose a generic segmented LUT interface (such as > > what you have proposed for a while) for vendors that can support it. > > (With the whole color pipeline thing, we can essentially do 'versioning' > > with that, if we wanted a new 1D LUT type.) > > Most of the hardware vendors have programmable LUTs (including AMD), so it would be > good to have this as a default generic compositor implementation. And yes, any new color > block with a type can be added to the existing APIs as the need arises without breaking > compatibility. > > Regards, > Uma Shankar > > > > > Thanks! > > - Joshie
> -----Original Message----- > From: Pekka Paalanen <ppaalanen@gmail.com> > Sent: Thursday, November 9, 2023 5:26 PM > To: Shankar, Uma <uma.shankar@intel.com> > Cc: Joshua Ashton <joshua@froggi.es>; Harry Wentland > <harry.wentland@amd.com>; dri-devel@lists.freedesktop.org; Sebastian Wick > <sebastian.wick@redhat.com>; Sasha McIntosh <sashamcintosh@google.com>; > Abhinav Kumar <quic_abhinavk@quicinc.com>; Shashank Sharma > <shashank.sharma@amd.com>; Xaver Hugl <xaver.hugl@gmail.com>; Hector > Martin <marcan@marcan.st>; Liviu Dudau <Liviu.Dudau@arm.com>; Alexander > Goins <agoins@nvidia.com>; Michel Dänzer <mdaenzer@redhat.com>; wayland- > devel@lists.freedesktop.org; Melissa Wen <mwen@igalia.com>; Jonas Ådahl > <jadahl@redhat.com>; Arthur Grillo <arthurgrillo@riseup.net>; Victoria > Brekenfeld <victoria@system76.com>; Sima <daniel@ffwll.ch>; Aleix Pol > <aleixpol@kde.org>; Naseer Ahmed <quic_naseer@quicinc.com>; Christopher > Braga <quic_cbraga@quicinc.com>; Ville Syrjala <ville.syrjala@linux.intel.com> > Subject: Re: [RFC PATCH v2 06/17] drm/doc/rfc: Describe why prescriptive color > pipeline is needed > > On Thu, 9 Nov 2023 10:17:11 +0000 > "Shankar, Uma" <uma.shankar@intel.com> wrote: > > > > -----Original Message----- > > > From: Joshua Ashton <joshua@froggi.es> > > > Sent: Wednesday, November 8, 2023 7:13 PM > > > To: Shankar, Uma <uma.shankar@intel.com>; Harry Wentland > > > <harry.wentland@amd.com>; dri-devel@lists.freedesktop.org > > ... > > > > Subject: Re: [RFC PATCH v2 06/17] drm/doc/rfc: Describe why > > > prescriptive color pipeline is needed > > > > > > > > > > > > On 11/8/23 12:18, Shankar, Uma wrote: > > > > > > > > > > > >> -----Original Message----- > > > >> From: Harry Wentland <harry.wentland@amd.com> > > > >> Sent: Friday, October 20, 2023 2:51 AM > > > >> To: dri-devel@lists.freedesktop.org > > ... > > > > >> Subject: [RFC PATCH v2 06/17] drm/doc/rfc: Describe why > > > >> prescriptive color pipeline is needed > > ... 
...
> > Hard-wired curves are an addition, not a replacement, to custom curves defined > by parameters or various different LUT representations. > Many of these hard-wired curves will emerge as is from common use cases. Point taken, we can go with these fixed function curve types as long as each represents a single mathematical operation, thereby avoiding the combination nightmare. However, just want to make sure that the same thing can be done with programmable hardware. In the case above, LUT tables for the same need to be hardcoded in the driver for various platforms (depending on its capabilities, precision, number, and distribution of LUTs, etc.). This is manageable, but the driver will get bloated with all kinds of hardcoded LUT tables, which could have been easily computed by the compositor at runtime. The driver cannot compute the tables at runtime due to the complexity of the floating-point math involved, so hardcoded LUT tables will be the only option. So we should just ensure that if these enums are not exposed by a driver, but a programmable LUT block is exposed instead, userspace should fall back to the programmable LUT. Having the fixed function enum should not become a mandatory norm to implement and expose even for programmable hardware. With this we will be able to cater to both kinds of hardware with a generic userspace. Hope this expectation is ok. > > Also > > a CTM indication to convert colorspace. > > Did someone propose to enumerate matrices? I would not do that, unless you > literally have hard-wired matrices in hardware and cannot do custom matrices. Not currently, but there can be a fixed function matrix for certain color space or format conversions like BT709->BT2020 etc. However, we see this is not proposed currently and if not needed, it's fine and we don't want to bring another non-problem for discussion. 
> > Also, if the underlying hardware block is programmable, it's not > > limited to be used only for colorspace management but can be used > > for other color enhancements as well by a capable client. > > Yes, that's why we have other types for curves, the programmable ones. Got that and agree, it's fine as mentioned above. > > Hence, we feel that it is bordering on being descriptive with too many > > possible combinations (not easy to generalize). So, if hardware is > > programmable, let's expose its capability through a blob and be generic. > > It's not descriptive though. It's a prescription of a mathematical function the > hardware implements as fixed-function hardware. The function is a curve. There > is no implication that the curve must be used with specific input or output color > spaces. As long as we don’t mix combinations it should be fine. But not all hardware may represent these fixed functions at single-mathematical-operation granularity. It would be tough to represent such color blocks with a single enum. > > For any fixed function hardware where a LUT etc. is stored in ROM and > > just a control/enable bit is provided to the driver, we can define a > > pipeline with a vendor specific color block. This can be identified with a flag > (better ways can be discussed). > > No, there is no need for that. A curve type will do well. Agree and aligned here. > A vendor specific colorop needs vendor specific userspace code to program *at > all*. A generic curve colorop might list some curve types the userspace does not > understand, but also curve types userspace does understand. The understood > curve types can still be used by userspace. The issue is with combined operations in hardware. If it’s a single mathematical operation, it would be easy. > > For example, on some of the Intel platforms, we had a fixed function to > > convert colorspaces directly with a bit setting. 
These kinds of things > > should be vendor specific and not be part of generic userspace implementation. > > Why would you forbid generic userspace from making use of them? Issue is that it was not one single mathematical operation but a combination as described below. > > For reference: > > 001b YUV601 to RGB601 YUV BT.601 to RGB BT.601 conversion. > > 010b YUV709 to RGB709 YUV BT.709 to RGB BT.709 conversion. > > 011b YUV2020 to RGB2020 YUV BT.2020 to RGB BT.2020 conversion. > > 100b RGB709 to RGB2020 RGB BT.709 to RGB BT.2020 conversion. > > This is nothing like the curves we talked about above. > Anyway, you can expose these fixed-function operations with a colorop that has > an enum choosing the conversion. There is no need to make it vendor-specific at > all. It's possible that only specific chips from Intel support it, but nothing stops > anyone else from implementing or emulating the colorop if they can construct a > hardware configuration achieving the same result. > > It seems there are already problems in exploding the number of pipelines to > expose, so it's best to try to avoid single-use colorops and use enums in more > generic colorops instead. Yeah, this is how hardware will implement and it involves multiple mathematical operations, controlled with one programmable bit to enable the same. These will be tough to generalize. What should be the type of color op for these would be an open. It would be great if we can address this generically. > > > > > Additionally, not every vendor implements bucketed/segemented LUTs > > > the same way, so it's not feasible to expose that in a way that's > > > particularly useful or not vendor-specific. > > Joshua, I see no problem here really. They are just another type of LUT for a curve > colorop, with a different configuration blob that can be defined in the UAPI. Yeah, agree. And the programmable hardware can be easily exposed and generalize for all vendors, so it should not be a concern. 
> > If the underlying hardware is programmable, the structure which we > > propose to advertise the capability of the block to userspace will be sufficient to > compute the LUT coefficients. > > The caps can be: > > 1. Number of segments in the LUT > > 2. Precision of the LUT > > 3. Starting and ending point of the segment 4. Number of samples in > > the segment. > > 5. Any other flag which could be useful in this computation. > > > > This way we can compute LUTs generically and send them to the driver. This > > will be scalable for all colorspaces, configurations and vendors. > > Drop the mention of colorspaces, and I hope so. :-) > > Color spaces don't quite exist in a prescriptive pipeline definition. Yeah. For the driver it's just a LUT for programmable hardware, or a mathematical operation defined via an enum for fixed function hardware.
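With the capability structure from the proposal quoted earlier in the thread, computing the upload generically reduces to walking the segments. A sketch (the struct follows the proposal, using stdint types in place of `__u32` etc.; the helpers are illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* Capability descriptor as proposed in the thread (stdint stand-ins for
 * the __u32/__u16/... kernel types). */
struct drm_color_lut_range {
	uint32_t flags;			/* DRM_MODE_LUT_* */
	uint16_t count;			/* number of points on the curve */
	uint8_t input_bpc, output_bpc;	/* input/output bits per component */
	int32_t start, end;		/* input start/end values */
	int32_t min, max;		/* output min/max values */
};

/* Total number of entries a client must compute and upload. */
static uint32_t lut_total_entries(const struct drm_color_lut_range *seg,
				  unsigned int nsegs)
{
	uint32_t total = 0;

	for (unsigned int i = 0; i < nsegs; i++)
		total += seg[i].count;
	return total;
}

/* Input value at which sample j of a segment should be evaluated; the
 * compositor plugs these positions into whatever curve it wants to
 * program (assumes count >= 2). */
static double sample_input(const struct drm_color_lut_range *seg,
			   unsigned int j)
{
	return seg->start +
	       (double)j * (seg->end - seg->start) / (seg->count - 1);
}
```

A plain 4096-entry LUT is then just the degenerate case of a single segment covering the full input range, so the same client code handles PWL, segmented and uniform layouts.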
On Fri, 10 Nov 2023 11:27:14 +0000
"Shankar, Uma" <uma.shankar@intel.com> wrote:

...
It's not a combinatorial explosion like > > input/output combinations in a single enum would be. > > It's always a curve and its inverse at most. > > > > It's KMS properties, not every driver needs to implement every defined enum > > value but only those values it can and wants to support. > > Userspace also sees the supported list, it does not need trial and error. > > > > This is the only way to actually use hard-wired curves. The alternative would be > > for userspace to submit a LUT of some type, and the driver needs to start > > guessing if it matches one of the hard-wired curves the hardware supports, which > > is just not feasible. > > > > Hard-wired curves are an addition, not a replacement, to custom curves defined > > by parameters or various different LUT representations. > > Many of these hard-wired curves will emerge as is from common use cases. > > Point taken, we can go with this fixed function curve types as long as it represents a > single mathematical operation, thereby avoiding the combination nightmare. > > However, just want to make sure that the same thing can be done with a programmable > hardware. In the case above, lut tables for the same need to be hardcoded in driver for > various platforms (depending on its capabilities, precision, number, and distribution of luts etc). Hi Uma, you can do that if you want to. > This is manageable, but driver will get bloated with all kinds of hardcoded lut tables, > which could have been easily computed by the compositor runtime. Driver cannot compute > the tables runtime due to the complexity of the floating math involved, so hardcoded > lut tables will be the only option. You do not have to do that if you don't want to. > So we should just ensure that if these enums are not exposed by a driver, but a programmable > lut block is exposed instead, userspace should fall back to the programmable lut. 
Having the > fixed function enum should not become a mandatory norm to implement and expose even for a > programmable hardware. I agree. > With this we will be able to cater to both kinds of hardware with a generic userspace. > Hope this expectation is ok. > > > > Also > > > a CTM indication to convert colospace. > > > > Did someone propose to enumerate matrices? I would not do that, unless you > > literally have hard-wired matrices in hardware and cannot do custom matrices. > > Not currently, but there can be fixed function matrix for certain color space or > format conversion like BT709->BT2020 etc.. > However, we see this is not proposed currently and if not needed, it's fine and > don't want to bring another non-problem for discussion. > > > > Also, if the underlying hardware block is programmable, its not > > > limited to be used only for the colorspace management but can be used > > > for other color enhancements as well by a capable client. > > > > Yes, that's why we have other types for curves, the programmable ones. > > Got that and agree, it's fine as mentioned above. > > > > Hence, we feel that it is bordering on being descriptive with too many > > > possible combinations (not easy to generalize). So, if hardware is > > > programmable, lets expose its capability through a blob and be generic. > > > > It's not descriptive though. It's a prescription of a mathematical function the > > hardware implements as fixed-function hardware. The function is a curve. There > > is no implication that the curve must be used with specific input or output color > > spaces. > > As long as we don’t mix combinations it should be fine. But all hardware's may not > represent these fixed functions with single mathematical operation level granularity. > It would be tough to represent such color blocks with a single enum. If a colorop does not fit for some hardware, then the driver should not expose that colorop or pipeline. 
> > > For any fixed-function hardware where the LUT etc. is stored in
> > > ROM and just a control/enable bit is provided to the driver, we
> > > can define a pipeline with a vendor-specific color block. This can
> > > be identified with a flag (better ways can be discussed).
> >
> > No, there is no need for that. A curve type will do well.
>
> Agree and aligned here.
>
> > A vendor-specific colorop needs vendor-specific userspace code to
> > program *at all*. A generic curve colorop might list some curve
> > types the userspace does not understand, but also curve types
> > userspace does understand. The understood curve types can still be
> > used by userspace.
>
> The issue is with combined operations in hardware. If it's a single
> mathematical operation, it would be easy.
>
> > > For example, on some of the Intel platforms, we had a fixed
> > > function to convert colorspaces directly with a bit setting. These
> > > kinds of things should be vendor-specific and not be part of a
> > > generic userspace implementation.
> >
> > Why would you forbid generic userspace from making use of them?
>
> The issue is that it was not one single mathematical operation but a
> combination, as described below.
>
> > > For reference:
> > > 001b YUV601 to RGB601   YUV BT.601 to RGB BT.601 conversion.
> > > 010b YUV709 to RGB709   YUV BT.709 to RGB BT.709 conversion.
> > > 011b YUV2020 to RGB2020 YUV BT.2020 to RGB BT.2020 conversion.
> > > 100b RGB709 to RGB2020  RGB BT.709 to RGB BT.2020 conversion.
> >
> > This is nothing like the curves we talked about above. Anyway, you
> > can expose these fixed-function operations with a colorop that has
> > an enum choosing the conversion. There is no need to make it
> > vendor-specific at all. It's possible that only specific chips from
> > Intel support it, but nothing stops anyone else from implementing or
> > emulating the colorop if they can construct a hardware configuration
> > achieving the same result.
> > It seems there are already problems in exploding the number of
> > pipelines to expose, so it's best to try to avoid single-use
> > colorops and use enums in more generic colorops instead.
>
> Yeah, this is how hardware will implement it, and it involves multiple
> mathematical operations, controlled with one programmable bit to
> enable them. These will be tough to generalize. What the colorop type
> for these should be remains an open question.
>
> It would be great if we can address this generically.

We would need to know what those four things actually do. Your
description is very vague. Are there curves involved?

> > > > Additionally, not every vendor implements bucketed/segmented
> > > > LUTs the same way, so it's not feasible to expose that in a way
> > > > that's particularly useful or not vendor-specific.
> >
> > Joshua, I see no problem here really. They are just another type of
> > LUT for a curve colorop, with a different configuration blob that
> > can be defined in the UAPI.
>
> Yeah, agree. And the programmable hardware can be easily exposed and
> generalized for all vendors, so it should not be a concern.
>
> > > If the underlying hardware is programmable, the structure which we
> > > propose to advertise the capability of the block to userspace will
> > > be sufficient to compute the LUT coefficients. The caps can be:
> > > 1. Number of segments in the LUT
> > > 2. Precision of the LUT
> > > 3. Starting and ending point of each segment
> > > 4. Number of samples in each segment
> > > 5. Any other flag which could be useful in this computation
> > >
> > > This way we can compute LUTs generically and send them to the
> > > driver. This will be scalable for all colorspaces, configurations
> > > and vendors.
> >
> > Drop the mention of colorspaces, and I hope so. :-)
> >
> > Color spaces don't quite exist in a prescriptive pipeline
> > definition.
>
> Yeah.
> For the driver it's just a LUT for programmable hardware, or a
> mathematical operation for fixed-function hardware defined via an
> enum.
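The segmented-LUT capability list discussed above (number of segments, precision, per-segment start/end points and sample counts) could be carried in a capability blob shaped roughly like the following C sketch. All struct and field names here are invented for illustration; they are not part of any proposed UAPI, and the fixed-point convention is an assumption.

```c
#include <stdint.h>

/* Hypothetical per-segment capability entry. A segment covers a range
 * of the (normalized) input axis and stores a given number of samples,
 * allowing denser sampling near zero, as typical segmented HW LUTs do. */
struct lut_segment_cap {
	uint32_t start;       /* segment start, e.g. in .16 fixed point */
	uint32_t end;         /* segment end, exclusive */
	uint32_t num_samples; /* samples stored for this segment */
};

/* Hypothetical top-level capability blob for one segmented 1D LUT. */
struct lut_caps {
	uint32_t precision;    /* bits per LUT entry */
	uint32_t num_segments; /* number of entries in segments[] */
	struct lut_segment_cap segments[]; /* flexible array member */
};
```

With such a blob, a compositor could compute the total sample count by summing `num_samples` over all segments, and then fill a LUT blob of exactly that size.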
diff --git a/Documentation/gpu/rfc/color_pipeline.rst b/Documentation/gpu/rfc/color_pipeline.rst
new file mode 100644
index 000000000000..af5f2ea29116
--- /dev/null
+++ b/Documentation/gpu/rfc/color_pipeline.rst
@@ -0,0 +1,347 @@
+========================
+Linux Color Pipeline API
+========================
+
+What problem are we solving?
+============================
+
+We would like to support pre- and post-blending complex color
+transformations in display controller hardware in order to allow for
+HW-supported HDR use-cases, as well as to provide support to
+color-managed applications, such as video or image editors.
+
+It is possible to support an HDR output on HW supporting the Colorspace
+and HDR Metadata drm_connector properties, but that requires the
+compositor or application to render and compose the content into one
+final buffer intended for display. Doing so is costly.
+
+Most modern display HW offers various 1D LUTs, 3D LUTs, matrices, and
+other operations to support color transformations. These operations are
+often implemented in fixed-function HW and are therefore much more
+power-efficient than performing similar operations via shaders or the
+CPU.
+
+We would like to make use of this HW functionality to support complex
+color transformations with no, or minimal, CPU or shader load.
+
+
+How are other OSes solving this problem?
+========================================
+
+The most widely supported use-cases concern HDR content, whether video
+or gaming.
+
+Most OSes will specify the source content format (color gamut, encoding
+transfer function, and other metadata, such as max and average light
+levels) to a driver. Drivers will then program their fixed-function HW
+accordingly to map from a source content buffer's space to a display's
+space.
+
+When fixed-function HW is not available the compositor will assemble a
+shader to ask the GPU to perform the transformation from the source
+content format to the display's format.
+
+A compositor's mapping function and a driver's mapping function are
+usually entirely separate concepts. On OSes where a HW vendor has no
+insight into closed-source compositor code, such a vendor will tune
+their color management code to visually match the compositor's. On
+other OSes, where both mapping functions are open to an implementer,
+they will ensure both mappings match.
+
+This results in mapping-algorithm lock-in, meaning that no-one alone
+can experiment with or introduce new mapping algorithms and achieve
+consistent results regardless of which implementation path is taken.
+
+Why is Linux different?
+=======================
+
+Unlike other OSes, where there is one compositor for one or more
+drivers, on Linux we have a many-to-many relationship. Many
+compositors; many drivers. In addition, each compositor vendor or
+community has their own view of how color management should be done.
+This is what makes Linux so beautiful.
+
+This means that a HW vendor can no longer tune their driver to one
+compositor, as tuning it to one could make it look fairly different
+from another compositor's color mapping.
+
+We need a better solution.
+
+
+Descriptive API
+===============
+
+An API that describes the source and destination colorspaces is a
+descriptive API. It describes the input and output color spaces but
+does not describe how precisely they should be mapped. Such a mapping
+includes many minute design decisions that can greatly affect the look
+of the final result.
+
+It is not feasible to describe such a mapping in enough detail to
+ensure the same result from each implementation. In fact, these
+mappings are a very active research area.
+
+
+Prescriptive API
+================
+
+A prescriptive API does not describe the source and destination
+colorspaces. Instead it prescribes a recipe for how to manipulate pixel
+values to arrive at the desired outcome.
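As a hedged illustration of two such prescribed operations (a 3x4 matrix followed by a 1D LUT), the C sketch below shows the kind of precise mathematical definition the document requires. The function names, the nearest-neighbor LUT sampling, and the normalized [0,1] value convention are choices made for this sketch only, not part of the proposed UAPI:

```c
#include <stddef.h>

/* Apply a row-major 3x4 matrix to an RGB triplet: each output channel
 * is a weighted sum of the inputs plus an offset (the fourth column). */
static void matrix_3x4_apply(const double m[12], double rgb[3])
{
	double out[3];

	for (int row = 0; row < 3; row++)
		out[row] = m[row * 4 + 0] * rgb[0] +
			   m[row * 4 + 1] * rgb[1] +
			   m[row * 4 + 2] * rgb[2] +
			   m[row * 4 + 3]; /* offset column */
	for (int row = 0; row < 3; row++)
		rgb[row] = out[row];
}

/* Sample a 1D LUT at a normalized [0,1] input, nearest-neighbor.
 * Real HW typically interpolates; nearest keeps the sketch minimal. */
static double lut1d_sample(const double *lut, size_t size, double x)
{
	size_t i = (size_t)(x * (double)(size - 1) + 0.5);

	if (i >= size)
		i = size - 1;
	return lut[i];
}
```

Because each step has a closed-form definition like this, two independent implementations given the same recipe and inputs produce the same output, within error tolerance.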
+
+This recipe is generally an ordered list of straightforward operations,
+with clear mathematical definitions, such as 1D LUTs, 3D LUTs,
+matrices, or other operations that can be described in a precise
+manner.
+
+
+The Color Pipeline API
+======================
+
+HW color management pipelines can differ significantly between HW
+vendors in terms of availability, ordering, and capabilities of HW
+blocks. This makes a common definition of color management blocks and
+their ordering nigh impossible. Instead we are defining an API that
+allows user space to discover the HW capabilities in a generic manner,
+agnostic of specific drivers and hardware.
+
+
+drm_colorop Object & IOCTLs
+===========================
+
+To support the definition of color pipelines we define the DRM core
+object type drm_colorop. Individual drm_colorop objects will be chained
+via the NEXT property of a drm_colorop to constitute a color pipeline.
+Each drm_colorop object is unique, i.e., even if multiple color
+pipelines have the same operation they won't share the same drm_colorop
+object to describe that operation.
+
+Note that drivers are not expected to map drm_colorop objects
+statically to specific HW blocks. The mapping of drm_colorop objects is
+entirely a driver-internal detail and can be as dynamic or static as a
+driver needs it to be. See more in the Driver Implementer's Guide
+section below.
+
+Just like other DRM objects, the drm_colorop objects are discovered via
+IOCTLs:
+
+DRM_IOCTL_MODE_GETCOLOROPRESOURCES: This IOCTL is used to retrieve the
+number of all drm_colorop objects.
+
+DRM_IOCTL_MODE_GETCOLOROP: This IOCTL is used to read one drm_colorop.
+It includes the ID of the colorop object, as well as the plane_id of
+the associated plane. All other values should be registered as
+properties.
+
+Each drm_colorop has three core properties:
+
+TYPE: The type of transformation, such as
+
+* enumerated curve
+* custom (uniform) 1D LUT
+* 3x3 matrix
+* 3x4 matrix
+* 3D LUT
+* etc.
+
+Depending on the type of transformation, other properties will describe
+further details.
+
+BYPASS: A boolean property that can be used to easily put a block into
+bypass mode. While setting other properties might fail atomic check,
+setting the BYPASS property to true should never fail. The BYPASS
+property is not mandatory for a colorop, as long as the entire pipeline
+can be bypassed by setting the COLOR_PIPELINE property on a plane to
+'0'.
+
+NEXT: The ID of the next drm_colorop in a color pipeline, or 0 if this
+drm_colorop is the last in the chain.
+
+An example of a drm_colorop object might look like one of these::
+
+    /* 1D enumerated curve */
+    Color operation 42
+    ├─ "TYPE": immutable enum {1D enumerated curve, 1D LUT, 3x3 matrix, 3x4 matrix, 3D LUT, etc.} = 1D enumerated curve
+    ├─ "BYPASS": bool {true, false}
+    ├─ "CURVE_1D_TYPE": enum {sRGB EOTF, sRGB inverse EOTF, PQ EOTF, PQ inverse EOTF, …}
+    └─ "NEXT": immutable color operation ID = 43
+
+    /* custom 4k entry 1D LUT */
+    Color operation 52
+    ├─ "TYPE": immutable enum {1D enumerated curve, 1D LUT, 3x3 matrix, 3x4 matrix, 3D LUT, etc.} = 1D LUT
+    ├─ "BYPASS": bool {true, false}
+    ├─ "LUT_1D_SIZE": immutable range = 4096
+    ├─ "LUT_1D": blob
+    └─ "NEXT": immutable color operation ID = 0
+
+    /* 17^3 3D LUT */
+    Color operation 72
+    ├─ "TYPE": immutable enum {1D enumerated curve, 1D LUT, 3x3 matrix, 3x4 matrix, 3D LUT, etc.} = 3D LUT
+    ├─ "BYPASS": bool {true, false}
+    ├─ "LUT_3D_SIZE": immutable range = 17
+    ├─ "LUT_3D": blob
+    └─ "NEXT": immutable color operation ID = 73
+
+
+COLOR_PIPELINE Plane Property
+=============================
+
+Color Pipelines are created by a driver and advertised via a new
+COLOR_PIPELINE enum property on each plane.
+Values of the property always include '0', which is the default and
+means all color processing is disabled. Additional values will be the
+object IDs of the first drm_colorop in a pipeline. A driver can create
+and advertise none, one, or more possible color pipelines. A DRM client
+will select a color pipeline by setting the COLOR_PIPELINE property to
+the respective value.
+
+In the case where drivers have custom support for pre-blending color
+processing, those drivers shall reject atomic commits that try to use
+both the custom color properties and the COLOR_PIPELINE property.
+
+An example of a COLOR_PIPELINE property on a plane might look like
+this::
+
+    Plane 10
+    ├─ "type": immutable enum {Overlay, Primary, Cursor} = Primary
+    ├─ …
+    └─ "color_pipeline": enum {0, 42, 52} = 0
+
+
+Color Pipeline Discovery
+========================
+
+A DRM client wanting color management on a drm_plane will:
+
+1. Read all drm_colorop objects
+2. Get the COLOR_PIPELINE property of the plane
+3. Iterate all COLOR_PIPELINE enum values
+4.
+   For each enum value, walk the color pipeline (via the NEXT
+   pointers) and see if the available color operations are suitable
+   for the desired color management operations
+
+An example of chained properties to define an AMD pre-blending color
+pipeline might look like this::
+
+    Plane 10
+    ├─ "TYPE" (immutable) = Primary
+    └─ "COLOR_PIPELINE": enum {0, 44} = 0
+
+    Color operation 44
+    ├─ "TYPE" (immutable) = 1D enumerated curve
+    ├─ "BYPASS": bool
+    ├─ "CURVE_1D_TYPE": enum {sRGB EOTF, PQ EOTF} = sRGB EOTF
+    └─ "NEXT" (immutable) = 45
+
+    Color operation 45
+    ├─ "TYPE" (immutable) = 3x4 Matrix
+    ├─ "BYPASS": bool
+    ├─ "MATRIX_3_4": blob
+    └─ "NEXT" (immutable) = 46
+
+    Color operation 46
+    ├─ "TYPE" (immutable) = 1D enumerated curve
+    ├─ "BYPASS": bool
+    ├─ "CURVE_1D_TYPE": enum {sRGB Inverse EOTF, PQ Inverse EOTF} = sRGB Inverse EOTF
+    └─ "NEXT" (immutable) = 47
+
+    Color operation 47
+    ├─ "TYPE" (immutable) = 1D LUT
+    ├─ "LUT_1D_SIZE": immutable range = 4096
+    ├─ "LUT_1D_DATA": blob
+    └─ "NEXT" (immutable) = 48
+
+    Color operation 48
+    ├─ "TYPE" (immutable) = 3D LUT
+    ├─ "LUT_3D_SIZE" (immutable) = 17
+    ├─ "LUT_3D_DATA": blob
+    └─ "NEXT" (immutable) = 49
+
+    Color operation 49
+    ├─ "TYPE" (immutable) = 1D enumerated curve
+    ├─ "BYPASS": bool
+    ├─ "CURVE_1D_TYPE": enum {sRGB EOTF, PQ EOTF} = sRGB EOTF
+    └─ "NEXT" (immutable) = 0
+
+
+Color Pipeline Programming
+==========================
+
+Once a DRM client has found a suitable pipeline it will:
+
+1. Set the COLOR_PIPELINE enum value to the one pointing at the first
+   drm_colorop object of the desired pipeline
+2. Set the properties of all drm_colorop objects in the pipeline to the
+   desired values, setting BYPASS to true for unused drm_colorop
+   blocks, and false for enabled drm_colorop blocks
+3.
+   Perform atomic_check/commit as desired
+
+To configure the pipeline for an HDR10 PQ plane and blending in linear
+space, a compositor might perform an atomic commit with the following
+property values::
+
+    Plane 10
+    └─ "COLOR_PIPELINE" = 42
+
+    Color operation 42 (input CSC)
+    └─ "BYPASS" = true
+
+    Color operation 44 (DeGamma)
+    └─ "BYPASS" = true
+
+    Color operation 45 (gamut remap)
+    └─ "BYPASS" = true
+
+    Color operation 46 (shaper LUT RAM)
+    └─ "BYPASS" = true
+
+    Color operation 47 (3D LUT RAM)
+    └─ "LUT_3D_DATA" = gamut mapping + tone mapping + night mode
+
+    Color operation 48 (blend gamma)
+    └─ "CURVE_1D_TYPE" = PQ EOTF
+
+
+Driver Implementer's Guide
+==========================
+
+What does this all mean for driver implementations? As noted above, the
+colorops can map to HW directly but don't need to do so. Here are some
+suggestions on how to think about creating your color pipelines:
+
+- Try to expose pipelines that use already-defined colorops, even if
+  your hardware pipeline is split differently. This allows existing
+  userspace to immediately take advantage of the hardware.
+
+- Additionally, try to expose your actual hardware blocks as colorops.
+  Define new colorop types where you believe they can offer significant
+  benefits if userspace learns to program them.
+
+- Avoid defining new colorops for compound operations with very narrow
+  scope. If you have a hardware block for a special operation that
+  cannot be split further, you can expose that as a new colorop type.
+  However, try not to define colorops for "use cases", especially if
+  they require you to combine multiple hardware blocks.
+
+- Design new colorops as prescriptive, not descriptive; by the
+  mathematical formula, not by the assumed input and output.
+
+A defined colorop type must be deterministic. Its operation can depend
+only on its properties and input, and nothing else, allowed error
+tolerance notwithstanding.
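The discovery walk described above (follow NEXT from a pipeline's first colorop until it reads 0) can be sketched with a self-contained model. The `struct colorop` below is a local stand-in for the TYPE and NEXT properties a client would read via the proposed IOCTLs, not the actual UAPI:

```c
#include <stdint.h>
#include <stddef.h>

/* Local stand-in for an advertised drm_colorop: its object ID, its
 * TYPE, and the ID of the next colorop in the chain (0 terminates). */
enum colorop_type { CURVE_1D, MATRIX_3X4, LUT_1D, LUT_3D };

struct colorop {
	uint32_t id;
	enum colorop_type type;
	uint32_t next;
};

static const struct colorop *find_colorop(const struct colorop *ops,
					  size_t n, uint32_t id)
{
	for (size_t i = 0; i < n; i++)
		if (ops[i].id == id)
			return &ops[i];
	return NULL;
}

/* Walk a pipeline starting at 'first', following NEXT until 0.
 * Returns the number of colorops in the chain, or 0 if a NEXT ID
 * does not match any known colorop. */
static size_t walk_pipeline(const struct colorop *ops, size_t n,
			    uint32_t first)
{
	size_t len = 0;

	for (uint32_t id = first; id != 0;) {
		const struct colorop *op = find_colorop(ops, n, id);

		if (!op)
			return 0;
		len++;
		id = op->next;
	}
	return len;
}
```

A real client would inspect each colorop's TYPE (and type-specific properties) during the walk to decide whether the pipeline suits its needs, rather than merely counting the chain as this sketch does.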
+
+
+Driver Forward/Backward Compatibility
+=====================================
+
+As this is uAPI, drivers can't regress color pipelines that have been
+introduced for a given HW generation. New HW generations are free to
+abandon color pipelines advertised for previous generations.
+Nevertheless, it can be beneficial to carry support for existing color
+pipelines forward, as those will likely already have support in DRM
+clients.
+
+Introducing new colorops to a pipeline is fine, as long as they can be
+disabled or are purely informational. DRM clients implementing support
+for the pipeline can always skip unknown properties as long as they can
+be confident that doing so will not cause unexpected results.
+
+If a new colorop doesn't fall into one of the above categories
+(bypassable or informational), the modified pipeline would be unusable
+for user space. In this case a new pipeline should be defined.
+
+
+References
+==========
+
+1. https://lore.kernel.org/dri-devel/QMers3awXvNCQlyhWdTtsPwkp5ie9bze_hD5nAccFW7a_RXlWjYB7MoUW_8CKLT2bSQwIXVi5H6VULYIxCdgvryZoAoJnC5lZgyK1QWn488=@emersion.fr/
\ No newline at end of file