x86/EPT: relax iPAT for "invalid" MFNs

Message-ID: <56063a8f-f569-4130-ac25-f0f064e288a1@suse.com>
Date: Mon, 10 Jun 2024 16:58:52 +0200
From: Jan Beulich <jbeulich@suse.com>
To: "xen-devel@lists.xenproject.org" <xen-devel@lists.xenproject.org>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>, Roger Pau Monné <roger.pau@citrix.com>
Subject: [PATCH] x86/EPT: relax iPAT for "invalid" MFNs

Jan Beulich June 10, 2024, 2:58 p.m. UTC

mfn_valid() is RAM-focused; it will often return false for MMIO. Yet
access to actual MMIO space should not generally be restricted to UC
only; especially video frame buffer accesses are unduly affected by such
a restriction. Permit PAT use for directly assigned MMIO as long as the
domain is known to have been granted some level of cache control.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
Considering that we've just declared PVH Dom0 "supported", this may well
qualify for 4.19. The issue was specifically very noticeable there.

The conditional may be more complex than really necessary, but it's in
line with what we do elsewhere. And imo better continue to be a little
too restrictive, than moving to too lax.

Jan Beulich June 10, 2024, 3 p.m. UTC | #1

On 10.06.2024 16:58, Jan Beulich wrote:
> mfn_valid() is RAM-focused; it will often return false for MMIO. Yet
> access to actual MMIO space should not generally be restricted to UC
> only; especially video frame buffer accesses are unduly affected by such
> a restriction. Permit PAT use for directly assigned MMIO as long as the
> domain is known to have been granted some level of cache control.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> Considering that we've just declared PVH Dom0 "supported", this may well
> qualify for 4.19. The issue was specifically very noticeable there.

Actually - meant to Cc Oleksii for this, and then forgot.

Jan

> The conditional may be more complex than really necessary, but it's in
> line with what we do elsewhere. And imo better continue to be a little
> too restrictive, than moving to too lax.
> 
> --- a/xen/arch/x86/mm/p2m-ept.c
> +++ b/xen/arch/x86/mm/p2m-ept.c
> @@ -503,7 +503,8 @@ int epte_get_entry_emt(struct domain *d,
>  
>      if ( !mfn_valid(mfn) )
>      {
> -        *ipat = true;
> +        *ipat = type != p2m_mmio_direct ||
> +                (!is_iommu_enabled(d) && !cache_flush_permitted(d));
>          return X86_MT_UC;
>      }
>
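For readers following along, the new value of *ipat in the hunk above can be modelled in isolation. This is only a sketch: the struct, the enum and the function name are hypothetical stand-ins, and the two boolean fields merely model what is_iommu_enabled(d) and cache_flush_permitted(d) return in Xen.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the Xen types used by the patch. */
enum p2m_type_sketch { P2M_RAM_RW_SKETCH, P2M_MMIO_DIRECT_SKETCH };

struct domain_sketch {
    bool iommu_enabled;         /* models is_iommu_enabled(d) */
    bool cache_flush_permitted; /* models cache_flush_permitted(d) */
};

/*
 * Models only the value stored in *ipat on the !mfn_valid(mfn) path.
 * true:  the guest PAT is ignored, the mapping stays plain UC.
 * false: the guest PAT may be combined with the UC memory type.
 */
bool ipat_for_invalid_mfn(const struct domain_sketch *d,
                          enum p2m_type_sketch type)
{
    assert(d != NULL);

    return type != P2M_MMIO_DIRECT_SKETCH ||
           (!d->iommu_enabled && !d->cache_flush_permitted);
}
```

In this model only direct-MMIO entries of a domain with an IOMMU enabled, or with cache flushing permitted, get iPAT cleared; every other combination keeps the previous behaviour.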

Roger Pau Monne June 11, 2024, 7:41 a.m. UTC | #2

On Mon, Jun 10, 2024 at 04:58:52PM +0200, Jan Beulich wrote:
> mfn_valid() is RAM-focused; it will often return false for MMIO. Yet
> access to actual MMIO space should not generally be restricted to UC
> only; especially video frame buffer accesses are unduly affected by such
> a restriction. Permit PAT use for directly assigned MMIO as long as the
> domain is known to have been granted some level of cache control.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> Considering that we've just declared PVH Dom0 "supported", this may well
> qualify for 4.19. The issue was specifically very noticeable there.
> 
> The conditional may be more complex than really necessary, but it's in
> line with what we do elsewhere. And imo better continue to be a little
> too restrictive, than moving to too lax.
> 
> --- a/xen/arch/x86/mm/p2m-ept.c
> +++ b/xen/arch/x86/mm/p2m-ept.c
> @@ -503,7 +503,8 @@ int epte_get_entry_emt(struct domain *d,
>  
>      if ( !mfn_valid(mfn) )
>      {
> -        *ipat = true;
> +        *ipat = type != p2m_mmio_direct ||
> +                (!is_iommu_enabled(d) && !cache_flush_permitted(d));

Looking at this, shouldn't the !mfn_valid special case be removed, and
mfns without a valid page be processed normally, so that the guest
MTRR values are taken into account, and no iPAT is enforced?

I also think this likely wants a:

Fixes: 81fd0d3ca4b2 ('x86/hvm: simplify 'mmio_direct' check in epte_get_entry_emt()')

As AFAICT before that commit direct MMIO regions would set iPAT to WB,
which would result in the correct attributes (albeit guest MTRR was
still ignored).

Thanks, Roger.

Jan Beulich June 11, 2024, 8:26 a.m. UTC | #3

On 11.06.2024 09:41, Roger Pau Monné wrote:
> On Mon, Jun 10, 2024 at 04:58:52PM +0200, Jan Beulich wrote:
>> mfn_valid() is RAM-focused; it will often return false for MMIO. Yet
>> access to actual MMIO space should not generally be restricted to UC
>> only; especially video frame buffer accesses are unduly affected by such
>> a restriction. Permit PAT use for directly assigned MMIO as long as the
>> domain is known to have been granted some level of cache control.
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>> ---
>> Considering that we've just declared PVH Dom0 "supported", this may well
>> qualify for 4.19. The issue was specifically very noticeable there.
>>
>> The conditional may be more complex than really necessary, but it's in
>> line with what we do elsewhere. And imo better continue to be a little
>> too restrictive, than moving to too lax.
>>
>> --- a/xen/arch/x86/mm/p2m-ept.c
>> +++ b/xen/arch/x86/mm/p2m-ept.c
>> @@ -503,7 +503,8 @@ int epte_get_entry_emt(struct domain *d,
>>  
>>      if ( !mfn_valid(mfn) )
>>      {
>> -        *ipat = true;
>> +        *ipat = type != p2m_mmio_direct ||
>> +                (!is_iommu_enabled(d) && !cache_flush_permitted(d));
> 
> Looking at this, shouldn't the !mfn_valid special case be removed, and
> mfns without a valid page be processed normally, so that the guest
> MTRR values are taken into account, and no iPAT is enforced?

Such removal is what, in the post commit message remark, I'm referring to
as "moving to too lax". Doing so might be okay, but will imo be hard to
prove to be correct for all possible cases. Along these lines goes also
that I'm adding the IOMMU-enabled and cache-flush checks: In principle
p2m_mmio_direct should not be used when neither of these return true. Yet
a similar consideration would apply to the immediately subsequent if().

Removing this code would, in particular, result in INVALID_MFN getting a
type of WB by way of the subsequent if(), unless the type there would
also be p2m_mmio_direct (which, as said, it ought to never be for non-
pass-through domains). That again _may_ not be a problem as long as such
EPT entries would never be marked present, yet that's again difficult to
prove.

I was in fact wondering whether to special-case INVALID_MFN in the change
I'm making. Question there is: Are we sure that by now we've indeed got
rid of all arithmetic mistakenly done on MFN variables happening to hold
INVALID_MFN as the value? IOW I fear that there might be code left which
would pass in INVALID_MFN masked down to a 2M or 1G boundary. At which
point checking for just INVALID_MFN would end up insufficient. If we
meant to rely on this (tagging possible leftover issues as bugs we don't
mean to attempt to cover for here anymore), then indeed the mfn_valid()
check could be replaced by a comparison with INVALID_MFN (following a
pattern we've been slowly trying to carry through elsewhere, especially
in shadow code). Yet it could still not be outright dropped imo.
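To illustrate the concern about masked values, here is a minimal sketch. It assumes MFNs are plain unsigned long values and that the invalid marker is all ones; the macro and helper names are hypothetical and not Xen's.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumption for this sketch: the invalid marker is an all-ones value. */
#define INVALID_MFN_SKETCH (~0UL)

#define ORDER_2M_SKETCH 9   /* 2M superpage: 2^9 frames of 4K  */
#define ORDER_1G_SKETCH 18  /* 1G superpage: 2^18 frames of 4K */

/* Align a frame number down to a superpage boundary. */
unsigned long align_down_sketch(unsigned long mfn, unsigned int order)
{
    assert(order < sizeof(unsigned long) * 8);

    return mfn & ~((1UL << order) - 1);
}

/* The "compare against INVALID_MFN" style of check. */
bool is_invalid_exact(unsigned long mfn)
{
    return mfn == INVALID_MFN_SKETCH;
}
```

Once the marker has been aligned down to a 2M or 1G boundary, is_invalid_exact() no longer recognises it, which is the case a bare comparison would miss.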

Furthermore simply dropping (or replacing as per above) that check won't
work either: Further down in the function we use mfn_to_page(), which
requires an up-front mfn_valid() check. That said, this code looks
partly broken to me anyway: For a 1G page mfn_valid() on the start of it
doesn't really imply all parts of it are valid. I guess I need to make a
2nd patch to address that as well, which may then want to be a prereq
change to the one here (if we decided to go the route you're asking for).
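The 1G observation can be sketched with a toy model. Everything here is an assumption made for illustration: RAM is a single contiguous range, frame_valid_sketch() stands in for mfn_valid(), and the per-frame loop is just the simplest way to show the idea, not what an actual patch would do.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model: one contiguous RAM range [start, end), in 4K frames. */
struct ram_map_sketch {
    unsigned long start;
    unsigned long end;
};

/* Stand-in for a per-frame validity check. */
bool frame_valid_sketch(const struct ram_map_sketch *m, unsigned long mfn)
{
    return mfn >= m->start && mfn < m->end;
}

/* True only if every frame of the 2^order range is valid. */
bool range_valid_sketch(const struct ram_map_sketch *m, unsigned long mfn,
                        unsigned int order)
{
    unsigned long i, nr;

    assert(order < sizeof(unsigned long) * 8);
    nr = 1UL << order;

    for ( i = 0; i < nr; i++ )
        if ( !frame_valid_sketch(m, mfn + i) )
            return false;

    return true;
}
```

With 512M of RAM starting at frame 0, the first frame of a 1G range is valid while the range as a whole is not, so checking only the start of the page would give the wrong answer.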

> I also think this likely wants a:
> 
> Fixes: 81fd0d3ca4b2 ('x86/hvm: simplify 'mmio_direct' check in epte_get_entry_emt()')

Oh, indeed, I should have dug out when this broke. I didn't because I
knew this mfn_valid() check was there forever, neglecting that it wasn't
always (almost) first.

> As AFAICT before that commit direct MMIO regions would set iPAT to WB,
> which would result in the correct attributes (albeit guest MTRR was
> still ignored).

Two corrections here: First iPAT is a boolean; it can't be set to WB.
And then what was happening prior to that change was that for the APIC
access page iPAT was set to true, thus forcing WB there. iPAT was left
set to false for all other p2m_mmio_direct pages, yielding (PAT-
overridable) UC there.

Jan

Roger Pau Monne June 11, 2024, 9:02 a.m. UTC | #4

On Tue, Jun 11, 2024 at 10:26:32AM +0200, Jan Beulich wrote:
> On 11.06.2024 09:41, Roger Pau Monné wrote:
> > On Mon, Jun 10, 2024 at 04:58:52PM +0200, Jan Beulich wrote:
> >> mfn_valid() is RAM-focused; it will often return false for MMIO. Yet
> >> access to actual MMIO space should not generally be restricted to UC
> >> only; especially video frame buffer accesses are unduly affected by such
> >> a restriction. Permit PAT use for directly assigned MMIO as long as the
> >> domain is known to have been granted some level of cache control.
> >>
> >> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> >> ---
> >> Considering that we've just declared PVH Dom0 "supported", this may well
> >> qualify for 4.19. The issue was specifically very noticeable there.
> >>
> >> The conditional may be more complex than really necessary, but it's in
> >> line with what we do elsewhere. And imo better continue to be a little
> >> too restrictive, than moving to too lax.
> >>
> >> --- a/xen/arch/x86/mm/p2m-ept.c
> >> +++ b/xen/arch/x86/mm/p2m-ept.c
> >> @@ -503,7 +503,8 @@ int epte_get_entry_emt(struct domain *d,
> >>  
> >>      if ( !mfn_valid(mfn) )
> >>      {
> >> -        *ipat = true;
> >> +        *ipat = type != p2m_mmio_direct ||
> >> +                (!is_iommu_enabled(d) && !cache_flush_permitted(d));
> > 
> > Looking at this, shouldn't the !mfn_valid special case be removed, and
> > mfns without a valid page be processed normally, so that the guest
> > MTRR values are taken into account, and no iPAT is enforced?
> 
> Such removal is what, in the post commit message remark, I'm referring to
> as "moving to too lax". Doing so might be okay, but will imo be hard to
> prove to be correct for all possible cases. Along these lines goes also
> that I'm adding the IOMMU-enabled and cache-flush checks: In principle
> p2m_mmio_direct should not be used when neither of these return true. Yet
> a similar consideration would apply to the immediately subsequent if().
> 
> Removing this code would, in particular, result in INVALID_MFN getting a
> type of WB by way of the subsequent if(), unless the type there would
> also be p2m_mmio_direct (which, as said, it ought to never be for non-
> pass-through domains). That again _may_ not be a problem as long as such
> EPT entries would never be marked present, yet that's again difficult to
> prove.

My understanding is that the !mfn_valid() check was a way to detect
MMIO regions in order to exit early and set those to UC.  I however
don't follow why the guest MTRR settings shouldn't also be applied to
those regions.

I'm also confused by your comment about "as such EPT entries would
never be marked present": non-present EPT entries don't even get into
epte_get_entry_emt(), and hence we could assert in epte_get_entry_emt
that mfn != INVALID_MFN?

> I was in fact wondering whether to special-case INVALID_MFN in the change
> I'm making. Question there is: Are we sure that by now we've indeed got
> rid of all arithmetic mistakenly done on MFN variables happening to hold
> INVALID_MFN as the value? IOW I fear that there might be code left which
> would pass in INVALID_MFN masked down to a 2M or 1G boundary. At which
> point checking for just INVALID_MFN would end up insufficient. If we
> meant to rely on this (tagging possible leftover issues as bugs we don't
> mean to attempt to cover for here anymore), then indeed the mfn_valid()
> check could be replaced by a comparison with INVALID_MFN (following a
> pattern we've been slowly trying to carry through elsewhere, especially
> in shadow code). Yet it could still not be outright dropped imo.
> 
> Furthermore simply dropping (or replacing as per above) that check won't
> work either: Further down in the function we use mfn_to_page(), which
> requires an up-front mfn_valid() check. That said, this code looks
> partly broken to me anyway: For a 1G page mfn_valid() on the start of it
> doesn't really imply all parts of it are valid. I guess I need to make a
> 2nd patch to address that as well, which may then want to be a prereq
> change to the one here (if we decided to go the route you're asking for).

I see, yes, the loop over the special pages array will need to be
adjusted to account for mfn_to_page() possibly returning NULL.

Overall I don't understand the need for this special case for
!mfn_valid().  The rest of special cases we have (the special pages
and domains without devices or MMIO regions assigned) are performance
optimizations which I do understand.  Yet the special casing of
!mfn_valid regions bypassing guest MTRR settings seems bogus to me.

> 
> > I also think this likely wants a:
> > 
> > Fixes: 81fd0d3ca4b2 ('x86/hvm: simplify 'mmio_direct' check in epte_get_entry_emt()')
> 
> Oh, indeed, I should have dug out when this broke. I didn't because I
> knew this mfn_valid() check was there forever, neglecting that it wasn't
> always (almost) first.
> 
> > As AFAICT before that commit direct MMIO regions would set iPAT to WB,
> > which would result in the correct attributes (albeit guest MTRR was
> > still ignored).
> 
> Two corrections here: First iPAT is a boolean; it can't be set to WB.
> And then what was happening prior to that change was that for the APIC
> access page iPAT was set to true, thus forcing WB there. iPAT was left
> set to false for all other p2m_mmio_direct pages, yielding (PAT-
> overridable) UC there.

Right, that behavior was still dubious to me, as I would assume those
regions would also want to fetch the type from guest MTRRs.

Thanks, Roger.

Jan Beulich June 11, 2024, 9:33 a.m. UTC | #5

On 11.06.2024 11:02, Roger Pau Monné wrote:
> On Tue, Jun 11, 2024 at 10:26:32AM +0200, Jan Beulich wrote:
>> On 11.06.2024 09:41, Roger Pau Monné wrote:
>>> On Mon, Jun 10, 2024 at 04:58:52PM +0200, Jan Beulich wrote:
>>>> --- a/xen/arch/x86/mm/p2m-ept.c
>>>> +++ b/xen/arch/x86/mm/p2m-ept.c
>>>> @@ -503,7 +503,8 @@ int epte_get_entry_emt(struct domain *d,
>>>>  
>>>>      if ( !mfn_valid(mfn) )
>>>>      {
>>>> -        *ipat = true;
>>>> +        *ipat = type != p2m_mmio_direct ||
>>>> +                (!is_iommu_enabled(d) && !cache_flush_permitted(d));
>>>
>>> Looking at this, shouldn't the !mfn_valid special case be removed, and
>>> mfns without a valid page be processed normally, so that the guest
>>> MTRR values are taken into account, and no iPAT is enforced?
>>
>> Such removal is what, in the post commit message remark, I'm referring to
>> as "moving to too lax". Doing so might be okay, but will imo be hard to
>> prove to be correct for all possible cases. Along these lines goes also
>> that I'm adding the IOMMU-enabled and cache-flush checks: In principle
>> p2m_mmio_direct should not be used when neither of these return true. Yet
>> a similar consideration would apply to the immediately subsequent if().
>>
>> Removing this code would, in particular, result in INVALID_MFN getting a
>> type of WB by way of the subsequent if(), unless the type there would
>> also be p2m_mmio_direct (which, as said, it ought to never be for non-
>> pass-through domains). That again _may_ not be a problem as long as such
>> EPT entries would never be marked present, yet that's again difficult to
>> prove.
> 
> My understanding is that the !mfn_valid() check was a way to detect
> MMIO regions in order to exit early and set those to UC.  I however
> don't follow why the guest MTRR settings shouldn't also be applied to
> those regions.

It's unclear to me whether the original purpose of the check really was
(just) MMIO. It could as well also have been to cover the (then not yet
named that way) case of INVALID_MFN.

As to ignoring guest MTRRs for MMIO: I think that's to be on the safe
side. We don't want guests to map uncachable memory with a cachable
memory type. Yet control isn't fine grained enough to prevent just
that. Hence why we force UC, allowing merely to move to WC via PAT.

> I'm also confused by your comment about "as such EPT entries would
> never be marked present": non-present EPT entries don't even get into
> epte_get_entry_emt(), and hence we could assert in epte_get_entry_emt
> that mfn != INVALID_MFN?

I don't think we can. Especially for the call from ept_set_entry() I
can't spot anything that would prevent the call for non-present entries.
This may be a mistake, but I can't do anything about it right here.

>> I was in fact wondering whether to special-case INVALID_MFN in the change
>> I'm making. Question there is: Are we sure that by now we've indeed got
>> rid of all arithmetic mistakenly done on MFN variables happening to hold
>> INVALID_MFN as the value? IOW I fear that there might be code left which
>> would pass in INVALID_MFN masked down to a 2M or 1G boundary. At which
>> point checking for just INVALID_MFN would end up insufficient. If we
>> meant to rely on this (tagging possible leftover issues as bugs we don't
>> mean to attempt to cover for here anymore), then indeed the mfn_valid()
>> check could be replaced by a comparison with INVALID_MFN (following a
>> pattern we've been slowly trying to carry through elsewhere, especially
>> in shadow code). Yet it could still not be outright dropped imo.
>>
>> Furthermore simply dropping (or replacing as per above) that check won't
>> work either: Further down in the function we use mfn_to_page(), which
>> requires an up-front mfn_valid() check. That said, this code looks
>> partly broken to me anyway: For a 1G page mfn_valid() on the start of it
>> doesn't really imply all parts of it are valid. I guess I need to make a
>> 2nd patch to address that as well, which may then want to be a prereq
>> change to the one here (if we decided to go the route you're asking for).
> 
> I see, yes, the loop over the special pages array will need to be
> adjusted to account for mfn_to_page() possibly returning NULL.

Except that NULL will hardly ever come back there. What we need is an
explicit mfn_valid() check. I already have a patch, but I'd like to
submit it only once I know what the v2 of the one here is going to look
like.

> Overall I don't understand the need for this special case for
> !mfn_valid().  The rest of special cases we have (the special pages
> and domains without devices or MMIO regions assigned) are performance
> optimizations which I do understand.  Yet the special casing of
> !mfn_valid regions bypassing guest MTRR settings seems bogus to me.

As said, it may well be that we can (now) switch to comparison against
INVALID_MFN there, if we're certain MMIO isn't to be covered by this
(anymore).

>>> I also think this likely wants a:
>>>
>>> Fixes: 81fd0d3ca4b2 ('x86/hvm: simplify 'mmio_direct' check in epte_get_entry_emt()')
>>
>> Oh, indeed, I should have dug out when this broke. I didn't because I
>> knew this mfn_valid() check was there forever, neglecting that it wasn't
>> always (almost) first.
>>
>>> As AFAICT before that commit direct MMIO regions would set iPAT to WB,
>>> which would result in the correct attributes (albeit guest MTRR was
>>> still ignored).
>>
>> Two corrections here: First iPAT is a boolean; it can't be set to WB.
>> And then what was happening prior to that change was that for the APIC
>> access page iPAT was set to true, thus forcing WB there. iPAT was left
>> set to false for all other p2m_mmio_direct pages, yielding (PAT-
>> overridable) UC there.
> 
> Right, that behavior was still dubious to me, as I would assume those
> regions would also want to fetch the type from guest MTRRs.

Well, for the APIC access page we want to prevent it becoming UC. It's MMIO
from the guest's perspective, yet _we_ know it's really ordinary RAM. For
actual MMIO see above; the only case where we probably ought to respect
guest MTRRs is when they say WC (following from what I said further up).
Yet that's again an independent change to (possibly) make.

Jan

Oleksii Kurochko June 11, 2024, 10:40 a.m. UTC | #6

On Mon, 2024-06-10 at 17:00 +0200, Jan Beulich wrote:
> On 10.06.2024 16:58, Jan Beulich wrote:
> > mfn_valid() is RAM-focused; it will often return false for MMIO. Yet
> > access to actual MMIO space should not generally be restricted to UC
> > only; especially video frame buffer accesses are unduly affected by such
> > a restriction. Permit PAT use for directly assigned MMIO as long as the
> > domain is known to have been granted some level of cache control.
> > 
> > Signed-off-by: Jan Beulich <jbeulich@suse.com>
Release-Acked-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>

~ Oleksii

> > ---
> > Considering that we've just declared PVH Dom0 "supported", this may well
> > qualify for 4.19. The issue was specifically very noticeable there.
> 
> Actually - meant to Cc Oleksii for this, and then forgot.
> 
> Jan
> 
> > The conditional may be more complex than really necessary, but it's in
> > line with what we do elsewhere. And imo better continue to be a little
> > too restrictive, than moving to too lax.
> > 
> > --- a/xen/arch/x86/mm/p2m-ept.c
> > +++ b/xen/arch/x86/mm/p2m-ept.c
> > @@ -503,7 +503,8 @@ int epte_get_entry_emt(struct domain *d,
> >  
> >      if ( !mfn_valid(mfn) )
> >      {
> > -        *ipat = true;
> > +        *ipat = type != p2m_mmio_direct ||
> > +                (!is_iommu_enabled(d) && !cache_flush_permitted(d));
> >          return X86_MT_UC;
> >      }
> >  
>

Roger Pau Monne June 11, 2024, 11:08 a.m. UTC | #7

On Tue, Jun 11, 2024 at 11:33:24AM +0200, Jan Beulich wrote:
> On 11.06.2024 11:02, Roger Pau Monné wrote:
> > On Tue, Jun 11, 2024 at 10:26:32AM +0200, Jan Beulich wrote:
> >> On 11.06.2024 09:41, Roger Pau Monné wrote:
> >>> On Mon, Jun 10, 2024 at 04:58:52PM +0200, Jan Beulich wrote:
> >>>> --- a/xen/arch/x86/mm/p2m-ept.c
> >>>> +++ b/xen/arch/x86/mm/p2m-ept.c
> >>>> @@ -503,7 +503,8 @@ int epte_get_entry_emt(struct domain *d,
> >>>>  
> >>>>      if ( !mfn_valid(mfn) )
> >>>>      {
> >>>> -        *ipat = true;
> >>>> +        *ipat = type != p2m_mmio_direct ||
> >>>> +                (!is_iommu_enabled(d) && !cache_flush_permitted(d));
> >>>
> >>> Looking at this, shouldn't the !mfn_valid special case be removed, and
> >>> mfns without a valid page be processed normally, so that the guest
> >>> MTRR values are taken into account, and no iPAT is enforced?
> >>
> >> Such removal is what, in the post commit message remark, I'm referring to
> >> as "moving to too lax". Doing so might be okay, but will imo be hard to
> >> prove to be correct for all possible cases. Along these lines goes also
> >> that I'm adding the IOMMU-enabled and cache-flush checks: In principle
> >> p2m_mmio_direct should not be used when neither of these return true. Yet
> >> a similar consideration would apply to the immediately subsequent if().
> >>
> >> Removing this code would, in particular, result in INVALID_MFN getting a
> >> type of WB by way of the subsequent if(), unless the type there would
> >> also be p2m_mmio_direct (which, as said, it ought to never be for non-
> >> pass-through domains). That again _may_ not be a problem as long as such
> >> EPT entries would never be marked present, yet that's again difficult to
> >> prove.
> > 
> > My understanding is that the !mfn_valid() check was a way to detect
> > MMIO regions in order to exit early and set those to UC.  I however
> > don't follow why the guest MTRR settings shouldn't also be applied to
> > those regions.
> 
> It's unclear to me whether the original purpose of the check really was
> (just) MMIO. It could as well also have been to cover the (then not yet
> named that way) case of INVALID_MFN.
> 
> As to ignoring guest MTRRs for MMIO: I think that's to be on the safe
> side. We don't want guests to map uncachable memory with a cachable
> memory type. Yet control isn't fine grained enough to prevent just
> that. Hence why we force UC, allowing merely to move to WC via PAT.

Would that be to cover up for guest bugs, or is there a coherency
reason for not allowing guests to access memory using fully guest
chosen cache attributes?

I really wonder whether Xen has enough information to figure out
whether a hole (MMIO region) is supposed to be accessed as UC or
something else.

Your proposed patch already allows the guest to set such attributes in
PAT, and hence I don't see why also taking guest MTRRs into account
would be any worse.

> > I'm also confused by your comment about "as such EPT entries would
> > never be marked present": non-present EPT entries don't even get into
> > epte_get_entry_emt(), and hence we could assert in epte_get_entry_emt
> > that mfn != INVALID_MFN?
> 
> I don't think we can. Especially for the call from ept_set_entry() I
> can't spot anything that would prevent the call for non-present entries.
> This may be a mistake, but I can't do anything about it right here.

Hm, I see, then we should explicitly handle INVALID_MFN in
epte_get_entry_emt(), and just return early.

> >>> I also think this likely wants a:
> >>>
> >>> Fixes: 81fd0d3ca4b2 ('x86/hvm: simplify 'mmio_direct' check in epte_get_entry_emt()')
> >>
> >> Oh, indeed, I should have dug out when this broke. I didn't because I
> >> knew this mfn_valid() check was there forever, neglecting that it wasn't
> >> always (almost) first.
> >>
> >>> As AFAICT before that commit direct MMIO regions would set iPAT to WB,
> >>> which would result in the correct attributes (albeit guest MTRR was
> >>> still ignored).
> >>
> >> Two corrections here: First iPAT is a boolean; it can't be set to WB.
> >> And then what was happening prior to that change was that for the APIC
> >> access page iPAT was set to true, thus forcing WB there. iPAT was left
> >> set to false for all other p2m_mmio_direct pages, yielding (PAT-
> >> overridable) UC there.
> > 
> > Right, that behavior was still dubious to me, as I would assume those
> > regions would also want to fetch the type from guest MTRRs.
> 
> Well, for the APIC access page we want to prevent it becoming UC. It's MMIO
> from the guest's perspective, yet _we_ know it's really ordinary RAM. For
> actual MMIO see above; the only case where we probably ought to respect
> guest MTRRs is when they say WC (following from what I said further up).
> Yet that's again an independent change to (possibly) make.

For emulated devices we might map regular RAM into what the guest
otherwise thinks is MMIO.  Maybe the mfn_valid() check should be
inverted, and return WB when the underlying mfn is RAM, and otherwise
use the guest MTRRs to decide the cache attribute?

Thanks, Roger.

Jan Beulich June 11, 2024, 11:52 a.m. UTC | #8

On 11.06.2024 13:08, Roger Pau Monné wrote:
> On Tue, Jun 11, 2024 at 11:33:24AM +0200, Jan Beulich wrote:
>> On 11.06.2024 11:02, Roger Pau Monné wrote:
>>> On Tue, Jun 11, 2024 at 10:26:32AM +0200, Jan Beulich wrote:
>>>> On 11.06.2024 09:41, Roger Pau Monné wrote:
>>>>> On Mon, Jun 10, 2024 at 04:58:52PM +0200, Jan Beulich wrote:
>>>>>> --- a/xen/arch/x86/mm/p2m-ept.c
>>>>>> +++ b/xen/arch/x86/mm/p2m-ept.c
>>>>>> @@ -503,7 +503,8 @@ int epte_get_entry_emt(struct domain *d,
>>>>>>  
>>>>>>      if ( !mfn_valid(mfn) )
>>>>>>      {
>>>>>> -        *ipat = true;
>>>>>> +        *ipat = type != p2m_mmio_direct ||
>>>>>> +                (!is_iommu_enabled(d) && !cache_flush_permitted(d));
>>>>>
>>>>> Looking at this, shouldn't the !mfn_valid special case be removed, and
>>>>> mfns without a valid page be processed normally, so that the guest
>>>>> MTRR values are taken into account, and no iPAT is enforced?
>>>>
>>>> Such removal is what, in the post commit message remark, I'm referring to
>>>> as "moving to too lax". Doing so might be okay, but will imo be hard to
>>>> prove to be correct for all possible cases. Along these lines goes also
>>>> that I'm adding the IOMMU-enabled and cache-flush checks: In principle
>>>> p2m_mmio_direct should not be used when neither of these return true. Yet
>>>> a similar consideration would apply to the immediately subsequent if().
>>>>
>>>> Removing this code would, in particular, result in INVALID_MFN getting a
>>>> type of WB by way of the subsequent if(), unless the type there would
>>>> also be p2m_mmio_direct (which, as said, it ought to never be for non-
>>>> pass-through domains). That again _may_ not be a problem as long as such
>>>> EPT entries would never be marked present, yet that's again difficult to
>>>> prove.
>>>
>>> My understanding is that the !mfn_valid() check was a way to detect
>>> MMIO regions in order to exit early and set those to UC.  I however
>>> don't follow why the guest MTRR settings shouldn't also be applied to
>>> those regions.
>>
>> It's unclear to me whether the original purpose of the check really was
>> (just) MMIO. It could as well also have been to cover the (then not yet
>> named that way) case of INVALID_MFN.
>>
>> As to ignoring guest MTRRs for MMIO: I think that's to be on the safe
>> side. We don't want guests to map uncachable memory with a cachable
>> memory type. Yet control isn't fine grained enough to prevent just
>> that. Hence why we force UC, allowing merely to move to WC via PAT.
> 
> Would that be to cover up for guest bugs, or is there a coherency
> reason for not allowing guests to access memory using fully guest
> chosen cache attributes?

I think the main reason is that this way we don't need to bother thinking
of whether MMIO regions may need caches flushed in order for us to be
sure memory is all up-to-date. But I have no insight into what the
original reasons here may have been.

> I really wonder whether Xen has enough information to figure out
> whether a hole (MMIO region) is supposed to be accessed as UC or
> something else.

It certainly hasn't, and hence is erring on the (safe) side of forcing
UC.

> Your proposed patch already allows guest to set such attributes in
> PAT, and hence I don't see why also taking guest MTRRs into account
> would be any worse.

Whatever the guest sets in PAT, UC in EMT will win except for the
special case of WC.
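
The rule stated above can be sketched in C as follows. This is only an illustrative model of the combination for an EPT entry whose EMT is UC, not code from Xen: the helper name effective_type_emt_uc() and the enum are hypothetical, and the numeric values are the usual x86 memory type encodings.

```c
#include <stdbool.h>

/* x86 memory type encodings (UC- exists in PAT only). */
enum memtype {
    MT_UC  = 0,
    MT_WC  = 1,
    MT_WT  = 4,
    MT_WP  = 5,
    MT_WB  = 6,
    MT_UCM = 7, /* UC- */
};

/*
 * Hypothetical helper: effective type of an access through an EPT entry
 * whose EMT field is UC.  With iPAT set the guest PAT type is ignored.
 * With iPAT clear, only a guest PAT type of WC overrides the UC.
 */
static inline enum memtype effective_type_emt_uc(bool ipat,
                                                 enum memtype guest_pat)
{
    if ( ipat )
        return MT_UC;

    return guest_pat == MT_WC ? MT_WC : MT_UC;
}
```

With iPAT set, a frame buffer mapped WC via PAT would still be accessed UC; with iPAT clear it becomes WC, while a guest PAT type of WB still ends up UC.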

>>>>> I also think this likely wants a:
>>>>>
>>>>> Fixes: 81fd0d3ca4b2 ('x86/hvm: simplify 'mmio_direct' check in epte_get_entry_emt()')
>>>>
>>>> Oh, indeed, I should have dug out when this broke. I didn't because I
>>>> knew this mfn_valid() check was there forever, neglecting that it wasn't
>>>> always (almost) first.
>>>>
>>>>> As AFAICT before that commit direct MMIO regions would set iPAT to WB,
>>>>> which would result in the correct attributes (albeit guest MTRR was
>>>>> still ignored).
>>>>
>>>> Two corrections here: First iPAT is a boolean; it can't be set to WB.
>>>> And then what was happening prior to that change was that for the APIC
>>>> access page iPAT was set to true, thus forcing WB there. iPAT was left
>>>> set to false for all other p2m_mmio_direct pages, yielding (PAT-
>>>> overridable) UC there.
>>>
>>> Right, that behavior was still dubious to me, as I would assume those
>>> regions would also want to fetch the type from guest MTRRs.
>>
>> Well, for the APIC access page we want to prevent it becoming UC. It's MMIO
>> from the guest's perspective, yet _we_ know it's really ordinary RAM. For
>> actual MMIO see above; the only case where we probably ought to respect
>> guest MTRRs is when they say WC (following from what I said further up).
>> Yet that's again an independent change to (possibly) make.
> 
> For emulated devices we might map regular RAM into what the guest
> otherwise thinks is MMIO.

Right, and for non-pass-through domains we force everything to WB already.

>  Maybe the mfn_valid() check should be
> inverted, and return WB when the underlying mfn is RAM, and otherwise
> use the guest MTRRs to decide the cache attribute?

First: Whether WB is correct for RAM isn't known. With some peculiar device
assigned, the guest may want/need part of its RAM be e.g. WC or WT. (It's
only without any physical devices assigned that we can be quite sure that
WB is good for all of RAM.) Therefore, second, I think respecting MTRRs for
RAM is less likely to cause problems than respecting them for MMIO.

I think at this point the main question is: Do we want to do things at least
along the lines of this v1, or do we instead feel certain enough to switch
the mfn_valid() to a comparison against INVALID_MFN (and perhaps moving it
up to almost the top of the function)? One caveat here that I forgot to
mention before: MFNs taken out of EPT entries will never be INVALID_MFN, for
the truncation that happens when populating entries. In that case we rely on
mfn_valid() to be "rejecting" them.
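
A minimal sketch of why a plain comparison against INVALID_MFN can miss such values. The names below are hypothetical stand-ins, the 40-bit field width is an assumption made for illustration, and INVALID_MFN is modelled as an all-ones 64-bit value. The second helper models the related concern raised elsewhere in this thread, namely an INVALID_MFN that was masked down to a 2M or 1G boundary.

```c
#include <stdbool.h>
#include <stdint.h>

#define INVALID_MFN_SKETCH (~UINT64_C(0)) /* models INVALID_MFN (all ones) */
#define EPTE_MFN_BITS      40             /* assumed width of the MFN field */

/* Value read back after storing an MFN into a width-limited entry field. */
static inline uint64_t epte_store_mfn(uint64_t mfn)
{
    return mfn & ((UINT64_C(1) << EPTE_MFN_BITS) - 1);
}

/* MFN masked down to a superpage boundary (order 9 = 2M, order 18 = 1G). */
static inline uint64_t superpage_align(uint64_t mfn, unsigned int order)
{
    return mfn & ~((UINT64_C(1) << order) - 1);
}

/* The equality check that would replace mfn_valid(). */
static inline bool is_invalid_mfn(uint64_t mfn)
{
    return mfn == INVALID_MFN_SKETCH;
}
```

Both epte_store_mfn(INVALID_MFN_SKETCH) and superpage_align(INVALID_MFN_SKETCH, 9) yield values for which is_invalid_mfn() returns false, which is the case a range-based check like mfn_valid() still catches.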

Jan

Roger Pau Monne June 11, 2024, 1:52 p.m. UTC | #9

On Tue, Jun 11, 2024 at 01:52:58PM +0200, Jan Beulich wrote:
> On 11.06.2024 13:08, Roger Pau Monné wrote:
> > On Tue, Jun 11, 2024 at 11:33:24AM +0200, Jan Beulich wrote:
> >> On 11.06.2024 11:02, Roger Pau Monné wrote:
> >>> On Tue, Jun 11, 2024 at 10:26:32AM +0200, Jan Beulich wrote:
> >>>> On 11.06.2024 09:41, Roger Pau Monné wrote:
> >>>>> On Mon, Jun 10, 2024 at 04:58:52PM +0200, Jan Beulich wrote:
> >>>>>> --- a/xen/arch/x86/mm/p2m-ept.c
> >>>>>> +++ b/xen/arch/x86/mm/p2m-ept.c
> >>>>>> @@ -503,7 +503,8 @@ int epte_get_entry_emt(struct domain *d,
> >>>>>>  
> >>>>>>      if ( !mfn_valid(mfn) )
> >>>>>>      {
> >>>>>> -        *ipat = true;
> >>>>>> +        *ipat = type != p2m_mmio_direct ||
> >>>>>> +                (!is_iommu_enabled(d) && !cache_flush_permitted(d));
> >>>>>
> >>>>> Looking at this, shouldn't the !mfn_valid special case be removed, and
> >>>>> mfns without a valid page be processed normally, so that the guest
> >>>>> MTRR values are taken into account, and no iPAT is enforced?
> >>>>
> >>>> Such removal is what, in the post commit message remark, I'm referring to
> >>>> as "moving to too lax". Doing so might be okay, but will imo be hard to
> >>>> prove to be correct for all possible cases. Along these lines goes also
> >>>> that I'm adding the IOMMU-enabled and cache-flush checks: In principle
> >>>> p2m_mmio_direct should not be used when neither of these return true. Yet
> >>>> a similar consideration would apply to the immediately subsequent if().
> >>>>
> >>>> Removing this code would, in particular, result in INVALID_MFN getting a
> >>>> type of WB by way of the subsequent if(), unless the type there would
> >>>> also be p2m_mmio_direct (which, as said, it ought to never be for non-
> >>>> pass-through domains). That again _may_ not be a problem as long as such
> >>>> EPT entries would never be marked present, yet that's again difficult to
> >>>> prove.
> >>>
> >>> My understanding is that the !mfn_valid() check was a way to detect
> >>> MMIO regions in order to exit early and set those to UC.  I however
> >>> don't follow why the guest MTRR settings shouldn't also be applied to
> >>> those regions.
> >>
> >> It's unclear to me whether the original purpose of the check really was
> >> (just) MMIO. It could as well also have been to cover the (then not yet
> >> named that way) case of INVALID_MFN.
> >>
> >> As to ignoring guest MTRRs for MMIO: I think that's to be on the safe
> >> side. We don't want guests to map uncachable memory with a cachable
> >> memory type. Yet control isn't fine grained enough to prevent just
> >> that. Hence why we force UC, allowing merely to move to WC via PAT.
> > 
> > Would that be to cover up for guest bugs, or is there a coherency
> > reason for not allowing guests to access memory using fully guest
> > chosen cache attributes?
> 
> I think the main reason is that this way we don't need to bother thinking
> of whether MMIO regions may need caches flushed in order for us to be
> sure memory is all up-to-date. But I have no insight into what the
> original reasons here may have been.
> 
> > I really wonder whether Xen has enough information to figure out
> > whether a hole (MMIO region) is supposed to be accessed as UC or
> > something else.
> 
> It certainly hasn't, and hence is erring on the (safe) side of forcing
> UC.

Except that for the vesa framebuffer at least this is a bad choice :).

> > Your proposed patch already allows guest to set such attributes in
> > PAT, and hence I don't see why also taking guest MTRRs into account
> > would be any worse.
> 
> Whatever the guest sets in PAT, UC in EMT will win except for the
> special case of WC.
> 
> >>>>> I also think this likely wants a:
> >>>>>
> >>>>> Fixes: 81fd0d3ca4b2 ('x86/hvm: simplify 'mmio_direct' check in epte_get_entry_emt()')
> >>>>
> >>>> Oh, indeed, I should have dug out when this broke. I didn't because I
> >>>> knew this mfn_valid() check was there forever, neglecting that it wasn't
> >>>> always (almost) first.
> >>>>
> >>>>> As AFAICT before that commit direct MMIO regions would set iPAT to WB,
> >>>>> which would result in the correct attributes (albeit guest MTRR was
> >>>>> still ignored).
> >>>>
> >>>> Two corrections here: First iPAT is a boolean; it can't be set to WB.
> >>>> And then what was happening prior to that change was that for the APIC
> >>>> access page iPAT was set to true, thus forcing WB there. iPAT was left
> >>>> set to false for all other p2m_mmio_direct pages, yielding (PAT-
> >>>> overridable) UC there.
> >>>
> >>> Right, that behavior was still dubious to me, as I would assume those
> >>> regions would also want to fetch the type from guest MTRRs.
> >>
> >> Well, for the APIC access page we want to prevent it becoming UC. It's MMIO
> >> from the guest's perspective, yet _we_ know it's really ordinary RAM. For
> >> actual MMIO see above; the only case where we probably ought to respect
> >> guest MTRRs is when they say WC (following from what I said further up).
> >> Yet that's again an independent change to (possibly) make.
> > 
> > For emulated devices we might map regular RAM into what the guest
> > otherwise thinks is MMIO.
> 
> Right, and for non-pass-through domains we force everything to WB already.
> 
> >  Maybe the mfn_valid() check should be
> > inverted, and return WB when the underlying mfn is RAM, and otherwise
> > use the guest MTRRs to decide the cache attribute?
> 
> First: Whether WB is correct for RAM isn't known. With some peculiar device
> assigned, the guest may want/need part of its RAM be e.g. WC or WT. (It's
> only without any physical devices assigned that we can be quite sure that
> WB is good for all of RAM.) Therefore, second, I think respecting MTRRs for
> RAM is less likely to cause problems than respecting them for MMIO.
> 
> I think at this point the main question is: Do we want to do things at least
> along the lines of this v1, or do we instead feel certain enough to switch
> the mfn_valid() to a comparison against INVALID_MFN (and perhaps moving it
> up to almost the top of the function)?

My preferred option would be the latter, as that would remove a special
casing.  However, I'm unsure how much fallout this could cause - those
caching changes are always tricky and lead to unexpected fallout.

OTOH the current !mfn_valid() check is very restrictive, as it forces
all MMIO to UC.  So by removing it we allow guest chosen types to take
effect, which are likely less restrictive than UC (whether those are
correct is another question).

> One caveat here that I forgot to
> mention before: MFNs taken out of EPT entries will never be INVALID_MFN, for
> the truncation that happens when populating entries. In that case we rely on
> mfn_valid() to be "rejecting" them.

The only caller where mfns from EPT entries are passed to
epte_get_entry_emt() is in resolve_misconfig() AFAICT, and in that
case the EPT entry must be present for epte_get_entry_emt() to be
called.  So it seems to me that epte_get_entry_emt() can never be
called from an mfn constructed from an INVALID_MFN EPT entry (but it's
worth adding an assert for it).

Thanks, Roger.

Andrew Cooper June 11, 2024, 1:55 p.m. UTC | #10

On 11/06/2024 10:33 am, Jan Beulich wrote:
> On 11.06.2024 11:02, Roger Pau Monné wrote:
>> On Tue, Jun 11, 2024 at 10:26:32AM +0200, Jan Beulich wrote:
>>> On 11.06.2024 09:41, Roger Pau Monné wrote:
>>>> On Mon, Jun 10, 2024 at 04:58:52PM +0200, Jan Beulich wrote:
>>>>> --- a/xen/arch/x86/mm/p2m-ept.c
>>>>> +++ b/xen/arch/x86/mm/p2m-ept.c
>>>>> @@ -503,7 +503,8 @@ int epte_get_entry_emt(struct domain *d,
>>>>>  
>>>>>      if ( !mfn_valid(mfn) )
>>>>>      {
>>>>> -        *ipat = true;
>>>>> +        *ipat = type != p2m_mmio_direct ||
>>>>> +                (!is_iommu_enabled(d) && !cache_flush_permitted(d));
>>>> Looking at this, shouldn't the !mfn_valid special case be removed, and
>>>> mfns without a valid page be processed normally, so that the guest
>>>> MTRR values are taken into account, and no iPAT is enforced?
>>> Such removal is what, in the post commit message remark, I'm referring to
>>> as "moving to too lax". Doing so might be okay, but will imo be hard to
>>> prove to be correct for all possible cases. Along these lines goes also
>>> that I'm adding the IOMMU-enabled and cache-flush checks: In principle
>>> p2m_mmio_direct should not be used when neither of these return true. Yet
>>> a similar consideration would apply to the immediately subsequent if().
>>>
>>> Removing this code would, in particular, result in INVALID_MFN getting a
>>> type of WB by way of the subsequent if(), unless the type there would
>>> also be p2m_mmio_direct (which, as said, it ought to never be for non-
>>> pass-through domains). That again _may_ not be a problem as long as such
>>> EPT entries would never be marked present, yet that's again difficult to
>>> prove.
>> My understanding is that the !mfn_valid() check was a way to detect
>> MMIO regions in order to exit early and set those to UC.  I however
>> don't follow why the guest MTRR settings shouldn't also be applied to
>> those regions.
> It's unclear to me whether the original purpose of the check really was
> (just) MMIO. It could as well also have been to cover the (then not yet
> named that way) case of INVALID_MFN.
>
> As to ignoring guest MTRRs for MMIO: I think that's to be on the safe
> side. We don't want guests to map uncachable memory with a cachable
> memory type. Yet control isn't fine grained enough to prevent just
> that. Hence why we force UC, allowing merely to move to WC via PAT.
>
>> I'm also confused by your comment about "as such EPT entries would
>> never be marked present": non-present EPT entries don't even get into
>> epte_get_entry_emt(), and hence we could assert in epte_get_entry_emt
>> that mfn != INVALID_MFN?
> I don't think we can. Especially for the call from ept_set_entry() I
> can't spot anything that would prevent the call for non-present entries.
> This may be a mistake, but I can't do anything about it right here.
>
>>> I was in fact wondering whether to special-case INVALID_MFN in the change
>>> I'm making. Question there is: Are we sure that by now we've indeed got
>>> rid of all arithmetic mistakenly done on MFN variables happening to hold
>>> INVALID_MFN as the value? IOW I fear that there might be code left which
>>> would pass in INVALID_MFN masked down to a 2M or 1G boundary. At which
>>> point checking for just INVALID_MFN would end up insufficient. If we
>>> meant to rely on this (tagging possible leftover issues as bugs we don't
>>> mean to attempt to cover for here anymore), then indeed the mfn_valid()
>>> check could be replaced by a comparison with INVALID_MFN (following a
>>> pattern we've been slowly trying to carry through elsewhere, especially
>>> in shadow code). Yet it could still not be outright dropped imo.
>>>
>>> Furthermore simply dropping (or replacing as per above) that check won't
>>> work either: Further down in the function we use mfn_to_page(), which
>>> requires an up-front mfn_valid() check. That said, this code looks
>>> partly broken to me anyway: For a 1G page mfn_valid() on the start of it
>>> doesn't really imply all parts of it are valid. I guess I need to make a
>>> 2nd patch to address that as well, which may then want to be a prereq
>>> change to the one here (if we decided to go the route you're asking for).
>> I see, yes, the loop over the special pages array will need to be
>> adjusted to account for mfn_to_page() possibly returning NULL.
> Except that NULL will hardly ever come back there. What we need is an
> explicit mfn_valid() check. I already have a patch, but I'd like to
> submit it only once I know how the v2 of the one here is going to look
> like.
>
>> Overall I don't understand the need for this special case for
>> !mfn_valid().  The rest of special cases we have (the special pages
>> and domains without devices or MMIO regions assigned) are performance
>> optimizations which I do understand.  Yet the special casing of
>> !mfn_valid regions bypassing guest MTRR settings seems bogus to me.
> As said, it may well be that we can (now) switch to comparison against
> INVALID_MFN there, if we're certain MMIO isn't to be covered by this
> (anymore).
>
>>>> I also think this likely wants a:
>>>>
>>>> Fixes: 81fd0d3ca4b2 ('x86/hvm: simplify 'mmio_direct' check in epte_get_entry_emt()')
>>> Oh, indeed, I should have dug out when this broke. I didn't because I
>>> knew this mfn_valid() check was there forever, neglecting that it wasn't
>>> always (almost) first.
>>>
>>>> As AFAICT before that commit direct MMIO regions would set iPAT to WB,
>>>> which would result in the correct attributes (albeit guest MTRR was
>>>> still ignored).
>>> Two corrections here: First iPAT is a boolean; it can't be set to WB.
>>> And then what was happening prior to that change was that for the APIC
>>> access page iPAT was set to true, thus forcing WB there. iPAT was left
>>> set to false for all other p2m_mmio_direct pages, yielding (PAT-
>>> overridable) UC there.
>> Right, that behavior was still dubious to me, as I would assume those
>> regions would also want to fetch the type from guest MTRRs.
> Well, for the APIC access page we want to prevent it becoming UC. It's MMIO
> from the guest's perspective, yet _we_ know it's really ordinary RAM.

It's really not "ordinary" RAM.

For both Intel and AMD, APIC acceleration is triggered based on a memory
operand match in host physical address space, but accesses are
redirected to the (per vCPU) APIC register page.

Intel state that the EPT translation must be a 4k translation, and AMD
state that the NPT perms must be RW.

I can't actually find any statement about cacheability.  I expect this
is because it's never actually accessed.  (Intel go as far as saying
that even if you CLFLUSH against it, because of the redirect, you'll end
up flushing the respective line in the APIC Regs page.)

Irrespective, it appears that the cacheability doesn't matter, but I
would recommend against using it as a representative example for the
discussion here.

~Andrew

Jan Beulich June 11, 2024, 2:53 p.m. UTC | #11

On 11.06.2024 15:52, Roger Pau Monné wrote:
> On Tue, Jun 11, 2024 at 01:52:58PM +0200, Jan Beulich wrote:
>> On 11.06.2024 13:08, Roger Pau Monné wrote:
>>> I really wonder whether Xen has enough information to figure out
>>> whether a hole (MMIO region) is supposed to be accessed as UC or
>>> something else.
>>
>> It certainly hasn't, and hence is erring on the (safe) side of forcing
>> UC.
> 
> Except that for the vesa framebuffer at least this is a bad choice :).

Well, yes, that's where we want WC to be permitted. But for that we only
need to avoid setting iPAT; we still can uniformly hand back UC. Except
(as mentioned elsewhere earlier) if the guest uses MTRRs rather than PAT
to arrange for WC.

>>>  Maybe the mfn_valid() check should be
>>> inverted, and return WB when the underlying mfn is RAM, and otherwise
>>> use the guest MTRRs to decide the cache attribute?
>>
>> First: Whether WB is correct for RAM isn't known. With some peculiar device
>> assigned, the guest may want/need part of its RAM be e.g. WC or WT. (It's
>> only without any physical devices assigned that we can be quite sure that
>> WB is good for all of RAM.) Therefore, second, I think respecting MTRRs for
>> RAM is less likely to cause problems than respecting them for MMIO.
>>
>> I think at this point the main question is: Do we want to do things at least
>> along the lines of this v1, or do we instead feel certain enough to switch
>> the mfn_valid() to a comparison against INVALID_MFN (and perhaps moving it
>> up to almost the top of the function)?
> 
> My preferred option would be the latter, as that would remove a special
> casing.  However, I'm unsure how much fallout this could cause - those
> caching changes are always tricky and lead to unexpected fallout.

Which is the very reason why I tried to avoid going too far with this.

> OTOH the current !mfn_valid() check is very restrictive, as it forces
> all MMIO to UC.

Which is why, in this v1, I'm relaxing only the iPAT part.
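
As a rough, self-contained model of the expression in the v1 hunk (the types and names here are hypothetical stand-ins, not Xen's own; the two booleans merely model what is_iommu_enabled(d) and cache_flush_permitted(d) would return):

```c
#include <stdbool.h>

/* Stand-in for Xen's p2m types; only the distinction used here is kept. */
typedef enum { P2M_RAM_RW, P2M_MMIO_DIRECT } p2m_type_sketch_t;

/* Hypothetical summary of the per-domain state the hunk consults. */
struct domain_sketch {
    bool iommu_enabled;         /* models is_iommu_enabled(d) */
    bool cache_flush_permitted; /* models cache_flush_permitted(d) */
};

/*
 * iPAT value for an MFN failing mfn_valid(), following the v1 hunk: iPAT
 * stays set unless the entry is direct MMIO of a domain which has an IOMMU
 * enabled or is permitted to flush caches.  The memory type handed back
 * remains UC either way.
 */
static inline bool ipat_for_invalid_mfn(const struct domain_sketch *d,
                                        p2m_type_sketch_t type)
{
    return type != P2M_MMIO_DIRECT ||
           (!d->iommu_enabled && !d->cache_flush_permitted);
}
```

Only the direct-MMIO case of a pass-through domain ends up with iPAT clear; every other combination behaves as before the patch.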

>  So by removing it we allow guest chosen types to take
> effect, which are likely less restrictive than UC (whether those are
> correct is another question).

No, guest chosen types still wouldn't come into play, due to what the
switch() further down in the function does for p2m_mmio_direct.

>> One caveat here that I forgot to
>> mention before: MFNs taken out of EPT entries will never be INVALID_MFN, for
>> the truncation that happens when populating entries. In that case we rely on
>> mfn_valid() to be "rejecting" them.
> 
> The only caller where mfns from EPT entries are passed to
> epte_get_entry_emt() is in resolve_misconfig() AFAICT, and in that
> case the EPT entry must be present for epte_get_entry_emt() to be
> called.  So it seems to me that epte_get_entry_emt() can never be
> called from an mfn constructed from an INVALID_MFN EPT entry (but it's
> worth adding an assert for it).

Are you sure? I agree for the first of those two calls, but the second
isn't quite as obvious. There we'd need to first prove that we will
never create non-present super-page entries. Yet if I'm not mistaken
for PoD we may create such.

Jan

Roger Pau Monne June 11, 2024, 4:21 p.m. UTC | #12

On Tue, Jun 11, 2024 at 04:53:22PM +0200, Jan Beulich wrote:
> On 11.06.2024 15:52, Roger Pau Monné wrote:
> > On Tue, Jun 11, 2024 at 01:52:58PM +0200, Jan Beulich wrote:
> >> On 11.06.2024 13:08, Roger Pau Monné wrote:
> >>> I really wonder whether Xen has enough information to figure out
> >>> whether a hole (MMIO region) is supposed to be accessed as UC or
> >>> something else.
> >>
> >> It certainly hasn't, and hence is erring on the (safe) side of forcing
> >> UC.
> > 
> > Except that for the vesa framebuffer at least this is a bad choice :).
> 
> Well, yes, that's where we want WC to be permitted. But for that we only
> need to avoid setting iPAT; we still can uniformly hand back UC. Except
> (as mentioned elsewhere earlier) if the guest uses MTRRs rather than PAT
> to arrange for WC.

If we want to get this into 4.19, we likely want to go your proposed
approach then, as it's less risky.

I think a comment would be helpful to note that the fix here, to not
enforce iPAT but still return UC, is mostly done to accommodate vesa
regions mapped with PAT attributes to use WC.

I would also like to add some kind of note that special casing
!mfn_valid() might not be needed, but that removing it must be done
carefully to not cause regressions.

> >>>  Maybe the mfn_valid() check should be
> >>> inverted, and return WB when the underlying mfn is RAM, and otherwise
> >>> use the guest MTRRs to decide the cache attribute?
> >>
> >> First: Whether WB is correct for RAM isn't known. With some peculiar device
> >> assigned, the guest may want/need part of its RAM be e.g. WC or WT. (It's
> >> only without any physical devices assigned that we can be quite sure that
> >> WB is good for all of RAM.) Therefore, second, I think respecting MTRRs for
> >> RAM is less likely to cause problems than respecting them for MMIO.
> >>
> >> I think at this point the main question is: Do we want to do things at least
> >> along the lines of this v1, or do we instead feel certain enough to switch
> >> the mfn_valid() to a comparison against INVALID_MFN (and perhaps moving it
> >> up to almost the top of the function)?
> > 
> > My preferred option would be the latter, as that would remove a special
> > casing.  However, I'm unsure how much fallout this could cause - those
> > caching changes are always tricky and lead to unexpected fallout.
> 
> Which is the very reason why I tried to avoid going too far with this.
> 
> > OTOH the current !mfn_valid() check is very restrictive, as it forces
> > all MMIO to UC.
> 
> Which is why, in this v1, I'm relaxing only the iPAT part.
> 
> >  So by removing it we allow guest chosen types to take
> > effect, which are likely less restrictive than UC (whether those are
> > correct is another question).
> 
> No, guest chosen types still wouldn't come into play, due to what the
> switch() further down in the function does for p2m_mmio_direct.

Indeed.  That should also be removed if we decide for MMIO cache
attributes to be controlled by guest MTRRs.

> 
> >> One caveat here that I forgot to
> >> mention before: MFNs taken out of EPT entries will never be INVALID_MFN, for
> >> the truncation that happens when populating entries. In that case we rely on
> >> mfn_valid() to be "rejecting" them.
> > 
> > The only caller where mfns from EPT entries are passed to
> > epte_get_entry_emt() is in resolve_misconfig() AFAICT, and in that
> > case the EPT entry must be present for epte_get_entry_emt() to be
> > called.  So it seems to me that epte_get_entry_emt() can never be
> > called from an mfn constructed from an INVALID_MFN EPT entry (but it's
> > worth adding an assert for it).
> 
> Are you sure? I agree for the first of those two calls, but the second
> isn't quite as obvious. There we'd need to first prove that we will
> never create non-present super-page entries. Yet if I'm not mistaken
> for PoD we may create such.

I should go look then, didn't know PoD would do that.

Regards, Roger.

Jan Beulich June 12, 2024, 11:52 a.m. UTC | #13

On 11.06.2024 18:21, Roger Pau Monné wrote:
> On Tue, Jun 11, 2024 at 04:53:22PM +0200, Jan Beulich wrote:
>> On 11.06.2024 15:52, Roger Pau Monné wrote:
>>> On Tue, Jun 11, 2024 at 01:52:58PM +0200, Jan Beulich wrote:
>>>> On 11.06.2024 13:08, Roger Pau Monné wrote:
>>>>> I really wonder whether Xen has enough information to figure out
>>>>> whether a hole (MMIO region) is supposed to be accessed as UC or
>>>>> something else.
>>>>
>>>> It certainly hasn't, and hence is erring on the (safe) side of forcing
>>>> UC.
>>>
>>> Except that for the vesa framebuffer at least this is a bad choice :).
>>
>> Well, yes, that's where we want WC to be permitted. But for that we only
>> need to avoid setting iPAT; we still can uniformly hand back UC. Except
>> (as mentioned elsewhere earlier) if the guest uses MTRRs rather than PAT
>> to arrange for WC.
> 
> If we want to get this into 4.19, we likely want to go your proposed
> approach then, as it's less risky.
> 
> I think a comment would be helpful to note that the fix here, to not
> enforce iPAT but still return UC, is mostly done to accommodate vesa
> regions mapped with PAT attributes to use WC.
> 
> I would also like to add some kind of note that special casing
> !mfn_valid() might not be needed, but that removing it must be done
> carefully to not cause regressions.

Hmm, in the meantime I have sufficiently convinced myself that with a
small (hopefully easy / uncontroversial) change to ept_set_entry() I
can arrange for the guarantee that neither INVALID_MFN nor a
truncated form of it can make it into the function, allowing the check
to be dropped (as you had initially asked for).

>>>> One caveat here that I forgot to
>>>> mention before: MFNs taken out of EPT entries will never be INVALID_MFN, for
>>>> the truncation that happens when populating entries. In that case we rely on
>>>> mfn_valid() to be "rejecting" them.
>>>
>>> The only caller where mfns from EPT entries are passed to
>>> epte_get_entry_emt() is in resolve_misconfig() AFAICT, and in that
>>> case the EPT entry must be present for epte_get_entry_emt() to be
>>> called.  So it seems to me that epte_get_entry_emt() can never be
>>> called from an mfn constructed from an INVALID_MFN EPT entry (but it's
>>> worth adding an assert for it).
>>
>> Are you sure? I agree for the first of those two calls, but the second
>> isn't quite as obvious. There we'd need to first prove that we will
>> never create non-present super-page entries. Yet if I'm not mistaken
>> for PoD we may create such.
> 
> I should go look then, didn't know PoD would do that.

I've meanwhile checked, and indeed we do. With said prereq change I
hope to make that no longer the case.

Jan
