
[v1,2/2] KVM: arm64: allow the VM to select DEVICE_* and NORMAL_NC for IO memory

Message ID 20230907181459.18145-3-ankita@nvidia.com (mailing list archive)
State New, archived
Series KVM: arm64: support write combining and cachable IO memory in VMs

Commit Message

Ankit Agrawal Sept. 7, 2023, 6:14 p.m. UTC
From: Ankit Agrawal <ankita@nvidia.com>

Linux allows device drivers to map IO memory on a per-page basis using
"write combining" or WC. This is often done using
pgprot_writecombing(). The driver knows which pages can support WC
access and the proper programming model to generate this IO. Generally
the use case is to boost performance by using write combining to
generate larger PCIe MemWr TLPs.

Allow VMs to select DEVICE_* or NORMAL_NC on a page by page basis for
all IO memory. This puts the VM in charge of the memory attributes,
and removes the KVM override to DEVICE_nGnRE.

Ultimately this makes pgprot_writecombing() work correctly in VMs and
allows drivers like mlx5 to fully operate their HW.

After some discussions with ARM and CPU architects we reached the
conclusion there was no need for KVM to prevent the VM from selecting
between DEVICE_* and NORMAL_NC for IO memory in VMs. There was a fear
that NORMAL_NC could result in uncontained failures, but upon deeper
analysis it turns out there are already possible cases for uncontained
failures with DEVICE types too. Ultimately the platform must be
implemented in a way that ensures that all DEVICE_* and NORMAL_NC
accesses have no uncontained failures.

Fortunately real platforms do tend to implement this.

This patch makes the VM's memory attributes behave as follows:

 S1           |   S2          |  Result
 NORMAL-WB    |  NORMAL-NC    |  NORMAL-NC
 NORMAL-WT    |  NORMAL-NC    |  NORMAL-NC
 NORMAL-NC    |  NORMAL-NC    |  NORMAL-NC
 DEVICE<attr> |  NORMAL-NC    |  DEVICE<attr>

See section D8.5.5 of DDI0487_I_a_a-profile_architecture_reference_manual.pdf
for details.

Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
---
 arch/arm64/include/asm/memory.h | 2 ++
 arch/arm64/kvm/hyp/pgtable.c    | 2 +-
 2 files changed, 3 insertions(+), 1 deletion(-)
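
For illustration only (not part of the patch): the guest-driver side that the
commit message refers to is a PCI driver asking for a write-combined mapping of
a BAR. The helper below is a hedged sketch with a made-up name; ioremap_wc() is
the standard interface and results in a Normal-NC stage-1 mapping on arm64.

    #include <linux/pci.h>
    #include <linux/io.h>

    /* Map a prefetchable BAR write-combined (Normal-NC at stage 1 on arm64). */
    static void __iomem *example_map_bar_wc(struct pci_dev *pdev, int bar)
    {
            resource_size_t start = pci_resource_start(pdev, bar);
            resource_size_t len = pci_resource_len(pdev, bar);

            /* WC only makes sense for prefetchable, side-effect-free BARs. */
            if (!(pci_resource_flags(pdev, bar) & IORESOURCE_PREFETCH))
                    return NULL;

            return ioremap_wc(start, len);
    }

With the current KVM behaviour, the Normal-NC stage-1 attribute above is
overridden to DEVICE_nGnRE by stage 2 inside a VM; with this patch the combined
result stays Normal-NC.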

Comments

Catalin Marinas Sept. 8, 2023, 4:40 p.m. UTC | #1
+ Lorenzo

On Thu, Sep 07, 2023 at 11:14:59AM -0700, ankita@nvidia.com wrote:
> From: Ankit Agrawal <ankita@nvidia.com>
> 
> Linux allows device drivers to map IO memory on a per-page basis using
> "write combining" or WC. This is often done using
> pgprot_writecombing(). The driver knows which pages can support WC
> access and the proper programming model to generate this IO. Generally
> the use case is to boost performance by using write combining to
> generate larger PCIe MemWr TLPs.
> 
> Allow VMs to select DEVICE_* or NORMAL_NC on a page by page basis for
> all IO memory. This puts the VM in charge of the memory attributes,
> and removes the KVM override to DEVICE_nGnRE.
> 
> Ultimately this makes pgprot_writecombing() work correctly in VMs and
> allows drivers like mlx5 to fully operate their HW.
> 
> After some discussions with ARM and CPU architects we reached the
> conclusion there was no need for KVM to prevent the VM from selecting
> between DEVICE_* and NORMAL_NC for IO memory in VMs. There was a fear
> that NORMAL_NC could result in uncontained failures, but upon deeper
> analysis it turns out there are already possible cases for uncontained
> failures with DEVICE types too. Ultimately the platform must be
> implemented in a way that ensures that all DEVICE_* and NORMAL_NC
> accesses have no uncontained failures.
> 
> Fortunately real platforms do tend to implement this.
> 
> This patch makes the VM's memory attributes behave as follows:
> 
>  S1           |   S2          |  Result
>  NORMAL-WB    |  NORMAL-NC    |  NORMAL-NC
>  NORMAL-WT    |  NORMAL-NC    |  NORMAL-NC
>  NORMAL-NC    |  NORMAL-NC    |  NORMAL-NC
>  DEVICE<attr> |  NORMAL-NC    |  DEVICE<attr>
> 
> See section D8.5.5 of DDI0487_I_a_a-profile_architecture_reference_manual.pdf
> for details.
> 
> Signed-off-by: Ankit Agrawal <ankita@nvidia.com>

From the discussions with the hardware people in Arm and Nvidia, we
indeed concluded that relaxing S2 to Normal-NC is not any worse than
Device (and actually Device memory is more prone to generating
uncontained errors, something to do with the Device memory ordering
semantics).

For this change:

Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Lorenzo Pieralisi Sept. 11, 2023, 2:57 p.m. UTC | #2
On Thu, Sep 07, 2023 at 11:14:59AM -0700, ankita@nvidia.com wrote:
> From: Ankit Agrawal <ankita@nvidia.com>
> 
> Linux allows device drivers to map IO memory on a per-page basis using
> "write combining" or WC. This is often done using
> pgprot_writecombing(). The driver knows which pages can support WC

pgprot_writecombine() ?

> access and the proper programming model to generate this IO. Generally
> the use case is to boost performance by using write combining to
> generate larger PCIe MemWr TLPs.

First off, this changeset does not affect *only* Linux guests, obviously.

I understand that's the use case you are after but this change is
targeting all VMs, it must be clear.

Then WC and mapping to PCI TLPs, either you describe that in details
(NormalNC vs device-nGnRE and resulting SystemBus<->PCI transactions) or
you don't describe it at all, as it stands I don't know how to use
this information.

> Allow VMs to select DEVICE_* or NORMAL_NC on a page by page basis for
> all IO memory. This puts the VM in charge of the memory attributes,
> and removes the KVM override to DEVICE_nGnRE.
> 
> Ultimately this makes pgprot_writecombing() work correctly in VMs and

pgprot_writecombine() ?

> allows drivers like mlx5 to fully operate their HW.
> 
> After some discussions with ARM and CPU architects we reached the
> conclusion there was no need for KVM to prevent the VM from selecting
> between DEVICE_* and NORMAL_NC for IO memory in VMs. There was a fear
> that NORMAL_NC could result in uncontained failures, but upon deeper
> analysis it turns out there are already possible cases for uncontained
> failures with DEVICE types too. Ultimately the platform must be
> implemented in a way that ensures that all DEVICE_* and NORMAL_NC
> accesses have no uncontained failures.

I would reorder/rephrase this changelog as follows:

- Describe what the problem is (ie KVM default s2 mappings)
- Describe how you are solving it
- Add a link to the documentation that states why it is safe to do
  that and the resulting s1/s2 mappings combination

It must be clear why from a legacy standpoint this is a safe change
to apply.

> Fortunately real platforms do tend to implement this.

Remove this sentence, it adds no information for someone who
is chasing bugs or just wants to understand the change itself.

> This patch makes the VM's memory attributes behave as follows:

"The resulting stage 1 and stage 2 memory types and cacheability
attributes are as follows":

>  S1           |   S2          |  Result
>  NORMAL-WB    |  NORMAL-NC    |  NORMAL-NC
>  NORMAL-WT    |  NORMAL-NC    |  NORMAL-NC
>  NORMAL-NC    |  NORMAL-NC    |  NORMAL-NC
>  DEVICE<attr> |  NORMAL-NC    |  DEVICE<attr>
> 
> See section D8.5.5 of DDI0487_I_a_a-profile_architecture_reference_manual.pdf
> for details.

D8.5 I assume.

I commented on the log because it must be easy to follow, it is an
important change (and there is a lot of background information
required to understand it - please report it).

The patch itself must be reviewed by arm64/kvm maintainers; the changes
(given the background work that led to them) seem fine to me (please CC
me on next versions).

Thanks
Lorenzo

> Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
> ---
>  arch/arm64/include/asm/memory.h | 2 ++
>  arch/arm64/kvm/hyp/pgtable.c    | 2 +-
>  2 files changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/arm64/include/asm/memory.h b/arch/arm64/include/asm/memory.h
> index fde4186cc387..c247e5f29d5a 100644
> --- a/arch/arm64/include/asm/memory.h
> +++ b/arch/arm64/include/asm/memory.h
> @@ -147,6 +147,7 @@
>   * Memory types for Stage-2 translation
>   */
>  #define MT_S2_NORMAL		0xf
> +#define MT_S2_NORMAL_NC	0x5
>  #define MT_S2_DEVICE_nGnRE	0x1
>  
>  /*
> @@ -154,6 +155,7 @@
>   * Stage-2 enforces Normal-WB and Device-nGnRE
>   */
>  #define MT_S2_FWB_NORMAL	6
> +#define MT_S2_FWB_NORMAL_NC	5
>  #define MT_S2_FWB_DEVICE_nGnRE	1
>  
>  #ifdef CONFIG_ARM64_4K_PAGES
> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> index ccd291b6893d..a80949002191 100644
> --- a/arch/arm64/kvm/hyp/pgtable.c
> +++ b/arch/arm64/kvm/hyp/pgtable.c
> @@ -696,7 +696,7 @@ static int stage2_set_prot_attr(struct kvm_pgtable *pgt, enum kvm_pgtable_prot p
>  				kvm_pte_t *ptep)
>  {
>  	bool device = prot & KVM_PGTABLE_PROT_DEVICE;
> -	kvm_pte_t attr = device ? KVM_S2_MEMATTR(pgt, DEVICE_nGnRE) :
> +	kvm_pte_t attr = device ? KVM_S2_MEMATTR(pgt, NORMAL_NC) :
>  			    KVM_S2_MEMATTR(pgt, NORMAL);
>  	u32 sh = KVM_PTE_LEAF_ATTR_LO_S2_SH_IS;
>  
> -- 
> 2.17.1
>
Jason Gunthorpe Sept. 11, 2023, 5:20 p.m. UTC | #3
On Mon, Sep 11, 2023 at 04:57:51PM +0200, Lorenzo Pieralisi wrote:
> On Thu, Sep 07, 2023 at 11:14:59AM -0700, ankita@nvidia.com wrote:
> > From: Ankit Agrawal <ankita@nvidia.com>
> > 
> > Linux allows device drivers to map IO memory on a per-page basis using
> > "write combining" or WC. This is often done using
> > pgprot_writecombing(). The driver knows which pages can support WC
> 
> pgprot_writecombine() ?
> 
> > access and the proper programming model to generate this IO. Generally
> > the use case is to boost performance by using write combining to
> > generate larger PCIe MemWr TLPs.
> 
> First off, this changeset does not affect *only* Linux guests, obviously.

I think everyone understands that. It can be clarified.

> I understand that's the use case you are after but this change is
> targeting all VMs, it must be clear.
> 
> Then WC and mapping to PCI TLPs, either you describe that in details
> (NormalNC vs device-nGnRE and resulting SystemBus<->PCI transactions) or
> you don't describe it at all, as it stands I don't know how to use
> this information.

How about another paragraph:

 KVM prevents all VMs (including Linux) from accessing NORMAL_NC
 mappings, which is how Linux implements pgprot_writecombine(). This
 prevents using this performance optimization within VMs.

I don't think we need to go into details how it works beyond that it
requires NORMAL_NC.
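
(For reference, and only as an approximate paraphrase rather than a verbatim
copy of the kernel header: on arm64, pgprot_writecombine() boils down to
selecting the Normal-NC attribute index for the stage-1 PTE, roughly as below,
which is exactly the attribute the stage-2 override currently defeats.)

    /* Approximate shape of the arm64 stage-1 definition: "write combine"
     * means the Normal-NC memory attribute index in the PTE.
     */
    #define pgprot_writecombine(prot) \
            __pgprot_modify(prot, PTE_ATTRINDX_MASK, \
                            PTE_ATTRINDX(MT_NORMAL_NC) | PTE_PXN | PTE_UXN)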

> > Allow VMs to select DEVICE_* or NORMAL_NC on a page by page basis for
> > all IO memory. This puts the VM in charge of the memory attributes,
> > and removes the KVM override to DEVICE_nGnRE.
> > 
> > Ultimately this makes pgprot_writecombing() work correctly in VMs and
> 
> pgprot_writecombine() ?
> 
> > allows drivers like mlx5 to fully operate their HW.
> > 
> > After some discussions with ARM and CPU architects we reached the
> > conclusion there was no need for KVM to prevent the VM from selecting
> > between DEVICE_* and NORMAL_NC for IO memory in VMs. There was a fear
> > that NORMAL_NC could result in uncontained failures, but upon deeper
> > analysis it turns out there are already possible cases for uncontained
> > failures with DEVICE types too. Ultimately the platform must be
> > implemented in a way that ensures that all DEVICE_* and NORMAL_NC
> > accesses have no uncontained failures.
> 
> I would reorder/rephrase this changelog as follows:
> 
> - Describe what the problem is (ie KVM default s2 mappings)

The problem is that pgprot_writecombine() doesn't work in Linux
VMs. That is the first pagraph.

> - Describe how you are solving it

That is the middle paragraph "Allow VMs to select DEVICE_* or
NORMAL_NC on a page by page basis"

> - Add a link to the documentation that states why it is safe to do
>   that and the resulting s1/s2 mappings combination

AFAIK there is no documentation beyond the combining rules. Exactly
what should happen in various error conditions is implementation
defined. Catalin did you ever find anything?

> It must be clear why from a legacy standpoint this is a safe change
> to apply.

This is why:
 
> > Fortunately real platforms do tend to implement this.

It is why it is safe today, because real platforms don't throw
uncontained errors from typical PCI accesses that VFIO allows. I think
the conclusion was that this is just because they don't do
errors at all, not because DEVICE_* prevents it.

> Remove this sentence, it adds no information for someone who
> is chasing bugs or just wants to understand the change itself.

So, if you hit a bug here you might evaluate if there is something
wrong with your platform, ie it is allowing uncontained errors in
unexpected places.

Jason
Lorenzo Pieralisi Sept. 13, 2023, 3:26 p.m. UTC | #4
On Mon, Sep 11, 2023 at 02:20:01PM -0300, Jason Gunthorpe wrote:
> On Mon, Sep 11, 2023 at 04:57:51PM +0200, Lorenzo Pieralisi wrote:
> > On Thu, Sep 07, 2023 at 11:14:59AM -0700, ankita@nvidia.com wrote:
> > > From: Ankit Agrawal <ankita@nvidia.com>
> > > 
> > > Linux allows device drivers to map IO memory on a per-page basis using
> > > "write combining" or WC. This is often done using
> > > pgprot_writecombing(). The driver knows which pages can support WC
> > 
> > pgprot_writecombine() ?
> > 
> > > access and the proper programming model to generate this IO. Generally
> > > the use case is to boost performance by using write combining to
> > > generate larger PCIe MemWr TLPs.
> > 
> > First off, this changeset does not affect *only* Linux guests, obviously.
> 
> I think everyone understands that. It can be clarified.

OK, this is not a Linux guest fix *only*, everyone understands that
but I would rather make sure that's crystal clear.

> > I understand that's the use case you are after but this change is
> > targeting all VMs, it must be clear.
> > 
> > Then WC and mapping to PCI TLPs, either you describe that in details
> > (NormalNC vs device-nGnRE and resulting SystemBus<->PCI transactions) or
> > you don't describe it at all, as it stands I don't know how to use
> > this information.
> 
> How about another paragraph:
> 
>  KVM prevents all VMs (including Linux) from accessing NORMAL_NC
>  mappings, which is how Linux implements pgprot_writecombine(). This
>  prevents using this performance optimization within VMs.
> 
> I don't think we need to go into details how it works beyond that it
> requires NORMAL_NC.

I think it is worth summarizing the discussion that led to this change,
I can write something up.

> > > Allow VMs to select DEVICE_* or NORMAL_NC on a page by page basis for
> > > all IO memory. This puts the VM in charge of the memory attributes,
> > > and removes the KVM override to DEVICE_nGnRE.
> > > 
> > > Ultimately this makes pgprot_writecombing() work correctly in VMs and
> > 
> > pgprot_writecombine() ?
> > 
> > > allows drivers like mlx5 to fully operate their HW.
> > > 
> > > After some discussions with ARM and CPU architects we reached the
> > > conclusion there was no need for KVM to prevent the VM from selecting
> > > between DEVICE_* and NORMAL_NC for IO memory in VMs. There was a fear
> > > that NORMAL_NC could result in uncontained failures, but upon deeper
> > > analysis it turns out there are already possible cases for uncontained
> > > failures with DEVICE types too. Ultimately the platform must be
> > > implemented in a way that ensures that all DEVICE_* and NORMAL_NC
> > > accesses have no uncontained failures.
> > 
> > I would reorder/rephrase this changelog as follows:
> > 
> > - Describe what the problem is (ie KVM default s2 mappings)
> 
> The problem is that pgprot_writecombine() doesn't work in Linux
> VMs. That is the first pagraph.

s/pagraph/paragraph

Well to me that's how the problem was spotted but OK, I can rewrite
the log and post it here, NP.

> > - Describe how you are solving it
> 
> That is the middle paragraph "Allow VMs to select DEVICE_* or
> NORMAL_NC on a page by page basis"
> 
> > - Add a link to the documentation that states why it is safe to do
> >   that and the resulting s1/s2 mappings combination
> 
> AFAIK there is no documentation beyond the combining rules. Exactly
> what should happen in various error conditions is implementation
> defined. Catalin did you ever find anything?
> 
> > It must be clear why from a legacy standpoint this is a safe change
> > to apply.
> 
> This is why:
>  
> > > Fortunately real platforms do tend to implement this.

Is it possible please to describe what "this" is in details ?

What does "real platforms" mean ?

Let's describe what HW/SW should be implemented to make this
safe.

> It is why it is safe today, because real platforms don't throw

"real platforms", I have a problem with that, see above.

> uncontained errors from typical PCI accesses that VFIO allows. I think
> the conclusion was that this is just because they don't do
> errors at all, not because DEVICE_* prevents it.
> 
> > Remove this sentence, it adds no information for someone who
> > is chasing bugs or just wants to understand the change itself.
> 
> So, if you hit a bug here you might evaluate if there is something
> wrong with your platform, ie it is allowing uncontained errors in
> unexpected places.

Developers looking at the commit log in the future must be able to
understand why it was a sound change to make. As it stands IMO
the commit log does not explain it fully.

I can write up the commit log and post it if I manage to summarize
it any better - more important the review on the code (that was already
provided), I will try to write something up asap.

Thanks,
Lorenzo
Jason Gunthorpe Sept. 13, 2023, 6:54 p.m. UTC | #5
On Wed, Sep 13, 2023 at 05:26:01PM +0200, Lorenzo Pieralisi wrote:
> > > I understand that's the use case you are after but this change is
> > > targeting all VMs, it must be clear.
> > > 
> > > Then WC and mapping to PCI TLPs, either you describe that in details
> > > (NormalNC vs device-nGnRE and resulting SystemBus<->PCI transactions) or
> > > you don't describe it at all, as it stands I don't know how to use
> > > this information.
> > 
> > How about another paragraph:
> > 
> >  KVM prevents all VMs (including Linux) from accessing NORMAL_NC
> >  mappings, which is how Linux implements pgprot_writecombine(). This
> >  prevents using this performance optimization within VMs.
> > 
> > I don't think we need to go into details how it works beyond that it
> > requires NORMAL_NC.
> 
> I think it is worth summarizing the discussion that led to this change,
> I can write something up.

We tried here. In short, the reason we all understood it was like this
today is because there was a belief that DEVICE_nGnRE allowed
fewer (or no) uncontained failures than NORMAL_NC.

AFAICT it turns out the specs are unclear and actual real-world IP
from ARM does not behave this way. We learned that at best they are
equally unsafe, and perhaps NORMAL_NC is even a bit safer.

What makes it safe in real chips is neither the (unclear) specification
nor the behavior of the ARM IP with respect to uncontained failures, but
the fact that the combination of all the IP in the platform never
triggers an uncontained error anyhow.

For instance PCIe integration IP does not transform PCIe TLP failures
into uncontained errors. They generate all ones data on the bus
instead.

There is no spec that says this should happen, it is just what people
do - c&p from Intel.

On top of that, we see that server-focused designs built to run VMs
have put in the RAS work to ensure uncontained errors can't exist and
truly plug this hole.

Thus, at the end of the day, the safety of KVM and VFIO is effectively
an unknowable platform-specific property no matter what choice we make
in SW here. So let's make the choice that enables the broadest VM
functionality.

> > > > Fortunately real platforms do tend to implement this.
> 
> Is it possible please to describe what "this" is in details ?

"prevent uncontained errors for DEVICE_* and NORMAL_NC access"
 
> What does "real platforms" mean ?

Actual real world deployments of VFIO and KVM in production that
expect to be safe from uncontained errors.

> Let's describe what HW/SW should be implemented to make this
> safe.

The platform must not generate uncontained errors :)

> I can write up the commit log and post it if I manage to summarize
> it any better - more important the review on the code (that was already
> provided), I will try to write something up asap.

Thank you!

Jason
Lorenzo Pieralisi Sept. 26, 2023, 8:31 a.m. UTC | #6
On Wed, Sep 13, 2023 at 03:54:54PM -0300, Jason Gunthorpe wrote:
> On Wed, Sep 13, 2023 at 05:26:01PM +0200, Lorenzo Pieralisi wrote:

[...]

> > I can write up the commit log and post it if I manage to summarize
> > it any better - more important the review on the code (that was already
> > provided), I will try to write something up asap.
> 
> Thank you!
> 
> Jason

FWIW, I have come up with the commit log below - please review and
scrutinize/change it as deemed fit - it is not necessarily clearer
than this one and it definitely requires MarcZ/Catalin/Will attention
before it can be considered:

---
Currently, KVM for ARM64 maps at stage 2 memory that is
considered device (ie using pfn_is_map_memory() to discern
between device memory and memory itself) with DEVICE_nGnRE
memory attributes; this setting overrides (as per the ARM
architecture [1]) any device MMIO mapping present at stage
1, resulting in a set-up whereby a guest operating system
can't determine device MMIO mapping memory attributes on its
own but it is always overridden by the KVM stage 2 default.

This set-up does not allow guest operating systems to map
device memory on a page by page basis with combined attributes
other than DEVICE_nGnRE, which turns out to be an issue in that
guest operating systems (eg Linux) may request to map
devices MMIO regions with memory attributes that guarantee
better performance (eg gathering attribute - that for some
devices can generate larger PCIe memory writes TLPs)
and specific operations (eg unaligned transactions) such as
the NormalNC memory type.

The default device stage 2 mapping was chosen in KVM
for ARM64 since it was considered safer (ie it would
not allow guests to trigger uncontained failures
ultimately crashing the machine) but this turned out
to be imprecise.

Failures containability is a property of the platform
and is independent from the memory type used for MMIO
device memory mappings (ie DEVICE_nGnRE memory type is
even more problematic than NormalNC in terms of containability
since eg aborts triggered on loads cannot be made synchronous,
which make them harder to contain); this means that,
regardless of the combined stage1+stage2 mappings a
platform is safe if and only if device transactions cannot trigger
uncontained failures; reworded, the default KVM device
stage 2 memory attributes play no role in making device
assignment safer for a given platform and therefore can
be relaxed.

For all these reasons, relax the KVM stage 2 device
memory attributes from DEVICE_nGnRE to NormalNC.

This puts guests in control (thanks to stage1+stage2
combined memory attributes rules [1]) of device MMIO
regions memory mappings, according to the rules
described in [1] and summarized here ([(S1) = Stage1][(S2) = Stage2]):

 S1           |   S2          |  Result
 NORMAL-WB    |  NORMAL-NC    |  NORMAL-NC
 NORMAL-WT    |  NORMAL-NC    |  NORMAL-NC
 NORMAL-NC    |  NORMAL-NC    |  NORMAL-NC
 DEVICE<attr> |  NORMAL-NC    |  DEVICE<attr>

[1] section D8.5 - DDI0487_I_a_a-profile_architecture_reference_manual.pdf
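
(As a hedged sketch for context: the "considered device" test mentioned above
lives in KVM's stage-2 fault path; condensed from arch/arm64/kvm/mmu.c of this
era, with surrounding logic omitted, it looks roughly like this.)

    /* A pfn not covered by the kernel's direct map is treated as device. */
    static bool kvm_is_device_pfn(unsigned long pfn)
    {
            return !pfn_is_map_memory(pfn);
    }

    /* ... and in user_mem_abort() that decision feeds the stage-2 attributes: */
            if (kvm_is_device_pfn(pfn))
                    device = true;
            /* ... */
            if (device)
                    prot |= KVM_PGTABLE_PROT_DEVICE; /* DEVICE_nGnRE today,
                                                      * Normal-NC with this patch */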
Jason Gunthorpe Sept. 26, 2023, 12:25 p.m. UTC | #7
On Tue, Sep 26, 2023 at 10:31:38AM +0200, Lorenzo Pieralisi wrote:
> On Wed, Sep 13, 2023 at 03:54:54PM -0300, Jason Gunthorpe wrote:
> > On Wed, Sep 13, 2023 at 05:26:01PM +0200, Lorenzo Pieralisi wrote:
> 
> [...]
> 
> > > I can write up the commit log and post it if I manage to summarize
> > > it any better - more important the review on the code (that was already
> > > provided), I will try to write something up asap.
> > 
> > Thank you!
> > 
> > Jason
> 
> FWIW, I have come up with the commit log below - please review and
> scrutinize/change it as deemed fit - it is not necessarily clearer
> than this one and it definitely requires MarcZ/Catalin/Will attention
> before it can be considered:

I have no issue with this language.

Thanks,
Jason
Catalin Marinas Sept. 26, 2023, 1:52 p.m. UTC | #8
On Tue, Sep 26, 2023 at 10:31:38AM +0200, Lorenzo Pieralisi wrote:
> Currently, KVM for ARM64 maps at stage 2 memory that is
> considered device (ie using pfn_is_map_memory() to discern
> between device memory and memory itself) with DEVICE_nGnRE
> memory attributes; this setting overrides (as per the ARM
> architecture [1]) any device MMIO mapping present at stage
> 1, resulting in a set-up whereby a guest operating system
> can't determine device MMIO mapping memory attributes on its
> own but it is always overridden by the KVM stage 2 default.
> 
> This set-up does not allow guest operating systems to map
> device memory on a page by page basis with combined attributes
> other than DEVICE_nGnRE,

Well, it also has the option of DEVICE_nGnRnE ;).

> which turns out to be an issue in that
> guest operating systems (eg Linux) may request to map
> devices MMIO regions with memory attributes that guarantee
> better performance (eg gathering attribute - that for some
> devices can generate larger PCIe memory writes TLPs)
> and specific operations (eg unaligned transactions) such as
> the NormalNC memory type.
> 
> The default device stage 2 mapping was chosen in KVM
> for ARM64 since it was considered safer (ie it would
> not allow guests to trigger uncontained failures
> ultimately crashing the machine) but this turned out
> to be imprecise.
> 
> Failures containability is a property of the platform
> and is independent from the memory type used for MMIO
> device memory mappings (ie DEVICE_nGnRE memory type is
> even more problematic than NormalNC in terms of containability
> since eg aborts triggered on loads cannot be made synchronous,
> which make them harder to contain); this means that,
> regardless of the combined stage1+stage2 mappings a
> platform is safe if and only if device transactions cannot trigger
> uncontained failures; reworded, the default KVM device
> stage 2 memory attributes play no role in making device
> assignment safer for a given platform and therefore can
> be relaxed.
> 
> For all these reasons, relax the KVM stage 2 device
> memory attributes from DEVICE_nGnRE to NormalNC.
> 
> This puts guests in control (thanks to stage1+stage2
> combined memory attributes rules [1]) of device MMIO
> regions memory mappings, according to the rules
> described in [1] and summarized here ([(S1) = Stage1][(S2) = Stage2]):
> 
>  S1           |   S2          |  Result
>  NORMAL-WB    |  NORMAL-NC    |  NORMAL-NC
>  NORMAL-WT    |  NORMAL-NC    |  NORMAL-NC
>  NORMAL-NC    |  NORMAL-NC    |  NORMAL-NC
>  DEVICE<attr> |  NORMAL-NC    |  DEVICE<attr>

Not sure what's wrong with my font setup as I can't see the above table
but I know it from the Arm ARM.

Anyway, the text looks fine to me. Thanks for putting it together
Lorenzo.

One thing not mentioned here is that pci-vfio still maps such memory as
Device-nGnRnE in user space and relaxing this potentially creates an
alias. But such alias is only relevant of both the VMM and the VM try to
access the same device which I doubt is a realistic scenario.
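
(For context, a hedged sketch of the user-space mapping path referred to above,
condensed from the vfio-pci mmap handling; the helper name is made up for the
example.)

    #include <linux/mm.h>

    /* The VMM's own BAR mapping is created Device-nGnRnE via
     * pgprot_noncached(), independently of what KVM installs at stage 2,
     * which is where the potential attribute alias comes from.
     */
    static int example_vfio_bar_mmap(struct vm_area_struct *vma,
                                     unsigned long pfn, unsigned long size)
    {
            vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);

            return remap_pfn_range(vma, vma->vm_start, pfn, size,
                                   vma->vm_page_prot);
    }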
Lorenzo Pieralisi Sept. 26, 2023, 4:12 p.m. UTC | #9
On Tue, Sep 26, 2023 at 02:52:13PM +0100, Catalin Marinas wrote:
> On Tue, Sep 26, 2023 at 10:31:38AM +0200, Lorenzo Pieralisi wrote:
> > Currently, KVM for ARM64 maps at stage 2 memory that is
> > considered device (ie using pfn_is_map_memory() to discern
> > between device memory and memory itself) with DEVICE_nGnRE
> > memory attributes; this setting overrides (as per the ARM
> > architecture [1]) any device MMIO mapping present at stage
> > 1, resulting in a set-up whereby a guest operating system
> > can't determine device MMIO mapping memory attributes on its
> > own but it is always overridden by the KVM stage 2 default.
> > 
> > This set-up does not allow guest operating systems to map
> > device memory on a page by page basis with combined attributes
> > other than DEVICE_nGnRE,
> 
> Well, it also has the option of DEVICE_nGnRnE ;).

That's true - we have to fix it up.

"This set-up does not allow guest operating systems to choose
device memory attributes on a page by page basis independently
from KVM stage 2 mappings,..."

> > which turns out to be an issue in that
> > guest operating systems (eg Linux) may request to map
> > devices MMIO regions with memory attributes that guarantee
> > better performance (eg gathering attribute - that for some
> > devices can generate larger PCIe memory writes TLPs)
> > and specific operations (eg unaligned transactions) such as
> > the NormalNC memory type.
> > 
> > The default device stage 2 mapping was chosen in KVM
> > for ARM64 since it was considered safer (ie it would
> > not allow guests to trigger uncontained failures
> > ultimately crashing the machine) but this turned out
> > to be imprecise.
> > 
> > Failures containability is a property of the platform
> > and is independent from the memory type used for MMIO
> > device memory mappings (ie DEVICE_nGnRE memory type is
> > even more problematic than NormalNC in terms of containability
> > since eg aborts triggered on loads cannot be made synchronous,
> > which make them harder to contain); this means that,
> > regardless of the combined stage1+stage2 mappings a
> > platform is safe if and only if device transactions cannot trigger
> > uncontained failures; reworded, the default KVM device
> > stage 2 memory attributes play no role in making device
> > assignment safer for a given platform and therefore can
> > be relaxed.
> > 
> > For all these reasons, relax the KVM stage 2 device
> > memory attributes from DEVICE_nGnRE to NormalNC.
> > 
> > This puts guests in control (thanks to stage1+stage2
> > combined memory attributes rules [1]) of device MMIO
> > regions memory mappings, according to the rules
> > described in [1] and summarized here ([(S1) = Stage1][(S2) = Stage2]):
> > 
> >  S1           |   S2          |  Result
> >  NORMAL-WB    |  NORMAL-NC    |  NORMAL-NC
> >  NORMAL-WT    |  NORMAL-NC    |  NORMAL-NC
> >  NORMAL-NC    |  NORMAL-NC    |  NORMAL-NC
> >  DEVICE<attr> |  NORMAL-NC    |  DEVICE<attr>
> 
> Not sure what's wrong with my font setup as I can't see the above table
> but I know it from the Arm ARM.
> 
> Anyway, the text looks fine to me. Thanks for putting it together
> Lorenzo.
> 
> One thing not mentioned here is that pci-vfio still maps such memory as
> Device-nGnRnE in user space and relaxing this potentially creates an
> alias. But such alias is only relevant of both the VMM and the VM try to
> access the same device which I doubt is a realistic scenario.

I can update the log and repost, fixing the comment above as well (need
to check what's wrong with the table characters).

Thanks,
Lorenzo
Lorenzo Pieralisi Oct. 5, 2023, 9:56 a.m. UTC | #10
On Tue, Sep 26, 2023 at 02:52:13PM +0100, Catalin Marinas wrote:

[...]

> Anyway, the text looks fine to me. Thanks for putting it together
> Lorenzo.

Thanks !

> One thing not mentioned here is that pci-vfio still maps such memory as
> Device-nGnRnE in user space and relaxing this potentially creates an
> alias. But such alias is only relevant of both the VMM and the VM try to
> access the same device which I doubt is a realistic scenario.

A revised log, FWIW:

---
Currently, KVM for ARM64 maps at stage 2 memory that is
considered device (ie it is not RAM) with DEVICE_nGnRE
memory attributes; this setting overrides (as per the ARM
architecture [1]) any device MMIO mapping present at stage
1, resulting in a set-up whereby a guest operating system
can't determine device MMIO mapping memory attributes on its
own but it is always overridden by the KVM stage 2 default.

This set-up does not allow guest operating systems to select
device memory attributes on a page by page basis independently
from KVM stage-2 mappings (refer to [1], "Combining stage 1 and stage
2 memory type attributes"), which turns out to be an issue in that
guest operating systems (eg Linux) may request to map
devices MMIO regions with memory attributes that guarantee
better performance (eg gathering attribute - that for some
devices can generate larger PCIe memory writes TLPs)
and specific operations (eg unaligned transactions) such as
the NormalNC memory type.

The default device stage 2 mapping was chosen in KVM
for ARM64 since it was considered safer (ie it would
not allow guests to trigger uncontained failures
ultimately crashing the machine) but this turned out
to be imprecise.

Failures containability is a property of the platform
and is independent from the memory type used for MMIO
device memory mappings (ie DEVICE_nGnRE memory type is
even more problematic than NormalNC in terms of containability
since eg aborts triggered on loads cannot be made synchronous,
which make them harder to contain); this means that,
regardless of the combined stage1+stage2 mappings a
platform is safe if and only if device transactions cannot trigger
uncontained failures; reworded, the default KVM device
stage 2 memory attributes play no role in making device
assignment safer for a given platform and therefore can
be relaxed.

For all these reasons, relax the KVM stage 2 device
memory attributes from DEVICE_nGnRE to NormalNC.

This puts guests in control (thanks to stage1+stage2
combined memory attributes rules [1]) of device MMIO
regions memory mappings, according to the rules
described in [1] and summarized here ([(S1) - stage1],
[(S2) - stage 2]):

S1	     |  S2	     | Result
NORMAL-WB    |  NORMAL-NC    | NORMAL-NC
NORMAL-WT    |  NORMAL-NC    | NORMAL-NC
NORMAL-NC    |  NORMAL-NC    | NORMAL-NC
DEVICE<attr> |  NORMAL-NC    | DEVICE<attr>

It is worth noting that currently, to map device MMIO space to user
space in a device pass-through use case, the VFIO framework applies memory
attributes derived from pgprot_noncached() settings applied to VMAs, which
result in device-nGnRnE memory attributes for the stage-1 VMM mappings.

This means that a userspace mapping for device MMIO space carried
out with the current VFIO framework and a guest OS mapping for the same
MMIO space may result in a mismatched alias as described in [2].

Defaulting KVM device stage-2 mappings to Normal-NC attributes does not change
anything in this respect, in that the mismatched aliases would only affect
(refer to [2] for a detailed explanation) the ordering between the userspace
and GuestOS mappings' resulting streams of transactions (ie it does not cause
a loss of any property for either stream of transactions on its own), which is
harmless given that the userspace and GuestOS accesses to the device are
carried out through independent transaction streams.

[1] section D8.5 - DDI0487_I_a_a-profile_architecture_reference_manual.pdf
[2] section B2.8 - DDI0487_I_a_a-profile_architecture_reference_manual.pdf
Jason Gunthorpe Oct. 5, 2023, 11:56 a.m. UTC | #11
On Thu, Oct 05, 2023 at 11:56:55AM +0200, Lorenzo Pieralisi wrote:
> On Tue, Sep 26, 2023 at 02:52:13PM +0100, Catalin Marinas wrote:
> 
> [...]
> 
> > Anyway, the text looks fine to me. Thanks for putting it together
> > Lorenzo.
> 
> Thanks !
> 
> > One thing not mentioned here is that pci-vfio still maps such memory as
> > Device-nGnRnE in user space and relaxing this potentially creates an
> > alias. But such alias is only relevant of both the VMM and the VM try to
> > access the same device which I doubt is a realistic scenario.
> 
> A revised log, FWIW:

What is the plan here, do you want Ankit to resend the series with
this text?

There were no comments on patch 1/2?

Thanks,
Jason
Lorenzo Pieralisi Oct. 5, 2023, 2:08 p.m. UTC | #12
On Thu, Oct 05, 2023 at 08:56:30AM -0300, Jason Gunthorpe wrote:
> On Thu, Oct 05, 2023 at 11:56:55AM +0200, Lorenzo Pieralisi wrote:
> > On Tue, Sep 26, 2023 at 02:52:13PM +0100, Catalin Marinas wrote:
> > 
> > [...]
> > 
> > > Anyway, the text looks fine to me. Thanks for putting it together
> > > Lorenzo.
> > 
> > Thanks !
> > 
> > > One thing not mentioned here is that pci-vfio still maps such memory as
> > > Device-nGnRnE in user space and relaxing this potentially creates an
> > > alias. But such alias is only relevant of both the VMM and the VM try to
> > > access the same device which I doubt is a realistic scenario.
> > 
> > A revised log, FWIW:
> 
> What is the plan here, do you want Ankit to resend the series with
> this text?

I have put together the text to summarize the discussions that led to
this patch; if that helps somehow, yes please, make it the commit log.

> There were no comments on patch 1/2?

I don't have enough background knowledge to make a statement on this
series' code, I just wanted to make sure the architectural details are
in the commit log so that it can help the reviewers and, if it makes
it to the kernel, future developers.

Thanks,
Lorenzo
Will Deacon Oct. 12, 2023, 12:27 p.m. UTC | #13
On Thu, Sep 07, 2023 at 11:14:59AM -0700, ankita@nvidia.com wrote:
> From: Ankit Agrawal <ankita@nvidia.com>
> 
> Linux allows device drivers to map IO memory on a per-page basis using
> "write combining" or WC. This is often done using
> pgprot_writecombing(). The driver knows which pages can support WC
> access and the proper programming model to generate this IO. Generally
> the use case is to boost performance by using write combining to
> generate larger PCIe MemWr TLPs.
> 
> Allow VMs to select DEVICE_* or NORMAL_NC on a page by page basis for
> all IO memory. This puts the VM in charge of the memory attributes,
> and removes the KVM override to DEVICE_nGnRE.
> 
> Ultimately this makes pgprot_writecombing() work correctly in VMs and
> allows drivers like mlx5 to fully operate their HW.
> 
> After some discussions with ARM and CPU architects we reached the
> conclusion there was no need for KVM to prevent the VM from selecting
> between DEVICE_* and NORMAL_NC for IO memory in VMs. There was a fear
> that NORMAL_NC could result in uncontained failures, but upon deeper
> analysis it turns out there are already possible cases for uncontained
> failures with DEVICE types too. Ultimately the platform must be
> implemented in a way that ensures that all DEVICE_* and NORMAL_NC
> accesses have no uncontained failures.
> 
> Fortunately real platforms do tend to implement this.
> 
> This patch makes the VM's memory attributes behave as follows:
> 
>  S1           |   S2          |  Result
>  NORMAL-WB    |  NORMAL-NC    |  NORMAL-NC
>  NORMAL-WT    |  NORMAL-NC    |  NORMAL-NC
>  NORMAL-NC    |  NORMAL-NC    |  NORMAL-NC
>  DEVICE<attr> |  NORMAL-NC    |  DEVICE<attr>
> 
> See section D8.5.5 of DDI0487_I_a_a-profile_architecture_reference_manual.pdf
> for details.
> 
> Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
> ---
>  arch/arm64/include/asm/memory.h | 2 ++
>  arch/arm64/kvm/hyp/pgtable.c    | 2 +-
>  2 files changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/arm64/include/asm/memory.h b/arch/arm64/include/asm/memory.h
> index fde4186cc387..c247e5f29d5a 100644
> --- a/arch/arm64/include/asm/memory.h
> +++ b/arch/arm64/include/asm/memory.h
> @@ -147,6 +147,7 @@
>   * Memory types for Stage-2 translation
>   */
>  #define MT_S2_NORMAL		0xf
> +#define MT_S2_NORMAL_NC	0x5
>  #define MT_S2_DEVICE_nGnRE	0x1
>  
>  /*
> @@ -154,6 +155,7 @@
>   * Stage-2 enforces Normal-WB and Device-nGnRE
>   */
>  #define MT_S2_FWB_NORMAL	6
> +#define MT_S2_FWB_NORMAL_NC	5
>  #define MT_S2_FWB_DEVICE_nGnRE	1
>  
>  #ifdef CONFIG_ARM64_4K_PAGES
> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> index ccd291b6893d..a80949002191 100644
> --- a/arch/arm64/kvm/hyp/pgtable.c
> +++ b/arch/arm64/kvm/hyp/pgtable.c
> @@ -696,7 +696,7 @@ static int stage2_set_prot_attr(struct kvm_pgtable *pgt, enum kvm_pgtable_prot p
>  				kvm_pte_t *ptep)
>  {
>  	bool device = prot & KVM_PGTABLE_PROT_DEVICE;
> -	kvm_pte_t attr = device ? KVM_S2_MEMATTR(pgt, DEVICE_nGnRE) :
> +	kvm_pte_t attr = device ? KVM_S2_MEMATTR(pgt, NORMAL_NC) :
>  			    KVM_S2_MEMATTR(pgt, NORMAL);
>  	u32 sh = KVM_PTE_LEAF_ATTR_LO_S2_SH_IS;

I think this is putting the policy into the low-level page-table code,
which isn't where it belongs. KVM_PGTABLE_PROT_DEVICE means that the
mapping should be created with device attributes. If you want other
attributes, then please introduce another entry to 'kvm_pgtable_prot'
and pass that instead (e.g. KVM_PGTABLE_PROT_NC, which coincidentally
we already have in Android [1] ;).

Will

[1] https://android.googlesource.com/kernel/common/+/72cc19df8b71095f9740ff0ca6a75bf7ed27b0cd%5E%21/
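
(A hedged sketch of the direction suggested above; the flag name and bit value
are assumptions for illustration and the Android/upstream definitions may
differ. The idea is that callers request Normal-NC explicitly instead of the
page-table code repurposing KVM_PGTABLE_PROT_DEVICE.)

    /* Hypothetical new entry in enum kvm_pgtable_prot, next to
     * KVM_PGTABLE_PROT_DEVICE (bit value chosen only for the example):
     */
            KVM_PGTABLE_PROT_NC = BIT(4),

    static int stage2_set_prot_attr(struct kvm_pgtable *pgt,
                                    enum kvm_pgtable_prot prot, kvm_pte_t *ptep)
    {
            kvm_pte_t attr;

            if (prot & KVM_PGTABLE_PROT_DEVICE)
                    attr = KVM_S2_MEMATTR(pgt, DEVICE_nGnRE);
            else if (prot & KVM_PGTABLE_PROT_NC)
                    attr = KVM_S2_MEMATTR(pgt, NORMAL_NC);
            else
                    attr = KVM_S2_MEMATTR(pgt, NORMAL);
            /* ... rest of the function unchanged ... */
    }

The policy decision (which MMIO mappings actually get Normal-NC) then moves up
into the fault-handling code in mmu.c rather than living in pgtable.c.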
Will Deacon Oct. 12, 2023, 12:35 p.m. UTC | #14
On Thu, Oct 05, 2023 at 11:56:55AM +0200, Lorenzo Pieralisi wrote:
> On Tue, Sep 26, 2023 at 02:52:13PM +0100, Catalin Marinas wrote:
> 
> [...]
> 
> > Anyway, the text looks fine to me. Thanks for putting it together
> > Lorenzo.
> 
> Thanks !
> 
> > One thing not mentioned here is that pci-vfio still maps such memory as
> > Device-nGnRnE in user space and relaxing this potentially creates an
> > alias. But such alias is only relevant of both the VMM and the VM try to
> > access the same device which I doubt is a realistic scenario.
> 
> A revised log, FWIW:

Thanks for putting this together, Lorenzo. Just one thing below:

> ---
> Currently, KVM for ARM64 maps at stage 2 memory that is
> considered device (ie it is not RAM) with DEVICE_nGnRE
> memory attributes; this setting overrides (as per the ARM
> architecture [1]) any device MMIO mapping present at stage
> 1, resulting in a set-up whereby a guest operating system
> can't determine device MMIO mapping memory attributes on its
> own but it is always overridden by the KVM stage 2 default.
> 
> This set-up does not allow guest operating systems to select
> device memory attributes on a page by page basis independently
> from KVM stage-2 mappings (refer to [1], "Combining stage 1 and stage
> 2 memory type attributes"), which turns out to be an issue in that
> guest operating systems (eg Linux) may request to map
> devices MMIO regions with memory attributes that guarantee
> better performance (eg gathering attribute - that for some
> devices can generate larger PCIe memory writes TLPs)
> and specific operations (eg unaligned transactions) such as
> the NormalNC memory type.
> 
> The default device stage 2 mapping was chosen in KVM
> for ARM64 since it was considered safer (ie it would
> not allow guests to trigger uncontained failures
> ultimately crashing the machine) but this turned out
> to be imprecise.
> 
> Failures containability is a property of the platform
> and is independent from the memory type used for MMIO
> device memory mappings (ie DEVICE_nGnRE memory type is
> even more problematic than NormalNC in terms of containability
> since eg aborts triggered on loads cannot be made synchronous,
> which make them harder to contain); this means that,
> regardless of the combined stage1+stage2 mappings a
> platform is safe if and only if device transactions cannot trigger
> uncontained failures; reworded, the default KVM device
> stage 2 memory attributes play no role in making device
> assignment safer for a given platform and therefore can
> be relaxed.
> 
> For all these reasons, relax the KVM stage 2 device
> memory attributes from DEVICE_nGnRE to NormalNC.

The reasoning above suggests to me that this should probably just be
Normal cacheable, as that is what actually allows the guest to control
the attributes. So what is the rationale behind stopping at Normal-NC?

Will
Jason Gunthorpe Oct. 12, 2023, 1:20 p.m. UTC | #15
On Thu, Oct 12, 2023 at 01:35:41PM +0100, Will Deacon wrote:

> > Failures containability is a property of the platform
> > and is independent from the memory type used for MMIO
> > device memory mappings (ie DEVICE_nGnRE memory type is
> > even more problematic than NormalNC in terms of containability
> > since eg aborts triggered on loads cannot be made synchronous,
> > which make them harder to contain); this means that,
> > regardless of the combined stage1+stage2 mappings a
> > platform is safe if and only if device transactions cannot trigger
> > uncontained failures; reworded, the default KVM device
> > stage 2 memory attributes play no role in making device
> > assignment safer for a given platform and therefore can
> > be relaxed.
> > 
> > For all these reasons, relax the KVM stage 2 device
> > memory attributes from DEVICE_nGnRE to NormalNC.
> 
> The reasoning above suggests to me that this should probably just be
> Normal cacheable, as that is what actually allows the guest to control
> the attributes. So what is the rationale behind stopping at Normal-NC?

I agree it would be very nice if all the memory in the guest could
just be cachable and the guest could select everything.

However, I think Lorenzo overstated the argument. The off-list
discussion was focused on NormalNC for MMIO only. Nobody raised the
idea that cachable was safe from uncontained errors for MMIO.

I'm looking through the conversations and I wouldn't jump to
concluding that "cachable MMIO" is safe from uncontained failures.

Catalin has already raised a number of concerns in the other patch
about making actual "designed to be cachable memory" into KVM
cachable.

Regards,
Jason
Catalin Marinas Oct. 12, 2023, 1:53 p.m. UTC | #16
On Thu, Oct 12, 2023 at 01:35:41PM +0100, Will Deacon wrote:
> On Thu, Oct 05, 2023 at 11:56:55AM +0200, Lorenzo Pieralisi wrote:
> > For all these reasons, relax the KVM stage 2 device
> > memory attributes from DEVICE_nGnRE to NormalNC.
> 
> The reasoning above suggests to me that this should probably just be
> Normal cacheable, as that is what actually allows the guest to control
> the attributes. So what is the rationale behind stopping at Normal-NC?

It's more like we don't have any clue on what may happen. MTE is
obviously a case where it can go wrong (we can blame the architecture
design here) but I recall years ago where a malicious guest could bring
the platform down by mapping the GIC CPU interface as cacheable.

Not sure how error containment works with cacheable memory. A cacheable
access to a device may stay in the cache a lot longer after the guest
has been scheduled out, only evicted at some random time. We may no
longer be able to associate it with the guest, especially if the guest
exited. Also not sure about claiming back the device after killing the
guest, do we need cache maintenance?

So, for now I'd only relax this if we know there's RAM(-like) on the
other side and won't trigger some potentially uncontainable errors as a
result.
Lorenzo Pieralisi Oct. 12, 2023, 2:29 p.m. UTC | #17
On Thu, Oct 12, 2023 at 10:20:45AM -0300, Jason Gunthorpe wrote:
> On Thu, Oct 12, 2023 at 01:35:41PM +0100, Will Deacon wrote:
> 
> > > Failures containability is a property of the platform
> > > and is independent from the memory type used for MMIO
> > > device memory mappings (ie DEVICE_nGnRE memory type is
> > > even more problematic than NormalNC in terms of containability
> > > since eg aborts triggered on loads cannot be made synchronous,
> > > which make them harder to contain); this means that,
> > > regardless of the combined stage1+stage2 mappings a
> > > platform is safe if and only if device transactions cannot trigger
> > > uncontained failures; reworded, the default KVM device
> > > stage 2 memory attributes play no role in making device
> > > assignment safer for a given platform and therefore can
> > > be relaxed.
> > > 
> > > For all these reasons, relax the KVM stage 2 device
> > > memory attributes from DEVICE_nGnRE to NormalNC.
> > 
> > The reasoning above suggests to me that this should probably just be
> > Normal cacheable, as that is what actually allows the guest to control
> > the attributes. So what is the rationale behind stopping at Normal-NC?
> 
> I agree it would be very nice if all the memory in the guest could
> just be cachable and the guest could select everything.
> 
> However, I think Lorenzo overstated the argument.
>
> The off-list discussion was focused on NormalNC for MMIO only. Nobody
> raised the idea that cachable was safe from uncontained errors for
> MMIO.

True, it should be made clearer ie "independent from the
non-cacheable/uncacheable memory type...", please update the
log accordingly, forgive me the confusion I raised.

Lorenzo
 
> I'm looking through the conversations and I wouldn't jump to
> concluding that "cachable MMIO" is safe from uncontained failures.
> 
> Catalin has already raised a number of concerns in the other patch
> about making actual "designed to be cachable memory" into KVM
> cachable.
> 
> Regards,
> Jason
Will Deacon Oct. 12, 2023, 2:48 p.m. UTC | #18
On Thu, Oct 12, 2023 at 02:53:21PM +0100, Catalin Marinas wrote:
> On Thu, Oct 12, 2023 at 01:35:41PM +0100, Will Deacon wrote:
> > On Thu, Oct 05, 2023 at 11:56:55AM +0200, Lorenzo Pieralisi wrote:
> > > For all these reasons, relax the KVM stage 2 device
> > > memory attributes from DEVICE_nGnRE to NormalNC.
> > 
> > The reasoning above suggests to me that this should probably just be
> > Normal cacheable, as that is what actually allows the guest to control
> > the attributes. So what is the rationale behind stopping at Normal-NC?
> 
> It's more like we don't have any clue on what may happen. MTE is
> obviously a case where it can go wrong (we can blame the architecture
> design here) but I recall years ago where a malicious guest could bring
> the platform down by mapping the GIC CPU interface as cacheable.

... and do we know that isn't the case for non-cacheable? If not, why not?

Also, are you saying we used to map the GIC CPU interface as cacheable
at stage-2? I remember exclusives causing a problem, but I don't remember
the guest having a cacheable mapping.

> Not sure how error containment works with cacheable memory. A cacheable
> access to a device may stay in the cache a lot longer after the guest
> has been scheduled out, only evicted at some random time.

But similarly, non-cacheable stores can be buffered. Why isn't that a
problem?

> We may no longer be able to associate it with the guest, especially if the
> guest exited. Also not sure about claiming back the device after killing
> the guest, do we need cache maintenance?

Claiming back the device also seems strange if the guest has been using
non-cacheable accesses since I think you could get write merging and
reordering with subsequent device accesses trying to reset the device.

> So, for now I'd only relax this if we know there's RAM(-like) on the
> other side and won't trigger some potentially uncontainable errors as a
> result.

I guess my wider point is that I'm not convinced that non-cacheable is
actually much better and I think we're going way off the deep end looking
at what particular implementations do and trying to justify to ourselves
that non-cacheable is safe, even though it's still a normal memory type
at the end of the day.

Obviously, it's up to Marc and Oliver if they want to do this, but I'm
wary without an official statement from Arm to say that Normal-NC is
correct. There's mention of such a statement in the cover letter:

  > We hope ARM will publish information helping platform designers
  > follow these guidelines.

but imo we shouldn't merge this without either:

  (a) _Architectural_ guidance (as opposed to some random whitepaper or
      half-baked certification scheme).

- or -

  (b) A concrete justification based on the current architecture as to
      why Normal-NC is the right thing to do for KVM.

The current wording talks about use-cases (I get this) and error containment
(it's a property of the system) but doesn't talk at all about why Normal-NC
is the right result.

Will
Jason Gunthorpe Oct. 12, 2023, 3:44 p.m. UTC | #19
On Thu, Oct 12, 2023 at 03:48:08PM +0100, Will Deacon wrote:

> I guess my wider point is that I'm not convinced that non-cacheable is
> actually much better and I think we're going way off the deep end looking
> at what particular implementations do and trying to justify to ourselves
> that non-cacheable is safe, even though it's still a normal memory type
> at the end of the day.

When we went over this with ARM it became fairly clear there wasn't an
official statement that Device-* is safe from uncontained
failures. For instance, looking at the actual IP, our architects
pointed out that ARM IP already provides ways for Device-* to trigger
uncontained failures today.

We then mutually concluded that KVM safe implementations must already
be preventing uncontained failures for Device-* at the system level
and that same prevention will carry over to NormalNC as well.

IMHO, this seems to be a gap where ARM has not fully defined when
uncontained failures are allowed and left that as an implementation
choice.

In other words, KVM safety around uncontained failure is not a
property that can be reasoned about from the ARM architecture alone.

> The current wording talks about use-cases (I get this) and error containment
> (it's a property of the system) but doesn't talk at all about why Normal-NC
> is the right result.

Given that Device-* and NormalNC are equally implementation defined
with regards to uncontained failures, NormalNC allows more VM
functionality.

Further, we have a broad agreement that this use case is important,
and that NormalNC is the correct way to address it.

I think you are right to ask for more formality from ARM team but also
we shouldn't hold up fixing real functional bugs in real shipping
server ARM products.

Jason
Will Deacon Oct. 12, 2023, 4:39 p.m. UTC | #20
On Thu, Oct 12, 2023 at 12:44:39PM -0300, Jason Gunthorpe wrote:
> On Thu, Oct 12, 2023 at 03:48:08PM +0100, Will Deacon wrote:
> 
> > I guess my wider point is that I'm not convinced that non-cacheable is
> > actually much better and I think we're going way off the deep end looking
> > at what particular implementations do and trying to justify to ourselves
> > that non-cacheable is safe, even though it's still a normal memory type
> > at the end of the day.
> 
> When we went over this with ARM it became fairly clear there wasn't an
> official statement that Device-* is safe from uncontained
> failures. For instance, looking at the actual IP, our architects
> pointed out that ARM IP already provides ways for Device-* to trigger
> uncontained failures today.
> 
> We then mutually concluded that KVM safe implementations must already
> be preventing uncontained failures for Device-* at the system level
> and that same prevention will carry over to NormalNC as well.
> 
> IMHO, this seems to be a gap where ARM has not fully defined when
> uncontained failures are allowed and left that as an implementation
> choice.
> 
> In other words, KVM safety around uncontained failure is not a
> property that can be reasoned about from the ARM architecture alone.
> 
> > The current wording talks about use-cases (I get this) and error containment
> > (it's a property of the system) but doesn't talk at all about why Normal-NC
> > is the right result.
> 
> Given that Device-* and NormalNC are equally implementation defined
> with regards to uncontained failures, NormalNC allows more VM
> functionality.
> 
> Further, we have a broad agreement that this use case is important,
> and that NormalNC is the correct way to address it.
> 
> I think you are right to ask for more formality from ARM team but also
> we shouldn't hold up fixing real functional bugs in real shipping
> server ARM products.

All I'm asking for is justification as to why Normal-NC is the right
memory type rather than any other normal memory type. If it's not possible
to explain that architecturally, then I'm not sure this change belongs in
architecture code.

Ultimately, we need to be able to maintain this stuff, so we can't just
blindly implement changes based on a combination of off-list discussions
and individual product needs. For example, if somebody else rocks up
tomorrow and asks for this to be Normal-writethrough, what grounds do
we have to say no if we've taken this change already?

So please let's get to a point where we can actually reason about this.

Will
Catalin Marinas Oct. 12, 2023, 5:26 p.m. UTC | #21
On Thu, Oct 12, 2023 at 03:48:08PM +0100, Will Deacon wrote:
> On Thu, Oct 12, 2023 at 02:53:21PM +0100, Catalin Marinas wrote:
> > On Thu, Oct 12, 2023 at 01:35:41PM +0100, Will Deacon wrote:
> > > On Thu, Oct 05, 2023 at 11:56:55AM +0200, Lorenzo Pieralisi wrote:
> > > > For all these reasons, relax the KVM stage 2 device
> > > > memory attributes from DEVICE_nGnRE to NormalNC.
> > > 
> > > The reasoning above suggests to me that this should probably just be
> > > Normal cacheable, as that is what actually allows the guest to control
> > > the attributes. So what is the rationale behind stopping at Normal-NC?
> > 
> > It's more like we don't have any clue on what may happen. MTE is
> > obviously a case where it can go wrong (we can blame the architecture
> > design here) but I recall years ago where a malicious guest could bring
> > the platform down by mapping the GIC CPU interface as cacheable.
> 
> ... and do we know that isn't the case for non-cacheable? If not, why not?

Trying to get this information from the hw folk and architects is really
hard. So we only relax it one step at a time ;). But given the MTE
problems, I'd not go for cacheable Stage 2 unless we have FEAT_MTE_PERM
implemented (both hw and sw). S2 cacheable allows the guest to map it as
Normal Tagged.

> Also, are you saying we used to map the GIC CPU interface as cacheable
> at stage-2? I remember exclusives causing a problem, but I don't remember
> the guest having a cacheable mapping.

The guest never had a cacheable mapping, IIRC it was more of a
theoretical problem, plugging a hole. Now, maybe I misremember, it's
pretty hard to search the git logs given how the code was moved around
(but I do remember the building we were in when discussing this, it was
on the ground floor ;)).

> > Not sure how error containment works with cacheable memory. A cacheable
> > access to a device may stay in the cache a lot longer after the guest
> > has been scheduled out, only evicted at some random time.
> 
> But similarly, non-cacheable stores can be buffered. Why isn't that a
> problem?

RAS might track this for cacheable mappings as well, I just haven't
figured out the details.

> > We may no longer be able to associate it with the guest, especially if the
> > guest exited. Also not sure about claiming back the device after killing
> > the guest, do we need cache maintenance?
> 
> Claiming back the device also seems strange if the guest has been using
> non-cacheable accesses since I think you could get write merging and
> reordering with subsequent device accesses trying to reset the device.

True. Not sure we have a good story here (maybe reinvent the DWB barrier ;)).

> > So, for now I'd only relax this if we know there's RAM(-like) on the
> > other side and won't trigger some potentially uncontainable errors as a
> > result.
> 
> I guess my wider point is that I'm not convinced that non-cacheable is
> actually much better and I think we're going way off the deep end looking
> at what particular implementations do and trying to justify to ourselves
> that non-cacheable is safe, even though it's still a normal memory type
> at the end of the day.

Is this about Device vs NC or Device/NC vs Normal Cacheable? The
justification for the former has been summarised in Lorenzo's write-up.
How the hardware behaves depends a lot on the RAS implementation.
The BSA has some statements but not sure it covers everything.

Things can go wrong but that's not because Device does anything better.
Given the RAS implementation, an external abort caused on Device memory
(e.g. a wrong-size access) is uncontainable. For Normal NC it can be
contained (I can dig out the reasoning behind this if you want, IIUC
something to do with not being able to cancel an already issued Device
access since such accesses don't allow speculation due to side-effects;
for Normal NC, it's just about the software not getting the data).

> Obviously, it's up to Marc and Oliver if they want to do this, but I'm
> wary without an official statement from Arm to say that Normal-NC is
> correct. There's mention of such a statement in the cover letter:
> 
>   > We hope ARM will publish information helping platform designers
>   > follow these guidelines.
> 
> but imo we shouldn't merge this without either:
> 
>   (a) _Architectural_ guidance (as opposed to some random whitepaper or
>       half-baked certification scheme).

Well, you know the story, the architects will probably make it a SoC or
integration issue, PCIe etc., not something that can live in the Arm
ARM. The best we could get is more recommendations in the RAS spec
around containment but not for things that might happen outside the CPU,
e.g. PCIe root complex.

> - or -
> 
>   (b) A concrete justification based on the current architecture as to
>       why Normal-NC is the right thing to do for KVM.

To put it differently, we don't have any strong arguments why Device is
the right thing to do. We chose Device based on some understanding
software people had about how the hardware behaves, which apparently
wasn't entirely correct (and summarised by Lorenzo).
Jason Gunthorpe Oct. 12, 2023, 6:36 p.m. UTC | #22
On Thu, Oct 12, 2023 at 05:39:31PM +0100, Will Deacon wrote:

> All I'm asking for is justification as to why Normal-NC is the right
> memory type rather than any other normal memory type. If it's not possible
> to explain that architecturally, then I'm not sure this change belongs in
> architecture code.

Well, I think Catalin summarized it nicely, I second his ask at the end:

We are basically at your scenario below - can you justify why
DEVICE_nGnRE is correct today, architecturally? We could not. Earlier
someone said uncontained failure prevention, but a deep dive on that
found it is not so.

> Ultimately, we need to be able to maintain this stuff, so we can't just
> blindly implement changes based on a combination of off-list discussions
> and individual product needs. For example, if somebody else rocks up
> tomorrow and asks for this to be Normal-writethrough, what grounds do
> we have to say no if we've taken this change already?

Hmm. Something got lost here.

This patch is talking about the S2 MemAttr[3:0]. There are only 4
relevant values (when FEAT_S2FWB) (see D5.5.5 in ARM DDI 0487F.c)

 0b001 - Today: force VM to be Device nGnRE

 0b101 - Proposed: prevent the VM from selecting cacheable, allow it to
         choose Device-* or NormalNC

 0b110 - Force write back. Would totally break MMIO, ruled out

 0b111 - Allow the VM to select anything, including cacheable.
         This is nice, but summarizing Catalin's remarks:
           a) The kernel can't do cache maintenance (defeats FWB)
           b) Catalin's concerns about MTE and Atomics triggering
	      uncontained failures
           c) It is unclear about uncontained failures for cacheable
              in general

So the only argument is 001 v 101 v 111

Catalin has explained why 111 is not good as a default. Most likely
with some future ACPI/BSA/etc work, and some cache syncing in the
kernel, someone could define a way to allow 111 as a choice. So, I
think we can rule out 111 as being the default choice without the
kernel getting more detailed system-level knowledge.

Further, patch 1 is about making 110 a driver-opt-in choice for VFIO
memory which reduces the need for 111.

For 001 v 101: 001 is less functional in the VM. 001 offers no
advantage.

!FEAT_S2FWB has similar choices and a similar argument.

So, IMHO, if someone comes to ask for something it would be to ask for
111, and we do have a set of pretty clear reasons why it should not be
111 (indeed we wanted to ask for that instead of patch 1, but there
are too many things required to get there).
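
For reference, here are the four encodings above as constants (a sketch
only; the names are invented for readability, not taken from the kernel
headers):

  /* FEAT_S2FWB stage-2 MemAttr[2:0] encodings, per the list above.           */
  #define S2FWB_MEMATTR_DEVICE     0x1 /* 0b001: force Device-nGnRE (today)   */
  #define S2FWB_MEMATTR_NORMAL_NC  0x5 /* 0b101: Device-* or Normal-NC        */
  #define S2FWB_MEMATTR_NORMAL_WB  0x6 /* 0b110: force write-back, breaks MMIO */
  #define S2FWB_MEMATTR_GUEST_SEL  0x7 /* 0b111: anything, incl. cacheable    */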

Jason
Will Deacon Oct. 13, 2023, 9:29 a.m. UTC | #23
On Thu, Oct 12, 2023 at 06:26:01PM +0100, Catalin Marinas wrote:
> On Thu, Oct 12, 2023 at 03:48:08PM +0100, Will Deacon wrote:
> > Claiming back the device also seems strange if the guest has been using
> > non-cacheable accesses since I think you could get write merging and
> > reordering with subsequent device accesses trying to reset the device.
> 
> True. Not sure we have a good story here (maybe reinvent the DWB barrier ;)).

We do have a good story for this part: use Device-nGnRE!

> > > So, for now I'd only relax this if we know there's RAM(-like) on the
> > > other side and won't trigger some potentially uncontainable errors as a
> > > result.
> > 
> > I guess my wider point is that I'm not convinced that non-cacheable is
> > actually much better and I think we're going way off the deep end looking
> > at what particular implementations do and trying to justify to ourselves
> > that non-cacheable is safe, even though it's still a normal memory type
> > at the end of the day.
> 
> Is this about Device vs NC or Device/NC vs Normal Cacheable? The
> justification for the former has been summarised in Lorenzo's write-up.
> How the hardware behaves, it depends a lot on the RAS implementation.
> The BSA has some statements but not sure it covers everything.

I don't know how to be more clear, but I'm asking why Normal-NC is the right
memory type to use. Jason's explanation on the other thread about how it's
basically the only option with FWB (with some hand-waving about why
Normal-cacheable isn't safe) will have to do, but it needs to be in the
commit message.

The host maps MMIO with Device-nGnRE. Allowing a guest to map it as Normal
surely means the host is going to need additional logic to deal with that?
We briefly discussed claiming back a device above and I'm not convinced
that the code we have for doing that today will work correctly if the
guest has issued a bunch of Normal-NC stores prior to the device being
unmapped.

Could we change these patches so that the memory type of the stage-1 VMA
in the VMM is reflected in the stage-2? In other words, continue to use
Device mappings at stage-2 for I/O but relax to Normal-NC if that's
how the VMM has it mapped?

> Things can go wrong but that's not because Device does anything better.
> Given the RAS implementation, external aborts caused on Device memory
> (e.g. wrong size access) is uncontainable. For Normal NC it can be
> contained (I can dig out the reasoning behind this if you want, IIUC
> something to do with not being able to cancel an already issued Device
> access since such accesses don't allow speculation due to side-effects;
> for Normal NC, it's just about the software not getting the data).

I really think these details belong in the commit message.

> > Obviously, it's up to Marc and Oliver if they want to do this, but I'm
> > wary without an official statement from Arm to say that Normal-NC is
> > correct. There's mention of such a statement in the cover letter:
> > 
> >   > We hope ARM will publish information helping platform designers
> >   > follow these guidelines.
> > 
> > but imo we shouldn't merge this without either:
> > 
> >   (a) _Architectural_ guidance (as opposed to some random whitepaper or
> >       half-baked certification scheme).
> 
> Well, you know the story, the architects will probably make it a SoC or
> integration issue, PCIe etc., not something that can live in the Arm
> ARM. The best we could get is more recommendations in the RAS spec
> around containment but not for things that might happen outside the CPU,
> e.g. PCIe root complex.

The Arm ARM _does_ mention PCI config space when talking about early write
acknowledgement, so there's some precedence for providing guidance around
which memory types to use.

> > - or -
> > 
> >   (b) A concrete justification based on the current architecture as to
> >       why Normal-NC is the right thing to do for KVM.
> 
> To put it differently, we don't have any strong arguments why Device is
> the right thing to do. We chose Device based on some understanding
> software people had about how the hardware behaves, which apparently
> wasn't entirely correct (and summarised by Lorenzo).

I think we use Device because that's what the host uses in its stage-1
and mismatched aliases are bad.

Will
Will Deacon Oct. 13, 2023, 9:29 a.m. UTC | #24
On Thu, Oct 12, 2023 at 03:36:24PM -0300, Jason Gunthorpe wrote:
> On Thu, Oct 12, 2023 at 05:39:31PM +0100, Will Deacon wrote:
> 
> > All I'm asking for is justification as to why Normal-NC is the right
> > memory type rather than any other normal memory type. If it's not possible
> > to explain that architecturally, then I'm not sure this change belongs in
> > architecture code.
> 
> Well, I think Catalin summarized it nicely, I second his ask at the end:
> 
> We are basically at your scenario below - can you justify why
> DEVICE_GRE is correct today, architecturally? We could not. Earlier
> someone said uncontained failure prevention, but a deep dive on that
> found it is not so.

This logic can be used to justify the use of any other memory type. I'm
asking specifically why Normal-NC is correct.

> > Ultimately, we need to be able to maintain this stuff, so we can't just
> > blindly implement changes based on a combination of off-list discussions
> > and individual product needs. For example, if somebody else rocks up
> > tomorrow and asks for this to be Normal-writethrough, what grounds do
> > we have to say no if we've taken this change already?
> 
> Hmm. Something got lost here.

Well, I didn't assume FWB since these patches change the behaviour
regardless...

> This patch is talking about the S2 MemAttr[3:0]. There are only 4
> relevant values (when FEAT_S2FWB) (see D5.5.5 in ARM DDI 0487F.c)
> 
>  0b001 - Today: force VM to be Device nGnRE
> 
>  0b101 - Proposed: prevent the VM from selecting cachable, allow it to
>          choose Device-* or NormalNC
> 
>  0b110 - Force write back. Would totally break MMIO, ruled out
> 
>  0b111 - Allow the VM to select anything, including cachable.
>          This is nice, but summarizing Catalin's remarks:
>            a) The kernel can't do cache maintenance (defeats FWB)
>            b) Catalin's concerns about MTE and Atomics triggering
> 	      uncontained failures
>            c) It is unclear about uncontained failures for cachable
>               in general
> 
> So the only argument is 001 v 110 v 111

Ok, so I can see why you end up with Normal-NC on a system that implements
FWB. Systems without FWB have other options, but you can reasonably argue
that having consistent behaviour between the two configurations is
desirable. Please can you add that to the commit message?

I still don't understand how to reclaim an MMIO region that the guest has
mapped Normal-NC (see the thread with Catalin).

Will
Catalin Marinas Oct. 13, 2023, 1:08 p.m. UTC | #25
On Fri, Oct 13, 2023 at 10:29:35AM +0100, Will Deacon wrote:
> On Thu, Oct 12, 2023 at 06:26:01PM +0100, Catalin Marinas wrote:
> > On Thu, Oct 12, 2023 at 03:48:08PM +0100, Will Deacon wrote:
> > > Claiming back the device also seems strange if the guest has been using
> > > non-cacheable accesses since I think you could get write merging and
> > > reordering with subsequent device accesses trying to reset the device.
> > 
> > True. Not sure we have a good story here (maybe reinvent the DWB barrier ;)).
> 
> We do have a good story for this part: use Device-nGnRE!

Don't we actually need Device-nGnRnE for this, coupled with a DSB for
endpoint completion?

Device-nGnRE may be sufficient as a read from that device would ensure
that the previous write is observable (potentially with a DMB if
accessing separate device regions) but I don't think we do this now
either. Even this, isn't it device-specific? I don't know enough about
PCIe, posted writes, reordering, maybe others can shed some light.

For Normal NC, if the access doesn't have side-effects (or rather the
endpoint is memory-like), I think we are fine. The Stage 2 unmapping +
TLBI + DSB (DVM + DVMSync) should ensure that a pending write by the CPU
was pushed sufficiently far as not to affect subsequent writes by other
CPUs.

For I/O accesses that change some state of the device, I'm not sure the
TLBI+DSB is sufficient. But I don't think Device nGnRE is either, only
nE + DSB as long as the PCIe device plays along nicely.

> Could we change these patches so that the memory type of the stage-1 VMA
> in the VMM is reflected in the stage-2? In other words, continue to use
> Device mappings at stage-2 for I/O but relax to Normal-NC if that's
> how the VMM has it mapped?

We've been through this and it's not feasible. The VMM does not have
detailed knowledge of the BARs of the PCIe device it is mapping (and the
prefetchable BAR attribute is useless). It may end up with a Normal
mapping of a BAR with read side-effects. It's only the guest driver that
knows all the details. The safest is for the VMM to keep it as Device (I
think vfio-pci goes for the strongest nGnRnE).

Yes, we end up with mismatched aliases but they only matter if the VMM
also accesses the I/O range via its own mapping. So far I haven't seen a
case that suggests this.

> > Things can go wrong but that's not because Device does anything better.
> > Given the RAS implementation, external aborts caused on Device memory
> > (e.g. wrong size access) is uncontainable. For Normal NC it can be
> > contained (I can dig out the reasoning behind this if you want, IIUC
> > something to do with not being able to cancel an already issued Device
> > access since such accesses don't allow speculation due to side-effects;
> > for Normal NC, it's just about the software not getting the data).
> 
> I really think these details belong in the commit message.

I guess another task for Lorenzo ;).

> > > Obviously, it's up to Marc and Oliver if they want to do this, but I'm
> > > wary without an official statement from Arm to say that Normal-NC is
> > > correct. There's mention of such a statement in the cover letter:
> > > 
> > >   > We hope ARM will publish information helping platform designers
> > >   > follow these guidelines.
> > > 
> > > but imo we shouldn't merge this without either:
> > > 
> > >   (a) _Architectural_ guidance (as opposed to some random whitepaper or
> > >       half-baked certification scheme).
> > 
> > Well, you know the story, the architects will probably make it a SoC or
> > integration issue, PCIe etc., not something that can live in the Arm
> > ARM. The best we could get is more recommendations in the RAS spec
> > around containment but not for things that might happen outside the CPU,
> > e.g. PCIe root complex.
> 
> The Arm ARM _does_ mention PCI config space when talking about early write
> acknowledgement, so there's some precedence for providing guidance around
> which memory types to use.

Ah, yes, it looks like it does, though mostly around the config space.
We could ask them to add some notes but I don't think we have the
problem well defined yet.

Trying to restate what we are aiming for: the guest driver knows what attributes it
needs and would set the appropriate attributes: Device or Normal. KVM's
role is not to fix bugs in the guest driver by constraining the
attributes but rather to avoid potential security issues with malicious
(or buggy) guests:

1) triggering uncontained errors

2) accessing memory that it shouldn't (like the MTE tag access)

3) causing delayed side-effects after the host reclaims the device

... anything else?

For (1), Normal NC vs. Device doesn't make any difference, slightly
better for the former. (2) so far is solved by not allowing Cacheable
(or disabling MTE, enabling FEAT_MTE_PERM in the future). I'm now trying
to understand (3), I think it needs more digging.

> > >   (b) A concrete justification based on the current architecture as to
> > >       why Normal-NC is the right thing to do for KVM.
> > 
> > To put it differently, we don't have any strong arguments why Device is
> > the right thing to do. We chose Device based on some understanding
> > software people had about how the hardware behaves, which apparently
> > wasn't entirely correct (and summarised by Lorenzo).
> 
> I think we use Device because that's what the host uses in its stage-1
> and mismatched aliases are bad.

They are "constrained" bad ;).
Jason Gunthorpe Oct. 13, 2023, 1:45 p.m. UTC | #26
On Fri, Oct 13, 2023 at 02:08:10PM +0100, Catalin Marinas wrote:
> On Fri, Oct 13, 2023 at 10:29:35AM +0100, Will Deacon wrote:
> > On Thu, Oct 12, 2023 at 06:26:01PM +0100, Catalin Marinas wrote:
> > > On Thu, Oct 12, 2023 at 03:48:08PM +0100, Will Deacon wrote:
> > > > Claiming back the device also seems strange if the guest has been using
> > > > non-cacheable accesses since I think you could get write merging and
> > > > reordering with subsequent device accesses trying to reset the device.
> > > 
> > > True. Not sure we have a good story here (maybe reinvent the DWB barrier ;)).
> > 
> > We do have a good story for this part: use Device-nGnRE!
> 
> Don't we actually need Device-nGnRnE for this, coupled with a DSB for
> endpoint completion?
> 
> Device-nGnRE may be sufficient as a read from that device would ensure
> that the previous write is observable (potentially with a DMB if
> accessing separate device regions) but I don't think we do this now
> either. Even this, isn't it device-specific? I don't know enough about
> PCIe, posted writes, reordering, maybe others can shed some light.
> 
> For Normal NC, if the access doesn't have side-effects (or rather the
> endpoint is memory-like), I think we are fine. The Stage 2 unmapping +
> TLBI + DSB (DVM + DVMSync) should ensure that a pending write by the CPU
> was pushed sufficiently far as not to affect subsequent writes by other
> CPUs.
> 
> For I/O accesses that change some state of the device, I'm not sure the
> TLBI+DSB is sufficient. But I don't think Device nGnRE is either, only
> nE + DSB as long as the PCIe device plays along nicely.

Can someone explain this concern a little more simply please?

Let's try something simpler. I have no KVM. My kernel driver 
creates a VMA with pgprot_writecombine (NormalNC). Userspace does a
write to the NormalNC mapping and immediately unmaps the VMA.

What is the issue?

And then how does making KVM the thing that creates the NormalNC
change this?
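
To make that scenario concrete, here is a minimal sketch of such a
driver's mmap() handler (the "hypo_" names are hypothetical;
pgprot_writecombine() and io_remap_pfn_range() are the real helpers a
driver would use for this):

  #include <linux/fs.h>
  #include <linux/mm.h>

  struct hypo_dev {                        /* hypothetical driver state   */
          phys_addr_t bar_phys;            /* physical address of the BAR */
  };

  static int hypo_mmap(struct file *file, struct vm_area_struct *vma)
  {
          struct hypo_dev *dev = file->private_data;
          unsigned long size = vma->vm_end - vma->vm_start;

          /* Map the BAR "write combined", i.e. Normal-NC on arm64 */
          vma->vm_page_prot = pgprot_writecombine(vma->vm_page_prot);
          return io_remap_pfn_range(vma, vma->vm_start,
                                    dev->bar_phys >> PAGE_SHIFT,
                                    size, vma->vm_page_prot);
  }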

Not knowing the whole details, here is my story about how it should work:

Unmapping the VMAs must already have some NormalNC-friendly ordering
barrier across all CPUs or we have a bigger problem. This barrier
definitely must close write combining.

VFIO issues a config space write to reset the PCI function.  Config
space writes MUST NOT write combine with anything. This is already
impossible for PCIe since they are different TLP types at the PCIe
level.

By the PCIe rules, config space write must order strictly after all
other CPU's accesses. Once the reset non-posted write returns back to
VFIO we know that:

 1) There is no reference in any CPU page table to the MMIO PFN
 2) No CPU has pending data in any write buffer
 3) The interconnect and PCIe fabric have no inflight operations
 4) The device is in a clean post-reset state

?
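
Written out as a sketch (a hypothetical helper, not actual VFIO code),
the sequence being relied on is roughly:

  #include <linux/mm.h>
  #include <linux/pci.h>

  static void hypo_reclaim_and_reset(struct address_space *mapping,
                                     struct pci_dev *pdev)
  {
          /* 1) Zap every CPU mapping of the BAR; the unmap path does the
           *    TLBI + DSB that closes any write combining.              */
          unmap_mapping_range(mapping, 0, 0, 1);

          /* 2) Reset via config space: non-posted writes plus config
           *    reads that cannot combine with earlier BAR MMIO and must
           *    order strictly after it per the PCIe rules above.        */
          pci_reset_function(pdev);
  }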

> knows all the details. The safest is for the VMM to keep it as Device (I
> think vfio-pci goes for the strongest nGnRnE).

We are probably going to allow VFIO to let userspace pick if it should
be pgprot_device or pgprot_writecombine.

The alias issue could be resolved by teaching KVM how to insert a
physical PFN based on some VFIO FD/dmabuf rather than a VMA so that
the PFNs are never mapped in the hypervisor side.

Jason
Lorenzo Pieralisi Oct. 13, 2023, 3:28 p.m. UTC | #27
On Fri, Oct 13, 2023 at 02:08:10PM +0100, Catalin Marinas wrote:

[...]

> Yes, we end up with mismatched aliases but they only matter if the VMM
> also accesses the I/O range via its own mapping. So far I haven't seen
> case that suggests this.
> 
> > > Things can go wrong but that's not because Device does anything better.
> > > Given the RAS implementation, external aborts caused on Device memory
> > > (e.g. wrong size access) is uncontainable. For Normal NC it can be
> > > contained (I can dig out the reasoning behind this if you want, IIUC
> > > something to do with not being able to cancel an already issued Device
> > > access since such accesses don't allow speculation due to side-effects;
> > > for Normal NC, it's just about the software not getting the data).
> > 
> > I really think these details belong in the commit message.
> 
> I guess another task for Lorenzo ;).

I will do. I am starting to wonder, though, whether this documentation
belongs in this commit log only or at the Documentation/arch/arm64 level
(or both). I am pretty sure this thread can turn out to be quite useful
as a reference (it is for me); if we manage to summarize it, that would
benefit everyone.

Lorenzo
Catalin Marinas Oct. 19, 2023, 11:07 a.m. UTC | #28
On Fri, Oct 13, 2023 at 10:45:41AM -0300, Jason Gunthorpe wrote:
> On Fri, Oct 13, 2023 at 02:08:10PM +0100, Catalin Marinas wrote:
> > On Fri, Oct 13, 2023 at 10:29:35AM +0100, Will Deacon wrote:
> > > On Thu, Oct 12, 2023 at 06:26:01PM +0100, Catalin Marinas wrote:
> > > > On Thu, Oct 12, 2023 at 03:48:08PM +0100, Will Deacon wrote:
> > > > > Claiming back the device also seems strange if the guest has been using
> > > > > non-cacheable accesses since I think you could get write merging and
> > > > > reordering with subsequent device accesses trying to reset the device.
> > > > 
> > > > True. Not sure we have a good story here (maybe reinvent the DWB barrier ;)).
> > > 
> > > We do have a good story for this part: use Device-nGnRE!
> > 
> > Don't we actually need Device-nGnRnE for this, coupled with a DSB for
> > endpoint completion?
> > 
> > Device-nGnRE may be sufficient as a read from that device would ensure
> > that the previous write is observable (potentially with a DMB if
> > accessing separate device regions) but I don't think we do this now
> > either. Even this, isn't it device-specific? I don't know enough about
> > PCIe, posted writes, reordering, maybe others can shed some light.
> > 
> > For Normal NC, if the access doesn't have side-effects (or rather the
> > endpoint is memory-like), I think we are fine. The Stage 2 unmapping +
> > TLBI + DSB (DVM + DVMSync) should ensure that a pending write by the CPU
> > was pushed sufficiently far as not to affect subsequent writes by other
> > CPUs.
> > 
> > For I/O accesses that change some state of the device, I'm not sure the
> > TLBI+DSB is sufficient. But I don't think Device nGnRE is either, only
> > nE + DSB as long as the PCIe device plays along nicely.
> 
> Can someone explain this concern a little more simply please?
> 
> Let's try something simpler. I have no KVM. My kernel driver 
> creates a VMA with pgprot_writecombine (NormalNC). Userpsace does a
> write to the NormalNC and immediately unmaps the VMA
> 
> What is the issue?

The issue is: when the device is reclaimed to be given to another
user-space process, do we know the previous transaction reached the
device? With the TLBI + DSB in unmap, we can only tell that a subsequent
map + write (in a new process) is ordered after the write in the old
process. Maybe that's sufficient in most cases.

> And then how does making KVM the thing that creates the NormalNC
> change this?

They are similar.

> Not knowing the whole details, here is my story about how it should work:
> 
> Unmapping the VMA's must already have some NormalNC friendly ordering
> barrier across all CPUs or we have a bigger problem. This barrier
> definately must close write combining.

I think what we have is that the TLBI + DSB generating the DVM/DVMSync
messages should ensure that the Normal NC writes on other CPUs reach some
serialisation point (not necessarily the device, AFAICT we can't
guarantee end-point completion here).
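
(For reference, the sequence in question is roughly the broadcast
invalidation done on stage-2 unmap - a simplified sketch from memory,
the real code lives under arch/arm64/kvm/hyp/:)

          /* after clearing the stage-2 PTE for the IPA ...              */
          dsb(ishst);                   /* PTE update visible to walkers */
          __tlbi(ipas2e1is, ipa >> 12); /* broadcast S2 invalidate (DVM) */
          dsb(ish);                     /* wait for it (DVMSync)         */
          __tlbi(vmalle1is);            /* drop combined S1+S2 entries   */
          dsb(ish);
          isb();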

> VFIO issues a config space write to reset the PCI function.  Config
> space writes MUST NOT write combine with anything. This is already
> impossible for PCIe since they are different TLP types at the PCIe
> level.

Yes, config space writes are fine, vfio-pci even maps them as
Device_nGnRnE. But AFAIK a guest won't have direct access to the config
space.

> By the PCIe rules, config space write must order strictly after all
> other CPU's accesses. Once the reset non-posted write returns back to
> VFIO we know that:
> 
>  1) There is no reference in any CPU page table to the MMIO PFN
>  2) No CPU has pending data in any write buffer
>  3) The interconnect and PCIe fabric have no inflight operations
>  4) The device is in a clean post-reset state

I think from the CPU perspective, we can guarantee that a Normal_NC
write on CPU0 for example reaches a serialisation point before a config
space (Device_nGnRnE) write on CPU1 by the host as long as CPU1 issued a
TLBI+DSB. Now, what I'm not sure is what this serialisation point is. If
it is the PCIe root complex, we are probably fine, we hope it deals with
any ordering between the Normal_NC write and the config space one.

Talking to Will earlier, I think we can deem the PCIe scenario
(somewhat) safe but not as a generic mechanism for other non-PCIe
devices (e.g. platform). With this concern, can we make this Stage 2
relaxation in KVM only for vfio-pci mappings? I don't have an example of
non-PCIe device assignment to figure out how this should work though.

> > knows all the details. The safest is for the VMM to keep it as Device (I
> > think vfio-pci goes for the strongest nGnRnE).
> 
> We are probably going to allow VFIO to let userspace pick if it should
> be pgprot_device or pgprot_writecombine.

I guess that's for the direct use by an application rather than VMM+VM.
IIUC people work around this currently by mapping PCIe BARs as
pgprot_writecombine() via sysfs. Getting vfio-pci to allow different
mappings is probably a good idea, though it doesn't currently help with
the KVM case as we can't force the VMM to know the specifics of the
device it is giving to a guest.
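
(For completeness, the sysfs workaround mentioned above is the
resourceN_wc file, roughly like this from userspace - the device path
below is made up:)

  #include <fcntl.h>
  #include <stdint.h>
  #include <sys/mman.h>
  #include <unistd.h>

  /* Map BAR0 of a (made-up) device write-combined via sysfs, assuming
   * the platform exposes the _wc variant of the resource file.        */
  static volatile uint32_t *map_bar0_wc(size_t len)
  {
          int fd = open("/sys/bus/pci/devices/0000:01:00.0/resource0_wc",
                        O_RDWR);
          void *p;

          if (fd < 0)
                  return NULL;
          p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
          close(fd);                    /* the mapping survives the close */
          return p == MAP_FAILED ? NULL : (volatile uint32_t *)p;
  }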

> The alias issue could be resolved by teaching KVM how to insert a
> physical PFN based on some VFIO FD/dmabuf rather than a VMA so that
> the PFNs are never mapped in the hypervisor side.

This should work as well and solves the aliasing problem, though it
requires changes to the VMM as well, not just KVM, which currently
relies on vfio-pci mmap().
Catalin Marinas Oct. 19, 2023, 11:12 a.m. UTC | #29
On Fri, Oct 13, 2023 at 05:28:10PM +0200, Lorenzo Pieralisi wrote:
> On Fri, Oct 13, 2023 at 02:08:10PM +0100, Catalin Marinas wrote:
> 
> [...]
> 
> > Yes, we end up with mismatched aliases but they only matter if the VMM
> > also accesses the I/O range via its own mapping. So far I haven't seen
> > case that suggests this.
> > 
> > > > Things can go wrong but that's not because Device does anything better.
> > > > Given the RAS implementation, external aborts caused on Device memory
> > > > (e.g. wrong size access) is uncontainable. For Normal NC it can be
> > > > contained (I can dig out the reasoning behind this if you want, IIUC
> > > > something to do with not being able to cancel an already issued Device
> > > > access since such accesses don't allow speculation due to side-effects;
> > > > for Normal NC, it's just about the software not getting the data).
> > > 
> > > I really think these details belong in the commit message.
> > 
> > I guess another task for Lorenzo ;).
> 
> I will do, I start wondering though whether this documentation belongs
> in this commit log only or at Documentation/arch/arm64 level (or both),
> I am pretty sure this thread can turn out quite useful as a reference
> (it is for me) if we manage to summarize it that would benefit
> everyone.

I think it makes sense to add something in the Documentation for easier
future reference. We can also keep adding to it as we learn more. The
commit log can be shorter in this case.
Jason Gunthorpe Oct. 19, 2023, 11:51 a.m. UTC | #30
On Thu, Oct 19, 2023 at 12:07:42PM +0100, Catalin Marinas wrote:

> The issue is when the device is reclaimed to be given to another
> user-space process, do we know the previous transaction reached the
> device? With the TLBI + DSB in unmap, we can only tell that a subsequent
> map + write (in a new process) is ordered after the write in the old
> process. Maybe that's sufficient in most cases.

I can't think of a case where ordering is insufficient. As long as the
device perceives a strict separation in order between writes preceding
the 'reset' writel() and writes following the reset writel(), the
absolute time of completion cannot matter.

At least for VFIO PCI the reset sequence uses non-posted config TLPs
followed by config read TLPs.

For something like mlx5, when it reclaims a WC VMA it does an RPC which
is dma_wmb(), a writel(), a device DMA read, a device DMA write and a
CPU read of the DMA'd data.

Both of these are pretty "firm" ordering barriers.
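
The second one, spelled out as a sketch (the structure and its fields
are hypothetical; dma_wmb()/writel()/READ_ONCE() are the real
primitives):

  #include <linux/io.h>
  #include <linux/processor.h>

  struct hypo_cmd_if {
          void __iomem *db_reg;         /* doorbell in the WC/NormalNC BAR */
          u32 *done;                    /* DMA-coherent completion flag    */
  };

  static void hypo_cmd_exec(struct hypo_cmd_if *cmd, u32 idx)
  {
          dma_wmb();                    /* command visible before doorbell */
          writel(idx, cmd->db_reg);     /* posted MMIO write kicks the HW  */

          /* the device DMA-reads the command and DMA-writes 'done' */
          while (!READ_ONCE(*cmd->done))
                  cpu_relax();
  }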

> > Unmapping the VMA's must already have some NormalNC friendly ordering
> > barrier across all CPUs or we have a bigger problem. This barrier
> > definately must close write combining.
> 
> I think what we have is TLBI + DSB generating the DVM/DVMSync messages
> should ensure that the Normal NC writes on other CPUs reach some
> serialisation point (not necessarily the device, AFAICT we can't
> guarantee end-point completion here).

This is what our internal architects thought as well.

> Talking to Will earlier, I think we can deem the PCIe scenario
> (somewhat) safe but not as a generic mechanism for other non-PCIe
> devices (e.g. platform). With this concern, can we make this Stage 2
> relaxation in KVM only for vfio-pci mappings? I don't have an example of
> non-PCIe device assignment to figure out how this should work though.

It is not a KVM problem. As I implied above it is VFIO's
responsibility to reliably reset the device, not KVM's. If for some
reason VFIO cannot do that on some SOC then VFIO devices should not
exist.

It is not KVM's job to double guess VFIO's own security properties.

Specifically about platform devices: the generic VFIO platform driver is
the ACPI-based one. If the FW provides an ACPI method for device reset
that is not properly serializing, that is a FW bug. We can quirk it in
VFIO and block using those devices if they actually exist.

I expect the non-generic VFIO platform drivers to take care of this
issue internally with barriers, reads from devices, whatever is
required to make their SOCs order properly, just as I would expect a
normal Linux platform driver to directly manage whatever
implementation specific ordering quirks the SOC may have.

Generic PCI drivers are the ones that rely on the implicit ordering at
the PCIe TLP level from TLBI + DSB. I would say this is part of the
undocumented arch API around pgprot_wc.

> > > knows all the details. The safest is for the VMM to keep it as Device (I
> > > think vfio-pci goes for the strongest nGnRnE).
> > 
> > We are probably going to allow VFIO to let userspace pick if it should
> > be pgprot_device or pgprot_writecombine.
> 
> I guess that's for the direct use by an application rather than
> VMM+VM.

Yes, there have been requests for this.

> IIUC people work around this currently by mapping PCIe BARs as
> pgprot_writecombine() via sysfs. Getting vfio-pci to allow different
> mappings is probably a good idea, though it doesn't currently help with
> the KVM case as we can't force the VMM to know the specifics of the
> device it is giving to a guest.

Yes to all points

Thanks,
Jason
Lorenzo Pieralisi Oct. 19, 2023, 1:35 p.m. UTC | #31
On Thu, Oct 19, 2023 at 12:07:42PM +0100, Catalin Marinas wrote:

[...]

> > VFIO issues a config space write to reset the PCI function.  Config
> > space writes MUST NOT write combine with anything. This is already
> > impossible for PCIe since they are different TLP types at the PCIe
> > level.
> 
> Yes, config space writes are fine, vfio-pci even maps them as
> Device_nGnRnE. But AFAIK a guest won't have direct access to the config
> space.
> 
> > By the PCIe rules, config space write must order strictly after all
> > other CPU's accesses. Once the reset non-posted write returns back to
> > VFIO we know that:
> > 
> >  1) There is no reference in any CPU page table to the MMIO PFN
> >  2) No CPU has pending data in any write buffer
> >  3) The interconnect and PCIe fabric have no inflight operations
> >  4) The device is in a clean post-reset state
> 
> I think from the CPU perspective, we can guarantee that a Normal_NC
> write on CPU0 for example reaches a serialisation point before a config
> space (Device_nGnRnE) write on CPU1 by the host as long as CPU1 issued a
> TLBI+DSB. Now, what I'm not sure is what this serialisation point is. If
> it is the PCIe root complex, we are probably fine, we hope it deals with
> any ordering between the Normal_NC write and the config space one.

If it is the PCI host bridge (and for PCI it should be since it is the
logic between the ARM world - where ARM ordering rules and barriers
apply - and PCI protocol), either it enforces PCI ordering rules or it
is broken by design; if it is the latter, at that stage device
assignment would be the least of our problems.

For non-PCI device assignment, I am not sure at all that we can rely on
anything other than what Jason mentioned: resets (and the components
that carry them out, eg through MMIO) are not architected; the
device MMIO space and the MMIO space used to trigger the reset (again,
it is an example) may well be placed on different interconnect paths.
It is device specific.

Lorenzo

> Talking to Will earlier, I think we can deem the PCIe scenario
> (somewhat) safe but not as a generic mechanism for other non-PCIe
> devices (e.g. platform). With this concern, can we make this Stage 2
> relaxation in KVM only for vfio-pci mappings? I don't have an example of
> non-PCIe device assignment to figure out how this should work though.
> 
> > > knows all the details. The safest is for the VMM to keep it as Device (I
> > > think vfio-pci goes for the strongest nGnRnE).
> > 
> > We are probably going to allow VFIO to let userspace pick if it should
> > be pgprot_device or pgprot_writecombine.
> 
> I guess that's for the direct use by an application rather than VMM+VM.
> IIUC people work around this currently by mapping PCIe BARs as
> pgprot_writecombine() via sysfs. Getting vfio-pci to allow different
> mappings is probably a good idea, though it doesn't currently help with
> the KVM case as we can't force the VMM to know the specifics of the
> device it is giving to a guest.
> 
> > The alias issue could be resolved by teaching KVM how to insert a
> > physical PFN based on some VFIO FD/dmabuf rather than a VMA so that
> > the PFNs are never mapped in the hypervisor side.
> 
> This should work as well and solves the aliasing problem, though it
> requires changes to the VMM as well, not just KVM, which currently
> relies on vfio-pci mmap().
> 
> -- 
> Catalin
Catalin Marinas Oct. 20, 2023, 11:21 a.m. UTC | #32
On Thu, Oct 19, 2023 at 08:51:42AM -0300, Jason Gunthorpe wrote:
> On Thu, Oct 19, 2023 at 12:07:42PM +0100, Catalin Marinas wrote:
> > Talking to Will earlier, I think we can deem the PCIe scenario
> > (somewhat) safe but not as a generic mechanism for other non-PCIe
> > devices (e.g. platform). With this concern, can we make this Stage 2
> > relaxation in KVM only for vfio-pci mappings? I don't have an example of
> > non-PCIe device assignment to figure out how this should work though.
> 
> It is not a KVM problem. As I implied above it is VFIO's
> responsibility to reliably reset the device, not KVMs. If for some
> reason VFIO cannot do that on some SOC then VFIO devices should not
> exist.
> 
> It is not KVM's job to double guess VFIO's own security properties.

I'd argue that since KVM is the one relaxing the memory attributes
beyond what the VFIO driver allows the VMM to use, it is KVM's job to
consider the security implications. This is fine for vfio-pci and
Normal_NC but I'm not sure we can generalise.

> Specifically about platform the generic VFIO platform driver is the
> ACPI based one. If the FW provides an ACPI method for device reset
> that is not properly serializing, that is a FW bug. We can quirk it in
> VFIO and block using those devices if they actually exist.
> 
> I expect the non-generic VFIO platform drivers to take care of this
> issue internally with, barriers, read from devices, whatver is
> required to make their SOCs order properly. Just as I would expect a
> normal Linux platform driver to directly manage whatever
> implementation specific ordering quirks the SOC may have.

This would be a new requirement if an existing VFIO platform driver
relied on all mappings being Device. But maybe that's just theoretical
at the moment; are there any concrete examples outside vfio-pci? If not,
we can document it as per Lorenzo's suggestion to summarise this
discussion under Documentation/.
Jason Gunthorpe Oct. 20, 2023, 11:47 a.m. UTC | #33
On Fri, Oct 20, 2023 at 12:21:50PM +0100, Catalin Marinas wrote:
> On Thu, Oct 19, 2023 at 08:51:42AM -0300, Jason Gunthorpe wrote:
> > On Thu, Oct 19, 2023 at 12:07:42PM +0100, Catalin Marinas wrote:
> > > Talking to Will earlier, I think we can deem the PCIe scenario
> > > (somewhat) safe but not as a generic mechanism for other non-PCIe
> > > devices (e.g. platform). With this concern, can we make this Stage 2
> > > relaxation in KVM only for vfio-pci mappings? I don't have an example of
> > > non-PCIe device assignment to figure out how this should work though.
> > 
> > It is not a KVM problem. As I implied above it is VFIO's
> > responsibility to reliably reset the device, not KVMs. If for some
> > reason VFIO cannot do that on some SOC then VFIO devices should not
> > exist.
> > 
> > It is not KVM's job to double guess VFIO's own security properties.
> 
> I'd argue that since KVM is the one relaxing the memory attributes
> beyond what the VFIO driver allows the VMM to use, it is KVM's job to
> consider the security implications. This is fine for vfio-pci and
> Normal_NC but I'm not sure we can generalise.

I can see that, but I believe we should take this responsibility into
VFIO as a requirement. As I said in the other email we do want to
extend VFIO to support NormalNC VMAs for DPDK, so we need to take this
anyhow.

> > Specifically about platform the generic VFIO platform driver is the
> > ACPI based one. If the FW provides an ACPI method for device reset
> > that is not properly serializing, that is a FW bug. We can quirk it in
> > VFIO and block using those devices if they actually exist.
> > 
> > I expect the non-generic VFIO platform drivers to take care of this
> > issue internally with, barriers, read from devices, whatver is
> > required to make their SOCs order properly. Just as I would expect a
> > normal Linux platform driver to directly manage whatever
> > implementation specific ordering quirks the SOC may have.
> 
> This would be a new requirement if an existing VFIO platform driver
> relied on all mappings being Device. But maybe that's just theoretical
> at the moment, are there any concrete examples outside vfio-pci? If not,
> we can document it as per Lorenzo's suggestion to summarise this
> discussion under Documentation/.

My point is if this becomes a real world concern we have a solid
answer on how to resolve it - fix the VFIO driver to have a stronger
barrier before reset.

I'm confident it is not a problem for PCI and IIRC the remaining ARM
platform drivers were made primarily for DPDK, not KVM.

So I agree with documenting and perhaps a comment someplace in VFIO is
also warranted.

Jason
Lorenzo Pieralisi Oct. 20, 2023, 2:03 p.m. UTC | #34
On Fri, Oct 20, 2023 at 08:47:19AM -0300, Jason Gunthorpe wrote:
> On Fri, Oct 20, 2023 at 12:21:50PM +0100, Catalin Marinas wrote:
> > On Thu, Oct 19, 2023 at 08:51:42AM -0300, Jason Gunthorpe wrote:
> > > On Thu, Oct 19, 2023 at 12:07:42PM +0100, Catalin Marinas wrote:
> > > > Talking to Will earlier, I think we can deem the PCIe scenario
> > > > (somewhat) safe but not as a generic mechanism for other non-PCIe
> > > > devices (e.g. platform). With this concern, can we make this Stage 2
> > > > relaxation in KVM only for vfio-pci mappings? I don't have an example of
> > > > non-PCIe device assignment to figure out how this should work though.
> > > 
> > > It is not a KVM problem. As I implied above it is VFIO's
> > > responsibility to reliably reset the device, not KVMs. If for some
> > > reason VFIO cannot do that on some SOC then VFIO devices should not
> > > exist.
> > > 
> > > It is not KVM's job to double guess VFIO's own security properties.
> > 
> > I'd argue that since KVM is the one relaxing the memory attributes
> > beyond what the VFIO driver allows the VMM to use, it is KVM's job to
> > consider the security implications. This is fine for vfio-pci and
> > Normal_NC but I'm not sure we can generalise.
> 
> I can see that, but I belive we should take this responsibility into
> VFIO as a requirement. As I said in the other email we do want to
> extend VFIO to support NormalNC VMAs for DPDK, so we need to take this
> anyhow.
> 
> > > Specifically about platform the generic VFIO platform driver is the
> > > ACPI based one. If the FW provides an ACPI method for device reset
> > > that is not properly serializing, that is a FW bug. We can quirk it in
> > > VFIO and block using those devices if they actually exist.
> > > 
> > > I expect the non-generic VFIO platform drivers to take care of this
> > > issue internally with, barriers, read from devices, whatver is
> > > required to make their SOCs order properly. Just as I would expect a
> > > normal Linux platform driver to directly manage whatever
> > > implementation specific ordering quirks the SOC may have.
> > 
> > This would be a new requirement if an existing VFIO platform driver
> > relied on all mappings being Device. But maybe that's just theoretical
> > at the moment, are there any concrete examples outside vfio-pci? If not,
> > we can document it as per Lorenzo's suggestion to summarise this
> > discussion under Documentation/.
> 
> My point is if this becomes a real world concern we have a solid
> answer on how to resolve it - fix the VFIO driver to have a stronger
> barrier before reset.

Just to make sure I am parsing this correctly: this case above is
related to a non-PCI VFIO device passthrough where a guest would want to
map the device MMIO at stage-1 with normal-NC memory type (well, let's
say with a memory attribute != device-nGnRE - that combined with the new
stage-2 default might cause transactions ordering/grouping trouble with
eg device resets), correct ? IIRC, all requests related to honouring
"write-combine" style stage-1 mappings were for PCI(e) devices but
that's as far as what *I* was made aware of goes.

> I'm confident it is not a problem for PCI and IIRC the remaining ARM
> platform drivers were made primarily for DPDK, not KVM.
> 
> So I agree with documenting and perhaps a comment someplace in VFIO is
> also warranted.

We will do that; I will start adding the recent discussions to the
new documentation file. Side note: for those who attend LPC, it would be
useful to review the resulting documentation together there; it should
happen around v6.7-rc1.

Lorenzo
Jason Gunthorpe Oct. 20, 2023, 2:28 p.m. UTC | #35
On Fri, Oct 20, 2023 at 04:03:57PM +0200, Lorenzo Pieralisi wrote:
> > My point is if this becomes a real world concern we have a solid
> > answer on how to resolve it - fix the VFIO driver to have a stronger
> > barrier before reset.
> 
> Just to make sure I am parsing this correctly: this case above is
> related to a non-PCI VFIO device passthrough where a guest would want to
> map the device MMIO at stage-1 with normal-NC memory type (well, let's
> say with a memory attribute != device-nGnRE - that combined with the new
> stage-2 default might cause transactions ordering/grouping trouble with
> eg device resets), correct ? 

This is what I have understood was Will's concern, yes.

> IIRC, all requests related to honouring "write-combine" style
> stage-1 mappings were for PCI(e) devices but that's as far as what
> *I* was made aware of goes.

Yes, this is what I am aware of as well.

Though I do not object to the idea from the VFIO side that platform
devices would have to support NormalNC too.

The theoretical missing piece is that someone would say they have a
SOC issue XYZ and thus their VFIO platform devices must fully block
NormalNC. I suggest if someone comes with this and really, really
wants VFIO, then we could use a VMA flag to indicate that KVM must not
upgrade it. Currently I have no knowledge of such a thing existing.

With PCI we've made the argument that if NormalNC is unsafe for
KVM in the SOC then probably so is Device-*. I think the same basic
argument holds for platform devices too. Thus I'm skeptical that
someone can come and say they have SOC issue XYZ and NormalNC is
broken but Device-* is perfectly safe.

> We will do that, I will start adding the recent discussions to the
> new documentation file. Side note: for those who attend LPC it would be
> useful to review the resulting documentation together there, it should
> happen around v6.7-rc1.

I will be there, let me know

Thanks,
Jason
Lorenzo Pieralisi Nov. 9, 2023, 3:34 p.m. UTC | #36
On Fri, Oct 13, 2023 at 02:08:10PM +0100, Catalin Marinas wrote:

[...]

> > > Things can go wrong but that's not because Device does anything better.
> > > Given the RAS implementation, external aborts caused on Device memory
> > > (e.g. wrong size access) is uncontainable. For Normal NC it can be
> > > contained (I can dig out the reasoning behind this if you want, IIUC
> > > something to do with not being able to cancel an already issued Device
> > > access since such accesses don't allow speculation due to side-effects;
> > > for Normal NC, it's just about the software not getting the data).
> > 
> > I really think these details belong in the commit message.
> 
> I guess another task for Lorenzo ;).

Updated commit log (it might be [is] too verbose) below, it should probably
be moved into a documentation file but to do that I should decouple
it from this changeset (ie a document explaining memory attributes
and error containment for ARM64 - independent from KVM S2 defaults).

I'd also add a Link: to the lore archive entry for reference (did not
add it in the log below).

Please let me know what's the best option here.

Thanks,
Lorenzo

-- >8 --
Currently, KVM for ARM64 maps at stage 2 memory that is
considered device (ie it is not RAM) with DEVICE_nGnRE
memory attributes; this setting overrides (as per the ARM
architecture [1]) any device MMIO mapping present at stage
1, resulting in a set-up whereby a guest operating system
can't determine device MMIO mapping memory attributes on its
own; they are always overridden by the KVM stage 2 default.

This set-up does not allow guest operating systems to select
device memory attributes on a page by page basis independently
from KVM stage-2 mappings (refer to [1], "Combining stage 1 and stage
2 memory type attributes"), which turns out to be an issue in that
guest operating systems (eg Linux) may request to map
devices MMIO regions with memory attributes that guarantee
better performance (eg gathering attribute - that for some
devices can generate larger PCIe memory writes TLPs)
and specific operations (eg unaligned transactions) such as
the NormalNC memory type.

The default device stage 2 mapping was chosen in KVM
for ARM64 since it was considered safer (ie it would
not allow guests to trigger uncontained failures
ultimately crashing the machine) but this turned out
to be imprecise.

Failure containability is a property of the platform
and is independent of the memory type used for MMIO
device memory mappings.

Actually, DEVICE_nGnRE memory type is even more problematic
than eg the Normal-NC memory type in terms of fault containability
in that eg aborts triggered on DEVICE_nGnRE loads cannot be made,
architecturally, synchronous (ie that would imply that the processor
should issue at most 1 load transaction at a time - ie it can't pipeline
them - otherwise the synchronous abort semantics would break the
no-speculation attribute attached to DEVICE_XXX memory).

This means that regardless of the combined stage1+stage2 mappings a
platform is safe if and only if device transactions cannot trigger
uncontained failures and that in turn relies on platform
capabilities and the device type being assigned (ie PCIe AER/DPC
error containment and RAS architecture[3]); therefore the default
KVM device stage 2 memory attributes play no role in making device
assignment safer for a given platform (if the platform design
adheres to design guidelines outlined in [3]) and therefore can
be relaxed.

For all these reasons, relax the KVM stage 2 device
memory attributes from DEVICE_nGnRE to Normal-NC.

A different Normal memory type default at stage-2
(eg Normal Write-through) could have been chosen
instead of Normal-NC but Normal-NC was chosen
because:

- Its attributes are well-understood compared to
  other Normal memory types for MMIO mappings
- On systems implementing FEAT_S2FWB, it is the only sensible default
  among the Normal memory types; per [1] D8.5.5 (S2 MemAttr[3:0], where
  S2 = stage-2), the options to choose the memory types are as follows:

  if S2 MemAttr[2] == 0, the mapping defaults to DEVICE_XXX
  (XXX determined by S2 MemAttr[1:0]). This is what we have
  today (MemAttr[2:0] == 0b001) and therefore it is not described
  any further.

  if S2 MemAttr[2] == 1, there are the following options:
 
  S2 MemAttr[2:0] | Resulting mapping
  -----------------------------------------------------------------------------
  0b101           | Prevent the guest from selecting a cacheable memory type;
		  | it allows it to choose Device-* or Normal-NC
  0b110           | It forces write-back memory type; it breaks MMIO.
  0b111           | Allow the VM to select any memory type including cacheable.
		  | It is unclear whether this is safe from a platform
		  | perspective, especially wrt uncontained failures and
		  | cacheability (eg device reclaim/reset and cache
		  | maintenance).
  ------------------------------------------------------------------------------

  - For !FEAT_S2FWB systems, it is logical to choose a default S2 mapping
    identical to FEAT_S2FWB (that basically would force Normal-NC, see
    option S2 MemAttr[2:0] == 0b101 above), to guarantee a coherent approach
    between the two.

Relaxing S2 KVM device MMIO mappings to Normal-NC is not expected to
trigger any issue on guest device reclaim use cases either (ie device
MMIO unmap followed by a device reset) at least for PCIe devices, in that
in PCIe a device reset is architected and carried out through PCI config
space transactions that are naturally ordered wrt MMIO transactions
according to the PCI ordering rules.

Having Normal-NC S2 default puts guests in control (thanks to
stage1+stage2 combined memory attributes rules [1]) of device MMIO
regions memory mappings, according to the rules described in [1]
and summarized here ([(S1) - stage1], [(S2) - stage 2]):

S1	     |  S2	     | Result
NORMAL-WB    |  NORMAL-NC    | NORMAL-NC
NORMAL-WT    |  NORMAL-NC    | NORMAL-NC
NORMAL-NC    |  NORMAL-NC    | NORMAL-NC
DEVICE<attr> |  NORMAL-NC    | DEVICE<attr>

It is worth noting that currently, to map device MMIO space to user
space in a device pass-through use case, the VFIO framework applies memory
attributes derived from pgprot_noncached() settings applied to VMAs, which
results in Device-nGnRnE memory attributes for the stage-1 VMM mappings.

This means that a userspace mapping for device MMIO space carried
out with the current VFIO framework and a guest OS mapping for the same
MMIO space may result in a mismatched alias as described in [2].

Defaulting KVM device stage-2 mappings to Normal-NC attributes does not change
anything in this respect, in that the mismatched aliases would only affect
(refer to [2] for a detailed explanation) the ordering between the userspace
and GuestOS mappings' resulting streams of transactions (ie it does not cause
loss of properties for either stream of transactions on its own), which is
harmless given that the userspace and GuestOS accesses to the device are
carried out through independent transaction streams.

[1] section D8.5 - DDI0487_I_a_a-profile_architecture_reference_manual.pdf
[2] section B2.8 - DDI0487_I_a_a-profile_architecture_reference_manual.pdf
[3] sections 1.7.7.3/1.8.5.2/appendix C - DEN0029H_SBSA_7.1.pdf
Jason Gunthorpe Nov. 10, 2023, 2:26 p.m. UTC | #37
On Thu, Nov 09, 2023 at 04:34:10PM +0100, Lorenzo Pieralisi wrote:

> Relaxing S2 KVM device MMIO mappings to Normal-NC is not expected to
> trigger any issue on guest device reclaim use cases either (ie device
> MMIO unmap followed by a device reset) at least for PCIe devices, in that
> in PCIe a device reset is architected and carried out through PCI config
> space transactions that are naturally ordered wrt MMIO transactions
> according to the PCI ordering rules.

This is not how I see that thread concluding...

The device reclaim problem belongs solely to VFIO, not KVM. VFIO must
ensure global ordering between accesses before the VMA is unmapped and
accesses after; that includes ordering whatever mechanism the VFIO driver
uses for reset.

If there are quirky SoCs, or non-PCI devices, that need something
stronger than the TLBI/etc sequence, it should be fixed in VFIO (or
maybe even the arch code), not by blocking NORMAL_NC in KVM. Such
a quirky SoC would broadly have security issues beyond KVM.
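
As a sketch of that ordering requirement (hypothetical structure and helper
names, not actual VFIO internals), a reclaim path would roughly be:

#include <linux/io.h>
#include <linux/mm.h>
#include <linux/pci.h>

struct my_vfio_dev {				/* hypothetical container */
	struct address_space *mapping;		/* address_space backing the BAR mmaps */
	void __iomem *bar_va;			/* kernel mapping of the BAR */
	resource_size_t bar_size;
	struct pci_dev *pdev;
};

static void reclaim_device_bar(struct my_vfio_dev *vdev)
{
	/* 1. Tear down all userspace/S2 mappings of the BAR (implies TLB invalidation). */
	unmap_mapping_range(vdev->mapping, 0, vdev->bar_size, true);

	/* 2. Read back from the device so previously posted MMIO writes are flushed. */
	(void)readl(vdev->bar_va);

	/* 3. Only then issue the reset, via config space for PCIe (ordered per PCIe rules). */
	pci_reset_function(vdev->pdev);
}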

> It is worth noting that currently, to map devices MMIO space to user
> space in a device pass-through use case the VFIO framework applies memory
> attributes derived from pgprot_noncached() settings applied to VMAs, which

Sometimes. VFIO uses a mix of pgprot_noncached and pgprot_device. AFAIK
we should change it to always use pgprot_device.
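
For reference, the difference between the two on arm64 is one memory-attribute
index (paraphrased from arch/arm64/include/asm/pgtable.h; approximate, check
the tree for the exact definitions):

/* pgprot_noncached() selects Device-nGnRnE (no early write acknowledgement),
 * while pgprot_device() selects Device-nGnRE (early write ack allowed). */
#define pgprot_noncached(prot) \
	__pgprot_modify(prot, PTE_ATTRINDX_MASK, \
			PTE_ATTRINDX(MT_DEVICE_nGnRnE) | PTE_PXN | PTE_UXN)
#define pgprot_device(prot) \
	__pgprot_modify(prot, PTE_ATTRINDX_MASK, \
			PTE_ATTRINDX(MT_DEVICE_nGnRE) | PTE_PXN | PTE_UXN)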

Thanks,
Jason
Lorenzo Pieralisi Nov. 13, 2023, 12:42 a.m. UTC | #38
On Fri, Nov 10, 2023 at 10:26:49AM -0400, Jason Gunthorpe wrote:
> On Thu, Nov 09, 2023 at 04:34:10PM +0100, Lorenzo Pieralisi wrote:
> 
> > Relaxing S2 KVM device MMIO mappings to Normal-NC is not expected to
> > trigger any issue on guest device reclaim use cases either (ie device
> > MMIO unmap followed by a device reset) at least for PCIe devices, in that
> > in PCIe a device reset is architected and carried out through PCI config
> > space transactions that are naturally ordered wrt MMIO transactions
> > according to the PCI ordering rules.
> 
> This is not how I see that thread concluding..
> 
> The device reclaim problem belongs solely to VFIO, not KVM. VFIO must
> ensure global ordering of access before the VMA is unmaped and access
> after, that includes ordering whatever mechanism the VFIO driver uses
> for reset.
> 
> If there are quirky SOCs, or non-PCI devices that need something
> stronger than the TLBI/etc sequence it should be fixed in VFIO (or
> maybe even the arch code), not by blocking NORMAL_NC in the KVM. Such
> a quirky SOC would broadly have security issues beyond KVM.

https://lore.kernel.org/linux-arm-kernel/20231013092934.GA13524@willie-the-truck

I think that Will's point _was_ related to the change we are making to
KVM S2 mappings and the related device transactions on device reclaim (ie
reset). What I tried to convey (I probably simplified too much) is that,
for PCI at least, this should not trigger any regression/issue, in that
BAR MMIO and reset transactions are decoupled streams and must follow the
PCI ordering rules.

Yes, it is VFIO's responsibility, but changing the S2 KVM mappings
may have (for non-PCI devices) side effects compared to what
we have today. I am not saying this should be a blocker; I just
summarized the thread above, and the paragraph can be expanded.

Lorenzo

> 
> > It is worth noting that currently, to map devices MMIO space to user
> > space in a device pass-through use case the VFIO framework applies memory
> > attributes derived from pgprot_noncached() settings applied to VMAs, which
> 
> Sometimes. VFIO uses a mix of pgprot_noncached and pgprot_device. AFAIK
> we should change to to always use pgprot_device..
> 
> Thanks,
> Jason
Catalin Marinas Nov. 13, 2023, 5:41 p.m. UTC | #39
Hi Lorenzo,

Thanks for putting this together. It looks fine to me, just some general
comments below.

On Thu, Nov 09, 2023 at 04:34:10PM +0100, Lorenzo Pieralisi wrote:
> Updated commit log (it might be [is] too verbose) below, it should probably
> be moved into a documentation file but to do that I should decouple
> it from this changeset (ie a document explaining memory attributes
> and error containment for ARM64 - indipendent from KVM S2 defaults).
> 
> I'd also add a Link: to the lore archive entry for reference (did not
> add it in the log below).
> 
> Please let me know what's the best option here.

It would be good to have this under Documentation/arch/arm64/, though as
a document at the end of the series describing what the new behaviour
is and what the kernel/KVM expectations are (after the patches have been
applied). We could even submit it separately once we agree on what the
series looks like.

> -- >8 --
> Currently, KVM for ARM64 maps at stage 2 memory that is
> considered device (ie it is not RAM) with DEVICE_nGnRE
> memory attributes; this setting overrides (as per the ARM
> architecture [1]) any device MMIO mapping present at stage
> 1, resulting in a set-up whereby a guest operating system
> can't determine device MMIO mapping memory attributes on its
> own but it is always overriden by the KVM stage 2 default.

To be fully correct, stage 1 and stage 2 attributes combine in a way
that results in the more restrictive properties, so not a simple
override. But to keep things simple, especially if this part only goes
in the commit log, leave it as is.

> This set-up does not allow guest operating systems to select
> device memory attributes on a page by page basis independently

Not sure it's even worth specifying "page by page". It just doesn't
allow the guest to relax the device mapping memory attributes at all.

> from KVM stage-2 mappings (refer to [1], "Combining stage 1 and stage
> 2 memory type attributes"), which turns out to be an issue in that
> guest operating systems (eg Linux) may request to map
> devices MMIO regions with memory attributes that guarantee
> better performance (eg gathering attribute - that for some
> devices can generate larger PCIe memory writes TLPs)
> and specific operations (eg unaligned transactions) such as
> the NormalNC memory type.

Another case is correct guest behaviour, where the guest would not expect
unaligned accesses to fault on Normal NC mappings.

> The default device stage 2 mapping was chosen in KVM
> for ARM64 since it was considered safer (ie it would
> not allow guests to trigger uncontained failures
> ultimately crashing the machine) but this turned out
> to be imprecise.

s/imprecise/asynchronous (SError)/

Another reason was probably that it matches the default VFIO mapping
(though the latter is slightly stricter as in Device_nGnRnE).

> Failures containability is a property of the platform
> and is independent from the memory type used for MMIO
> device memory mappings.
> 
> Actually, DEVICE_nGnRE memory type is even more problematic
> than eg Normal-NC memory type in terms of faults containability
> in that eg aborts triggered on DEVICE_nGnRE loads cannot be made,
> architecturally, synchronous (ie that would imply that the processor
> should issue at most 1 load transaction at a time - ie it can't pipeline
> them - otherwise the synchronous abort semantics would break the
> no-speculation attribute attached to DEVICE_XXX memory).
> 
> This means that regardless of the combined stage1+stage2 mappings a
> platform is safe if and only if device transactions cannot trigger
> uncontained failures and that in turn relies on platform
> capabilities and the device type being assigned (ie PCIe AER/DPC
> error containment and RAS architecture[3]); therefore the default
> KVM device stage 2 memory attributes play no role in making device
> assignment safer for a given platform (if the platform design
> adheres to design guidelines outlined in [3]) and therefore can
> be relaxed.
> 
> For all these reasons, relax the KVM stage 2 device
> memory attributes from DEVICE_nGnRE to Normal-NC.

I think this covers the safety aspect w.r.t. uncontained errors.

> A different Normal memory type default at stage-2
> (eg Normal Write-through) could have been chosen
> instead of Normal-NC but Normal-NC was chosen
> because:
> 
> - Its attributes are well-understood compared to
>   other Normal memory types for MMIO mappings

Ideally we'd have gone for Device_GRE from a performance perspective, but
it doesn't support unaligned accesses. Basically, at this stage we don't
want cacheable accesses to MMIO until we gain a better understanding of
what the impact would be.

> - On systems implementing S2FWB (FEAT_S2FWB), that's the only sensible
>   default for normal memory types. For systems implementing
>   FEAT_S2FWB (see [1] D8.5.5 S2=stage-2 - S2 MemAttr[3:0]), the options
>   to choose the memory types are as follows:
> 
>   if S2 MemAttr[2] == 0, the mapping defaults to DEVICE_XXX
>   (XXX determined by S2 MemAttr[1:0]). This is what we have
>   today (MemAttr[2:0] == 0b001) and therefore it is not described
>   any further.
> 
>   if S2 MemAttr[2] == 1, there are the following options:
>  
>   S2 MemAttr[2:0] | Resulting mapping
>   -----------------------------------------------------------------------------
>   0b101           | Prevent the guest from selecting cachable memory type, it
> 		  | allows it to choose Device-* or Normal-NC
>   0b110           | It forces write-back memory type; it breaks MMIO.
>   0b111           | Allow the VM to select any memory type including cachable.
> 		  | It is unclear whether this is safe from a platform
> 		  | perspective, especially wrt uncontained failures and
> 		  | cacheability (eg device reclaim/reset and cache
> 		  | maintenance).
>   ------------------------------------------------------------------------------
> 
>   - For !FEAT_S2FWB systems, it is logical to choose a default S2 mapping
>     identical to FEAT_S2FWB (that basically would force Normal-NC, see
>     option S2 MemAttr[2:0] == 0b101 above), to guarantee a coherent approach
>     between the two

Is this text only to say that we can't use write-through memory because
with FEAT_S2FWB it is forced cacheable? I'd probably keep it simple and
just state that we went for Normal NC to avoid cache
allocation/snooping.

It just crossed my mind: I think we also assume here that on platforms
with transparent caches, Normal NC mappings of MMIO won't be upgraded to
cacheable (that's irrespective of KVM). This may need to be stated in
some Arm BSA document (I'm not aware of any arch rules that prevent this
from happening). AFAIK, platforms with transparent caches only do this
for the memory ranges where DRAM is expected.

> Relaxing S2 KVM device MMIO mappings to Normal-NC is not expected to
> trigger any issue on guest device reclaim use cases either (ie device
> MMIO unmap followed by a device reset) at least for PCIe devices, in that
> in PCIe a device reset is architected and carried out through PCI config
> space transactions that are naturally ordered wrt MMIO transactions
> according to the PCI ordering rules.

We could state in a doc that, for device pass-through, there's an
expectation that the device can be reclaimed along the lines of the PCIe
reset behaviour. If an admin decides to allow device pass-through for
some unsafe devices/platforms, it's their problem really. I don't think
a Device_nGnRE mapping would help in those cases anyway.

This could be expanded to Normal Cacheable at some point but KVM would
have to do cache maintenance after unmapping the device from the guest
(so let's leave this for a separate series).

> Having Normal-NC S2 default puts guests in control (thanks to
> stage1+stage2 combined memory attributes rules [1]) of device MMIO
> regions memory mappings, according to the rules described in [1]
> and summarized here ([(S1) - stage1], [(S2) - stage 2]):
> 
> S1	     |  S2	     | Result
> NORMAL-WB    |  NORMAL-NC    | NORMAL-NC
> NORMAL-WT    |  NORMAL-NC    | NORMAL-NC
> NORMAL-NC    |  NORMAL-NC    | NORMAL-NC
> DEVICE<attr> |  NORMAL-NC    | DEVICE<attr>
> 
> It is worth noting that currently, to map devices MMIO space to user
> space in a device pass-through use case the VFIO framework applies memory
> attributes derived from pgprot_noncached() settings applied to VMAs, which
> result in device-nGnRnE memory attributes for the stage-1 VMM mappings.
> 
> This means that a userspace mapping for device MMIO space carried
> out with the current VFIO framework and a guest OS mapping for the same
> MMIO space may result in a mismatched alias as described in [2].
> 
> Defaulting KVM device stage-2 mappings to Normal-NC attributes does not change
> anything in this respect, in that the mismatched aliases would only affect
> (refer to [2] for a detailed explanation) ordering between the userspace and
> GuestOS mappings resulting stream of transactions (ie it does not cause loss of
> property for either stream of transactions on its own), which is harmless
> given that the userspace and GuestOS access to the device is carried
> out through independent transactions streams.

I think this was a recent clarification from the architects (in private
emails). My understanding for some time has been that the mere presence
of a Normal alias would affect the semantics of the user (VMM) accesses
to the Device mapping. Apparently, this only matters if there is some
expected ordering between the user (VMM) accesses and the VM's accesses
via a different alias. I can't tell where the mismatched alias rules
state this, but I haven't read them in a while.
diff mbox series

Patch

diff --git a/arch/arm64/include/asm/memory.h b/arch/arm64/include/asm/memory.h
index fde4186cc387..c247e5f29d5a 100644
--- a/arch/arm64/include/asm/memory.h
+++ b/arch/arm64/include/asm/memory.h
@@ -147,6 +147,7 @@ 
  * Memory types for Stage-2 translation
  */
 #define MT_S2_NORMAL		0xf
+#define MT_S2_NORMAL_NC	0x5
 #define MT_S2_DEVICE_nGnRE	0x1
 
 /*
@@ -154,6 +155,7 @@ 
  * Stage-2 enforces Normal-WB and Device-nGnRE
  */
 #define MT_S2_FWB_NORMAL	6
+#define MT_S2_FWB_NORMAL_NC	5
 #define MT_S2_FWB_DEVICE_nGnRE	1
 
 #ifdef CONFIG_ARM64_4K_PAGES
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index ccd291b6893d..a80949002191 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -696,7 +696,7 @@  static int stage2_set_prot_attr(struct kvm_pgtable *pgt, enum kvm_pgtable_prot p
 				kvm_pte_t *ptep)
 {
 	bool device = prot & KVM_PGTABLE_PROT_DEVICE;
-	kvm_pte_t attr = device ? KVM_S2_MEMATTR(pgt, DEVICE_nGnRE) :
+	kvm_pte_t attr = device ? KVM_S2_MEMATTR(pgt, NORMAL_NC) :
 			    KVM_S2_MEMATTR(pgt, NORMAL);
 	u32 sh = KVM_PTE_LEAF_ATTR_LO_S2_SH_IS;