[v2,01/15] docs: create Memory Bandwidth Allocation (MBA) feature document

Message ID 1503537289-56036-2-git-send-email-yi.y.sun@linux.intel.com (mailing list archive)
State New, archived

Commit Message

Yi Sun Aug. 24, 2017, 1:14 a.m. UTC
This patch creates the MBA feature document in docs/features/. It describes
the key points of implementing MBA, which is described in detail in the Intel
SDM, "Introduction to Memory Bandwidth Allocation".

Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
---
v2:
    - declare 'HW' in Terminology.
      (suggested by Chao Peng)
    - replace 'COS ID of VCPU' to 'COS ID of domain'.
      (suggested by Chao Peng)
    - replace 'COS register' to 'Thrtl MSR'.
      (suggested by Chao Peng)
    - add description for 'psr-mba-show' to state that the decimal value is
      shown for linear mode but hexadecimal value is shown for non-linear mode.
      (suggested by Chao Peng)
    - remove content in 'Areas for improvement'.
      (suggested by Chao Peng)
    - use '<>' to specify mandatory argument to a command.
      (suggested by Wei Liu)
v1:
    - remove a special character to avoid the error when building pandoc.
---
 docs/features/intel_psr_mba.pandoc | 256 +++++++++++++++++++++++++++++++++++++
 1 file changed, 256 insertions(+)
 create mode 100644 docs/features/intel_psr_mba.pandoc

Comments

Roger Pau Monné Aug. 29, 2017, 11:46 a.m. UTC | #1
On Thu, Aug 24, 2017 at 09:14:35AM +0800, Yi Sun wrote:
> This patch creates MBA feature document in doc/features/. It describes
> key points to implement MBA which is described in details in Intel SDM
> "Introduction to Memory Bandwidth Allocation".
> 
> Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
> ---
> v2:
>     - declare 'HW' in Terminology.
>       (suggested by Chao Peng)
>     - replace 'COS ID of VCPU' to 'COS ID of domain'.
>       (suggested by Chao Peng)
>     - replace 'COS register' to 'Thrtl MSR'.
>       (suggested by Chao Peng)
>     - add description for 'psr-mba-show' to state that the decimal value is
>       shown for linear mode but hexadecimal value is shown for non-linear mode.
>       (suggested by Chao Peng)
>     - remove content in 'Areas for improvement'.
>       (suggested by Chao Peng)
>     - use '<>' to specify mandatory argument to a command.
>       (suggested by Wei Liu)
> v1:
>     - remove a special character to avoid the error when building pandoc.
> ---
>  docs/features/intel_psr_mba.pandoc | 256 +++++++++++++++++++++++++++++++++++++
>  1 file changed, 256 insertions(+)
>  create mode 100644 docs/features/intel_psr_mba.pandoc
> 
> diff --git a/docs/features/intel_psr_mba.pandoc b/docs/features/intel_psr_mba.pandoc
> new file mode 100644
> index 0000000..21592e8
> --- /dev/null
> +++ b/docs/features/intel_psr_mba.pandoc
> @@ -0,0 +1,256 @@
> +% Intel Memory Bandwidth Allocation (MBA) Feature
> +% Revision 1.4
> +
> +\clearpage
> +
> +# Basics
> +
> +---------------- ----------------------------------------------------
> +         Status: **Tech Preview**
> +
> +Architecture(s): Intel x86
> +
> +   Component(s): Hypervisor, toolstack
> +
> +       Hardware: MBA is supported on Skylake Server and beyond
> +---------------- ----------------------------------------------------
> +
> +# Terminology
> +
> +* CAT         Cache Allocation Technology
> +* CBM         Capacity BitMasks
> +* CDP         Code and Data Prioritization
> +* COS/CLOS    Class of Service
> +* HW          Hardware
> +* MBA         Memory Bandwidth Allocation
> +* MSRs        Machine Specific Registers
> +* PSR         Intel Platform Shared Resource
> +* THRTL       Throttle value or delay value
> +
> +# Overview
> +
> +The Memory Bandwidth Allocation (MBA) feature provides indirect and approximate
> +control over memory bandwidth available per-core. This feature provides OS/
> +hypervisor the ability to slow misbehaving apps/domains or create advanced
> +closed-loop control system via exposing control over a credit-based throttling
> +mechanism.

I don't really understand what "advanced closed-loop control system
via exposing..." means. From my understanding it is clearer/simpler to
write it as:

"... the ability to slow misbehaving apps/domains by using a
credit-based throttling mechanism".

> +
> +# User details
> +
> +* Feature Enabling:
> +
> +  Add "psr=mba" to boot line parameter to enable MBA feature.
> +
> +* xl interfaces:
> +
> +  1. `psr-mba-show [domain-id]`:
> +
> +     Show memory bandwidth throttling for domain. For linear mode, it shows the
> +     decimal value. For non-linear mode, it shows hexadecimal value.

You should first explain what the linear and non-linear modes are.

> +
> +  2. `psr-mba-set [OPTIONS] <domain-id> <throttling>`:
> +
> +     Set memory bandwidth throttling for domain.
> +
> +     Options:
> +     '-s': Specify the socket to process, otherwise all sockets are processed.
> +
> +     Throttling value set in register implies memory bandwidth blocked, i.e.
> +     higher throttling value results in lower bandwidth. The max throttling
> +     value can be got through CPUID.

This is also hard to understand IMHO. I would rather write it as:

"The throttling value describes the amount of blocked bandwidth".
Although I have to admit I don't really understand this interface,
wouldn't it be easier to specify the memory bandwidth allowed
per-domain, rather than the amount of bandwidth removed?

Using your approach the user has to first get the total bandwidth, and
then subtract the removed bandwidth in order to know the remaining
bandwidth for a domain.

Also, IMHO you should provide a command to print the max throttling,
remember that from Xen's PoV Dom0 is just another domain, and the
CPUID values reported to Dom0 don't need to be the same as found on
bare metal.

> +
> +     The response of the throttling value could be linear mode or non-linear
> +     mode.
> +
> +     Linear mode: the input precision is defined as 100-(MBA_MAX). For instance,

What's MBA_MAX? I don't see any reference/description of it above.

> +     if the MBA_MAX value is 90, the input precision is 10%. Values not an even
> +     multiple of the precision (e.g., 12%) will be rounded down (e.g., to 10%
> +     delay applied) by HW automatically.
> +
> +     Non-linear mode: input delay values are powers-of-two from zero to the
> +     MBA_MAX value from CPUID. In this case any values not a power of two will
> +     be rounded down the next nearest power of two by HW automatically.
> +
> +# Technical details
> +
> +MBA is a member of Intel PSR features, it shares the base PSR infrastructure
> +in Xen.
> +
> +## Hardware perspective
> +
> +  MBA defines a range of MSRs to support specifying a delay value (Thrtl) per
> +  COS, with details below.
> +
> +  ```
> +   +----------------------------+----------------+
> +   | MSR (per socket)           |    Address     |
> +   +----------------------------+----------------+
> +   | IA32_L2_QOS_Ext_BW_Thrtl_0 |     0xD50      |
> +   +----------------------------+----------------+
> +   | ...                        |  ...           |
> +   +----------------------------+----------------+
> +   | IA32_L2_QOS_Ext_BW_Thrtl_n | 0xD50+n (n<64) |
> +   +----------------------------+----------------+
> +  ```

Are you sure you want to hardcode this n<64? Isn't there a chance this
is going to be bumped in newer hardware?

> +
> +  When context switch happens, the COS ID of domain is written to per-thread MSR
> +  `IA32_PQR_ASSOC`, and then hardware enforces bandwidth allocation according
> +  to the throttling value stored in the Thrtl MSR register.
> +
> +## The relationship between MBA and CAT/CDP
> +
> +  Generally speaking, MBA is completely independent of CAT/CDP, and any
> +  combination may be applied at any time, e.g. enabling MBA with CAT
> +  disabled.
> +
> +  But it needs to be noticed that MBA shares COS infrastructure with CAT,
> +  although MBA is enumerated by different CPUID leaf from CAT (which
> +  indicates that the max COS of MBA may be different from CAT). In some
> +  cases, a domain is permitted to have a COS that is beyond one (or more)
> +  of PSR features but within the others. For instance, let's assume the max
> +  COS of MBA is 8 but the max COS of L3 CAT is 16, when a domain is assigned
> +  9 as COS, the L3 CAT CBM associated to COS 9 would be enforced, but for MBA,
> +  the HW works as default value is set since COS 9 is beyond the max COS (8)
> +  of MBA.
> +
> +## Design Overview
> +
> +* Core COS/Thrtl association
> +
> +  When enforcing Memory Bandwidth Allocation, all cores of domains have
> +  the same default Thrtl MSR (COS0) which stores the same Thrtl (0). The
> +  default Thrtl MSR is used only in hypervisor and is transparent to tool stack
> +  and user.
> +
> +  System administrator can change PSR allocation policy at runtime by
                        ^s
> +  tool stack. Since MBA shares COS ID with CAT/CDP, a COS ID corresponds to a
     ^ using the tool ...
> +  2-tuple, like [CBM, Thrtl] with only-CAT enalbed, when CDP is enabled,
                                              ^ enabled
> +  the COS ID corresponds to a 3-tuple, like [Code_CBM, Data_CBM, Thrtl]. If
> +  neither CAT nor CDP is enabled, things would be easier, one COS ID corresponds
                                            ^ are easier, since one...
> +  to one Thrtl.
> +
> +* VCPU schedule
> +
> +  This part reuses CAT COS infrastructure.
> +
> +* Multi-sockets
> +
> +  Different sockets may have different MBA ability (like max COS)
> +  although it is consistent on the same socket. So the capability
> +  of per-socket MBA is specified.
> +
> +  This part reuses CAT COS infrastructure.
> +
> +## Implementation Description
> +
> +* Hypervisor interfaces:
> +
> +  1. Boot line param: "psr=mba" to enable the feature.
> +
> +  2. SYSCTL:
> +          - XEN_SYSCTL_PSR_MBA_get_info: Get system MBA information.
> +
> +  3. DOMCTL:
> +          - XEN_DOMCTL_PSR_MBA_OP_GET_THRTL: Get throttling for a domain.
> +          - XEN_DOMCTL_PSR_MBA_OP_SET_THRTL: Set throttling for a domain.
> +
> +* xl interfaces:
> +
> +  1. psr-mba-show [domain-id]
> +          Show system/domain runtime MBA throttling value. For linear mode,
> +          it shows the decimal value. For non-linear mode, it shows hexadecimal
> +          value.
> +          => XEN_SYSCTL_PSR_MBA_get_info/XEN_DOMCTL_PSR_MBA_OP_GET_THRTL
> +
> +  2. psr-mba-set [OPTIONS] <domain-id> <throttling>
> +          Set bandwidth throttling for a domain.
> +          => XEN_DOMCTL_PSR_MBA_OP_SET_THRTL
> +
> +  3. psr-hwinfo
> +          Show PSR HW information, including L3 CAT/CDP/L2 CAT/MBA.
> +          => XEN_SYSCTL_PSR_MBA_get_info
> +
> +* Key data structure:
> +
> +  1. Feature HW info
> +
> +     ```
> +     struct {
> +         unsigned int thrtl_max;
> +         unsigned int linear;
> +     } mba_info;

Is this a domctl structure? a libxl one?

> +
> +     - Member `thrtl_max`
> +
> +       `thrtl_max` is the max throttling value to be set.
> +
> +     - Member `linear`
> +
> +       `linear` means the response of delay value is linear or not.
> +
> +     As mentioned above, MBA is a member of Intel PSR features, it would
> +     share the base PSR infrastructure in Xen. For example, the 'cos_max'
> +     is a common HW property for all features. So, for other data structure
> +     details, please refer 'intel_psr_cat_cdp.pandoc'.
> +
> +# Limitations
> +
> +MBA can only work on HW which enables it (check by CPUID).
> +
> +# Testing
> +
> +We can execute these commands to verify MBA on different HWs supporting them.
> +
> +For example:
> +    root@:~$ xl psr-hwinfo --mba
> +    Memory Bandwidth Allocation (MBA):
> +    Socket ID       : 0
> +    Linear Mode     : Enabled
> +    Maximum COS     : 7
> +    Maximum Throttling Value: 90
> +    Default Throttling Value: 0
> +
> +    root@:~$ xl psr-mba-set 1 0xa

Could you elaborate a little bit on why '0xa' is used here? IMHO The
example should provide some context.

Thanks, Roger.

Yi Sun Aug. 30, 2017, 5:20 a.m. UTC | #2
Thanks a lot for the review comments!

On 17-08-29 12:46:49, Roger Pau Monné wrote:
> On Thu, Aug 24, 2017 at 09:14:35AM +0800, Yi Sun wrote:
> > +# Overview
> > +
> > +The Memory Bandwidth Allocation (MBA) feature provides indirect and approximate
> > +control over memory bandwidth available per-core. This feature provides OS/
> > +hypervisor the ability to slow misbehaving apps/domains or create advanced
> > +closed-loop control system via exposing control over a credit-based throttling
> > +mechanism.
> 
> I don't really understand what "advanced closed-loop control system
> via exposing..." means. From my understand it is clear/simpler to
> write it as:
> 
> "... the ability to slow misbehaving apps/domains by using a
> credit-based throttling mechanism".
> 
Thanks, 'closed-loop' looks redundant, will remove it.

> > +
> > +# User details
> > +
> > +* Feature Enabling:
> > +
> > +  Add "psr=mba" to boot line parameter to enable MBA feature.
> > +
> > +* xl interfaces:
> > +
> > +  1. `psr-mba-show [domain-id]`:
> > +
> > +     Show memory bandwidth throttling for domain. For linear mode, it shows the
> > +     decimal value. For non-linear mode, it shows hexadecimal value.
> 
> You should first explain what are the linear and non-linear modes.
> 
OK, I will move the linear/non-linear mode explanation from below up to here.

> > +
> > +  2. `psr-mba-set [OPTIONS] <domain-id> <throttling>`:
> > +
> > +     Set memory bandwidth throttling for domain.
> > +
> > +     Options:
> > +     '-s': Specify the socket to process, otherwise all sockets are processed.
> > +
> > +     Throttling value set in register implies memory bandwidth blocked, i.e.
> > +     higher throttling value results in lower bandwidth. The max throttling
> > +     value can be got through CPUID.
> 
> This is also hard to understand IMHO. I would rather write it as:
> 
> "The throttling value describes the amount of blocked bandwidth".

Thanks! This is not accurate. In fact, the throttling value is the approximate
amount of delay applied to the traffic between the core and memory.

> Although I have to admit I don't really understand this interface,
> wouldn't it be easier to specify the memory bandwidth allowed
> per-domain, rather the amount of bandwidth removed?
> 
> Using your approach the user has to first get the total bandwidth, and
> then subtract the removed bandwidth in order to know the remaining
> bandwidth for a domain.
> 
The HW only provides a throttling value as the means to control the bandwidth.
So, I think it is straightforward to set the throttling value in the tools
layer. 'psr-mba-set' is designed as a simple command that sets what the HW needs.

Also, as the SDM states, "The throttling values exposed by MBA are approximate,
and are calibrated to specific traffic patterns.". So, it is hard to provide
exact bandwidth control in 'psr-mba-set'.

> Also, IMHO you should provide a command to print the max throttling,

'psr-hwinfo' can show the max throttling value, because it is part of the MBA HW info.

> remember that from Xen's PoV Dom0 is just another domain, and the
> CPUID values reported to Dom0 don't need to be the same as found on
> bare metal.
> 
But the CPUID values obtained through the 'psr' commands should be the ones found
on bare metal, right? These commands get the values directly from the hypervisor
through domctl/sysctl.

> > +
> > +     The response of the throttling value could be linear mode or non-linear
> > +     mode.
> > +
> > +     Linear mode: the input precision is defined as 100-(MBA_MAX). For instance,
> 
> What's MBA_MAX? I don't see any reference/description of it above.
> 
Sorry, will explain it.
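
To make the rounding rules quoted below concrete, here is a rough software model of them, assuming MBA_MAX is the maximum throttling value enumerated through CPUID. The helper names are hypothetical and exist only for illustration; the real rounding is done by the HW, not by Xen or xl.

```c
#include <assert.h>

/*
 * Software model of the rounding described in the document.
 * Illustrative only: hypothetical helpers, not Xen or xl code.
 */

/* Linear mode: the step is 100 - mba_max; round down to a multiple of it. */
static unsigned int mba_round_linear(unsigned int thrtl, unsigned int mba_max)
{
    unsigned int step;

    /* Preconditions of this sketch, not checks performed by the HW. */
    assert(mba_max > 0 && mba_max < 100);
    assert(thrtl <= mba_max);

    step = 100 - mba_max;
    return thrtl - (thrtl % step);
}

/* Non-linear mode: round down to a power of two; zero stays zero. */
static unsigned int mba_round_nonlinear(unsigned int thrtl)
{
    unsigned int p = 1;

    if (thrtl == 0)
        return 0;

    /* Double p while the next power of two still fits within thrtl. */
    while (p <= thrtl / 2)
        p *= 2;

    return p;
}
```

With MBA_MAX = 90 the step is 10, so `mba_round_linear(12, 90)` yields 10, which matches the 12% to 10% example in the document.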

> > +     if the MBA_MAX value is 90, the input precision is 10%. Values not an even
> > +     multiple of the precision (e.g., 12%) will be rounded down (e.g., to 10%
> > +     delay applied) by HW automatically.
> > +
> > +     Non-linear mode: input delay values are powers-of-two from zero to the
> > +     MBA_MAX value from CPUID. In this case any values not a power of two will
> > +     be rounded down the next nearest power of two by HW automatically.
> > +
> > +# Technical details
> > +
> > +MBA is a member of Intel PSR features, it shares the base PSR infrastructure
> > +in Xen.
> > +
> > +## Hardware perspective
> > +
> > +  MBA defines a range of MSRs to support specifying a delay value (Thrtl) per
> > +  COS, with details below.
> > +
> > +  ```
> > +   +----------------------------+----------------+
> > +   | MSR (per socket)           |    Address     |
> > +   +----------------------------+----------------+
> > +   | IA32_L2_QOS_Ext_BW_Thrtl_0 |     0xD50      |
> > +   +----------------------------+----------------+
> > +   | ...                        |  ...           |
> > +   +----------------------------+----------------+
> > +   | IA32_L2_QOS_Ext_BW_Thrtl_n | 0xD50+n (n<64) |
> > +   +----------------------------+----------------+
> > +  ```
> 
> Are you sure you want to hardcode this n<64? Isn't there a chance this
> is going to be bumped in newer hardware?
> 
This is just a HW limitation stated in the SDM. In fact, there is no such
hard-coded limit in the code: the hypervisor checks the 'cos_max' obtained
through CPUID.

> > +
> > +  When context switch happens, the COS ID of domain is written to per-thread MSR
> > +  `IA32_PQR_ASSOC`, and then hardware enforces bandwidth allocation according
> > +  to the throttling value stored in the Thrtl MSR register.
> > +
> > +## The relationship between MBA and CAT/CDP
> > +
> > +  Generally speaking, MBA is completely independent of CAT/CDP, and any
> > +  combination may be applied at any time, e.g. enabling MBA with CAT
> > +  disabled.
> > +
> > +  But it needs to be noticed that MBA shares COS infrastructure with CAT,
> > +  although MBA is enumerated by different CPUID leaf from CAT (which
> > +  indicates that the max COS of MBA may be different from CAT). In some
> > +  cases, a domain is permitted to have a COS that is beyond one (or more)
> > +  of PSR features but within the others. For instance, let's assume the max
> > +  COS of MBA is 8 but the max COS of L3 CAT is 16, when a domain is assigned
> > +  9 as COS, the L3 CAT CBM associated to COS 9 would be enforced, but for MBA,
> > +  the HW works as default value is set since COS 9 is beyond the max COS (8)
> > +  of MBA.
> > +
> > +## Design Overview
> > +
> > +* Core COS/Thrtl association
> > +
> > +  When enforcing Memory Bandwidth Allocation, all cores of domains have
> > +  the same default Thrtl MSR (COS0) which stores the same Thrtl (0). The
> > +  default Thrtl MSR is used only in hypervisor and is transparent to tool stack
> > +  and user.
> > +
> > +  System administrator can change PSR allocation policy at runtime by
>                         ^s
> > +  tool stack. Since MBA shares COS ID with CAT/CDP, a COS ID corresponds to a
>      ^ using the tool ...

Will fix this and the wording below.

> > +  2-tuple, like [CBM, Thrtl] with only-CAT enalbed, when CDP is enabled,
>                                               ^ enabled
> > +  the COS ID corresponds to a 3-tuple, like [Code_CBM, Data_CBM, Thrtl]. If
> > +  neither CAT nor CDP is enabled, things would be easier, one COS ID corresponds
>                                             ^ are easier, since one...
> > +  to one Thrtl.
> > +
> > +* VCPU schedule
> > +
> > +  This part reuses CAT COS infrastructure.
> > +
> > +* Multi-sockets
> > +
> > +  Different sockets may have different MBA ability (like max COS)
> > +  although it is consistent on the same socket. So the capability
> > +  of per-socket MBA is specified.
> > +
> > +  This part reuses CAT COS infrastructure.
> > +
> > +## Implementation Description
> > +
> > +* Hypervisor interfaces:
> > +
> > +  1. Boot line param: "psr=mba" to enable the feature.
> > +
> > +  2. SYSCTL:
> > +          - XEN_SYSCTL_PSR_MBA_get_info: Get system MBA information.
> > +
> > +  3. DOMCTL:
> > +          - XEN_DOMCTL_PSR_MBA_OP_GET_THRTL: Get throttling for a domain.
> > +          - XEN_DOMCTL_PSR_MBA_OP_SET_THRTL: Set throttling for a domain.
> > +
> > +* xl interfaces:
> > +
> > +  1. psr-mba-show [domain-id]
> > +          Show system/domain runtime MBA throttling value. For linear mode,
> > +          it shows the decimal value. For non-linear mode, it shows hexadecimal
> > +          value.
> > +          => XEN_SYSCTL_PSR_MBA_get_info/XEN_DOMCTL_PSR_MBA_OP_GET_THRTL
> > +
> > +  2. psr-mba-set [OPTIONS] <domain-id> <throttling>
> > +          Set bandwidth throttling for a domain.
> > +          => XEN_DOMCTL_PSR_MBA_OP_SET_THRTL
> > +
> > +  3. psr-hwinfo
> > +          Show PSR HW information, including L3 CAT/CDP/L2 CAT/MBA.
> > +          => XEN_SYSCTL_PSR_MBA_get_info
> > +
> > +* Key data structure:
> > +
> > +  1. Feature HW info
> > +
> > +     ```
> > +     struct {
> > +         unsigned int thrtl_max;
> > +         unsigned int linear;
> > +     } mba_info;
> 
> Is this a domctl structure? a libxl one?
> 
Nope, this is a hypervisor side data structure used in 'psr.c'.

> > +
> > +     - Member `thrtl_max`
> > +
> > +       `thrtl_max` is the max throttling value to be set.
> > +
> > +     - Member `linear`
> > +
> > +       `linear` means the response of delay value is linear or not.
> > +
> > +     As mentioned above, MBA is a member of Intel PSR features, it would
> > +     share the base PSR infrastructure in Xen. For example, the 'cos_max'
> > +     is a common HW property for all features. So, for other data structure
> > +     details, please refer 'intel_psr_cat_cdp.pandoc'.
> > +
> > +# Limitations
> > +
> > +MBA can only work on HW which enables it (check by CPUID).
> > +
> > +# Testing
> > +
> > +We can execute these commands to verify MBA on different HWs supporting them.
> > +
> > +For example:
> > +    root@:~$ xl psr-hwinfo --mba
> > +    Memory Bandwidth Allocation (MBA):
> > +    Socket ID       : 0
> > +    Linear Mode     : Enabled
> > +    Maximum COS     : 7
> > +    Maximum Throttling Value: 90
> > +    Default Throttling Value: 0
> > +
> > +    root@:~$ xl psr-mba-set 1 0xa
> 
> Could you elaborate a little bit on why '0xa' is used here? IMHO The
> example should provide some context.
> 
Sure, I will explain the meaning of '0xa' or '10' here.

> Thanks, Roger.

Roger Pau Monné Aug. 30, 2017, 7:42 a.m. UTC | #3
On Wed, Aug 30, 2017 at 01:20:14PM +0800, Yi Sun wrote:
> Thanks a lot for the review comments!
> 
> On 17-08-29 12:46:49, Roger Pau Monné wrote:
> > On Thu, Aug 24, 2017 at 09:14:35AM +0800, Yi Sun wrote:
> > Although I have to admit I don't really understand this interface,
> > wouldn't it be easier to specify the memory bandwidth allowed
> > per-domain, rather the amount of bandwidth removed?
> > 
> > Using your approach the user has to first get the total bandwidth, and
> > then subtract the removed bandwidth in order to know the remaining
> > bandwidth for a domain.
> > 
> The HW only provides throttling set method to control the bandwidth. So, I
> think it is straightforward to set throttling in tools layer. The 'psr-mba-set'
> is designed as a simple command to set what HW needs.
> 
> Also, mentioned by SDM, "The throttling values exposed by MBA are approximate,
> and are calibrated to specific traffic patterns.". So, it is hard to provide
> exact bandwidth control in 'psr-mba-set'.

OK, I think I will wait until I see the example explained in order to
express my opinion on the proposed toolstack interface.

> > Also, IMHO you should provide a command to print the max throttling,
> 
> The 'psr-hwinfo' can show the max throttling. Because it is part of MBA HW info.
> 
> > remember that from Xen's PoV Dom0 is just another domain, and the
> > CPUID values reported to Dom0 don't need to be the same as found on
> > bare metal.
> > 
> But the CPUID values got through 'psr' commands should be ones found on bare
> metal, right? Because these commands directly get the values from hypervisor
> through domctl/sysctl.

Yes, if they are provided by the hypervisor (i.e. the cpuid instruction is
executed in Xen, not Dom0).

> > > +     if the MBA_MAX value is 90, the input precision is 10%. Values not an even
> > > +     multiple of the precision (e.g., 12%) will be rounded down (e.g., to 10%
> > > +     delay applied) by HW automatically.
> > > +
> > > +     Non-linear mode: input delay values are powers-of-two from zero to the
> > > +     MBA_MAX value from CPUID. In this case any values not a power of two will
> > > +     be rounded down the next nearest power of two by HW automatically.
> > > +
> > > +# Technical details
> > > +
> > > +MBA is a member of Intel PSR features, it shares the base PSR infrastructure
> > > +in Xen.
> > > +
> > > +## Hardware perspective
> > > +
> > > +  MBA defines a range of MSRs to support specifying a delay value (Thrtl) per
> > > +  COS, with details below.
> > > +
> > > +  ```
> > > +   +----------------------------+----------------+
> > > +   | MSR (per socket)           |    Address     |
> > > +   +----------------------------+----------------+
> > > +   | IA32_L2_QOS_Ext_BW_Thrtl_0 |     0xD50      |
> > > +   +----------------------------+----------------+
> > > +   | ...                        |  ...           |
> > > +   +----------------------------+----------------+
> > > +   | IA32_L2_QOS_Ext_BW_Thrtl_n | 0xD50+n (n<64) |
> > > +   +----------------------------+----------------+
> > > +  ```
> > 
> > Are you sure you want to hardcode this n<64? Isn't there a chance this
> > is going to be bumped in newer hardware?
> > 
> This is just a HW limitation declared in SDM. In fact, there is no such hard
> code limitation. Hypervisor side checks the 'cos_max' got through CPUID.

Then I would remove the n<64, or else this will get out-of-sync
without anyone noticing.

Thanks, Roger.

Patch

diff --git a/docs/features/intel_psr_mba.pandoc b/docs/features/intel_psr_mba.pandoc
new file mode 100644
index 0000000..21592e8
--- /dev/null
+++ b/docs/features/intel_psr_mba.pandoc
@@ -0,0 +1,256 @@ 
+% Intel Memory Bandwidth Allocation (MBA) Feature
+% Revision 1.4
+
+\clearpage
+
+# Basics
+
+---------------- ----------------------------------------------------
+         Status: **Tech Preview**
+
+Architecture(s): Intel x86
+
+   Component(s): Hypervisor, toolstack
+
+       Hardware: MBA is supported on Skylake Server and beyond
+---------------- ----------------------------------------------------
+
+# Terminology
+
+* CAT         Cache Allocation Technology
+* CBM         Capacity BitMasks
+* CDP         Code and Data Prioritization
+* COS/CLOS    Class of Service
+* HW          Hardware
+* MBA         Memory Bandwidth Allocation
+* MSRs        Machine Specific Registers
+* PSR         Intel Platform Shared Resource
+* THRTL       Throttle value or delay value
+
+# Overview
+
+The Memory Bandwidth Allocation (MBA) feature provides indirect and approximate
+control over the memory bandwidth available per-core. It gives the OS/
+hypervisor the ability to slow misbehaving apps/domains, or to build advanced
+closed-loop control systems, by exposing control over a credit-based throttling
+mechanism.
+
+# User details
+
+* Feature Enabling:
+
+  Add "psr=mba" to the boot line parameters to enable the MBA feature.
+
+* xl interfaces:
+
+  1. `psr-mba-show [domain-id]`:
+
+     Show memory bandwidth throttling for a domain. In linear mode, it shows the
+     decimal value. In non-linear mode, it shows the hexadecimal value.
+
+  2. `psr-mba-set [OPTIONS] <domain-id> <throttling>`:
+
+     Set memory bandwidth throttling for a domain.
+
+     Options:
+     '-s': Specify the socket to process, otherwise all sockets are processed.
+
+     The throttling value set in the register expresses the delay applied to
+     memory traffic, i.e. a higher throttling value results in lower bandwidth.
+     The max throttling value (MBA_MAX) can be obtained through CPUID.
+
+     The response to the throttling value can be either linear mode or
+     non-linear mode.
+
+     Linear mode: the input precision is defined as 100-(MBA_MAX). For instance,
+     if the MBA_MAX value is 90, the input precision is 10%. Values that are not
+     an even multiple of the precision (e.g., 12%) are rounded down (e.g., to
+     10% delay applied) by HW automatically.
+
+     Non-linear mode: input delay values are powers of two from zero to the
+     MBA_MAX value from CPUID. In this case, any value that is not a power of two
+     is rounded down to the next nearest power of two by HW automatically.
+
+# Technical details
+
+MBA is one of the Intel PSR features and shares the base PSR infrastructure
+in Xen.
+
+## Hardware perspective
+
+  MBA defines a range of MSRs to support specifying a delay value (Thrtl) per
+  COS, with details below.
+
+  ```
+   +----------------------------+----------------+
+   | MSR (per socket)           |    Address     |
+   +----------------------------+----------------+
+   | IA32_L2_QOS_Ext_BW_Thrtl_0 |     0xD50      |
+   +----------------------------+----------------+
+   | ...                        |  ...           |
+   +----------------------------+----------------+
+   | IA32_L2_QOS_Ext_BW_Thrtl_n | 0xD50+n (n<64) |
+   +----------------------------+----------------+
+  ```
+
+  When a context switch happens, the COS ID of the domain is written to the
+  per-thread MSR `IA32_PQR_ASSOC`, and the hardware then enforces bandwidth
+  allocation according to the throttling value stored in the Thrtl MSR.
+
+## The relationship between MBA and CAT/CDP
+
+  Generally speaking, MBA is completely independent of CAT/CDP, and any
+  combination may be applied at any time, e.g. enabling MBA with CAT
+  disabled.
+
+  Note, however, that MBA shares the COS infrastructure with CAT, although
+  MBA is enumerated by a different CPUID leaf from CAT (which means that
+  the max COS of MBA may differ from that of CAT). In some cases, a domain
+  is permitted to have a COS that is beyond the range of one (or more) PSR
+  features but within the range of the others. For instance, assume the max
+  COS of MBA is 8 but the max COS of L3 CAT is 16. When a domain is assigned
+  COS 9, the L3 CAT CBM associated with COS 9 is enforced, but for MBA the
+  HW behaves as if the default value were set, since COS 9 is beyond the max
+  COS (8) of MBA.
+
+## Design Overview
+
+* Core COS/Thrtl association
+
+  When enforcing Memory Bandwidth Allocation, all cores of domains have
+  the same default Thrtl MSR (COS0) which stores the same Thrtl (0). The
+  default Thrtl MSR is used only in the hypervisor and is transparent to the
+  tool stack and user.
+
+  System administrators can change the PSR allocation policy at runtime using
+  the tool stack. Since MBA shares the COS ID with CAT/CDP, a COS ID corresponds
+  to a 2-tuple, like [CBM, Thrtl], with only CAT enabled; when CDP is enabled,
+  the COS ID corresponds to a 3-tuple, like [Code_CBM, Data_CBM, Thrtl]. If
+  neither CAT nor CDP is enabled, things are easier, since one COS ID
+  corresponds to one Thrtl.
+
+* VCPU schedule
+
+  This part reuses CAT COS infrastructure.
+
+* Multi-sockets
+
+  Different sockets may have different MBA capabilities (like max COS),
+  although they are consistent within the same socket. So the MBA capability
+  is specified per socket.
+
+  This part reuses CAT COS infrastructure.
+
+## Implementation Description
+
+* Hypervisor interfaces:
+
+  1. Boot line param: "psr=mba" to enable the feature.
+
+  2. SYSCTL:
+          - XEN_SYSCTL_PSR_MBA_get_info: Get system MBA information.
+
+  3. DOMCTL:
+          - XEN_DOMCTL_PSR_MBA_OP_GET_THRTL: Get throttling for a domain.
+          - XEN_DOMCTL_PSR_MBA_OP_SET_THRTL: Set throttling for a domain.
+
+* xl interfaces:
+
+  1. psr-mba-show [domain-id]
+          Show system/domain runtime MBA throttling value. In linear mode,
+          it shows the decimal value. In non-linear mode, it shows the
+          hexadecimal value.
+          => XEN_SYSCTL_PSR_MBA_get_info/XEN_DOMCTL_PSR_MBA_OP_GET_THRTL
+
+  2. psr-mba-set [OPTIONS] <domain-id> <throttling>
+          Set bandwidth throttling for a domain.
+          => XEN_DOMCTL_PSR_MBA_OP_SET_THRTL
+
+  3. psr-hwinfo
+          Show PSR HW information, including L3 CAT/CDP/L2 CAT/MBA.
+          => XEN_SYSCTL_PSR_MBA_get_info
+
+* Key data structure:
+
+  1. Feature HW info
+
+     ```
+     struct {
+         unsigned int thrtl_max;
+         unsigned int linear;
+     } mba_info;
+     ```
+     - Member `thrtl_max`
+
+       `thrtl_max` is the max throttling value that can be set.
+
+     - Member `linear`
+
+       `linear` indicates whether the response of the delay value is linear.
+
+     As mentioned above, MBA is one of the Intel PSR features, so it shares
+     the base PSR infrastructure in Xen. For example, 'cos_max' is a common
+     HW property for all features. So, for other data structure
+     details, please refer to 'intel_psr_cat_cdp.pandoc'.
+
+# Limitations
+
+MBA only works on HW that supports it (check via CPUID).
+
+# Testing
+
+The following commands can be used to verify MBA on HW that supports it.
+
+For example:
+    root@:~$ xl psr-hwinfo --mba
+    Memory Bandwidth Allocation (MBA):
+    Socket ID       : 0
+    Linear Mode     : Enabled
+    Maximum COS     : 7
+    Maximum Throttling Value: 90
+    Default Throttling Value: 0
+
+    root@:~$ xl psr-mba-set 1 0xa
+
+    root@:~$ xl psr-mba-show 1
+    Socket ID       : 0
+    Default THRTL   : 0
+       ID                     NAME            THRTL
+        1                 ubuntu14             0xa
+
+# Areas for improvement
+
+N/A
+
+# Known issues
+
+N/A
+
+# References
+
+"INTEL RESOURCE DIRECTOR TECHNOLOGY (INTEL RDT) ALLOCATION FEATURES" [Intel 64 and IA-32 Architectures Software Developer Manuals, vol3](http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html)
+
+# History
+
+------------------------------------------------------------------------
+Date       Revision Version  Notes
+---------- -------- -------- -------------------------------------------
+2017-01-10 1.0      Xen 4.9  Design document written
+2017-07-10 1.1      Xen 4.10 Changes:
+                             1. Modify data structure according to latest
+                                codes;
+                             2. Add content for 'Areas for improvement';
+                             3. Other minor changes.
+2017-08-09 1.2      Xen 4.10 Changes:
+                             1. Remove a special character to avoid error when
+                                building pandoc.
+2017-08-15 1.3      Xen 4.10 Changes:
+                             1. Add terminology 'HW'.
+                             2. Change 'COS ID of VCPU' to 'COS ID of domain'.
+                             3. Change 'COS register' to 'Thrtl MSR'.
+                             4. Explain the value shown for 'psr-mba-show' under
+                                different modes.
+                             5. Remove content in 'Areas for improvement'.
+2017-08-16 1.4      Xen 4.10 Changes:
+                             1. Add '<>' for mandatory argument.
+---------- -------- -------- -------------------------------------------