| Message ID | 20220107181618.6597-2-kristen@linux.intel.com |
|---|---|
| State      | New, archived |
| Series     | x86/sgx: Limit EPC overcommit |
On 1/7/22 10:16, Kristen Carlson Accardi wrote:
> The overcommit percentage value is 150, which limits the total number of
> shared memory pages that may be consumed by all enclaves as backing pages
> to 1.5X of EPC pages on the system.

Hi Kristen,

Could you give some background on how this value was chosen and how it
might impact userspace?
On Fri, 2022-01-07 at 10:46 -0800, Dave Hansen wrote:
> On 1/7/22 10:16, Kristen Carlson Accardi wrote:
> > The overcommit percentage value is 150, which limits the total number of
> > shared memory pages that may be consumed by all enclaves as backing pages
> > to 1.5X of EPC pages on the system.
>
> Hi Kristen,
>
> Could you give some background on how this value was chosen and how it
> might impact userspace?

Yes. The value of 1.5x the number of EPC pages was chosen because it
handles the most common case of a few enclaves that don't need much
overcommit without any impact to user space. In the less common case
where there are many enclaves, or a few large enclaves with big EPC
memory requirements that need a lot of overcommit, the reclaimer may
fail to allocate a backing page for swapping once the limit has been
reached. In that case the page cannot be reclaimed, and the system will
not be able to allocate any new EPC pages. Any ioctl or call that adds
EPC pages will get -ENOMEM: for example, new enclaves will fail to load,
and new EPC pages cannot be added to existing enclaves.

Does that make sense?

Kristen
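To make the failure propagation concrete, here is a minimal sketch of a
hypothetical backing-page allocation path. Only sgx_charge_mem() and
sgx_uncharge_mem() come from the patch; the helper name, the use of
encl->backing, and the call site are illustrative assumptions:

        /*
         * Hypothetical caller sketch (not from the patch): charge against
         * the global cap before allocating a shmem backing page, and hand
         * -ENOMEM back to the reclaimer or ioctl path once the cap is hit.
         */
        static struct page *sgx_encl_alloc_backing_page(struct sgx_encl *encl,
                                                        pgoff_t index)
        {
                struct page *page;

                if (sgx_charge_mem())
                        return ERR_PTR(-ENOMEM);        /* cap reached */

                page = shmem_read_mapping_page_gfp(encl->backing->f_mapping,
                                                   index, GFP_KERNEL);
                if (IS_ERR(page))
                        sgx_uncharge_mem();     /* give the charge back */

                return page;
        }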
On Fri, Jan 07, 2022 at 10:16:16AM -0800, Kristen Carlson Accardi wrote:
> When the system runs out of enclave memory, SGX can reclaim EPC pages
> by swapping to normal RAM. This normal RAM is allocated via a
> per-enclave shared memory area. The shared memory area is not mapped
> into the enclave or the task mapping it, which makes its memory use
> opaque (including to the OOM killer). Having lots of hard to find
> memory around is problematic, especially when there is no limit.
>
> Introduce a global counter that can be used to limit the number of pages
> that enclaves are able to consume for backing storage. This parameter
> is a percentage value that is used in conjunction with the number of
> EPC pages in the system to set a cap on the amount of backing RAM that
> can be consumed.
>
> The overcommit percentage value is 150, which limits the total number of
> shared memory pages that may be consumed by all enclaves as backing pages
> to 1.5X of EPC pages on the system.
>
> For example, on an SGX system that has 128MB of EPC, this would
> cap the amount of normal RAM that SGX consumes for its shared memory
> areas at 192MB.
>
> The SGX overcommit percent works differently than the core VM overcommit
> limit. Enclaves request backing pages one page at a time, and the number
> of in use backing pages that are allowed is a global resource that is
> limited for all enclaves.
>
> Introduce a pair of functions which can be used by callers when requesting
> backing RAM pages. These functions are responsible for accounting the
> page charges. A request may return an error if the request will cause the
> counter to exceed the backing page cap.
>
> Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
> ---
>  arch/x86/kernel/cpu/sgx/main.c | 46 ++++++++++++++++++++++++++++++++++
>  arch/x86/kernel/cpu/sgx/sgx.h  |  2 ++
>  2 files changed, 48 insertions(+)
>
> diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
> index 2857a49f2335..8a7bfe0d9023 100644
> --- a/arch/x86/kernel/cpu/sgx/main.c
> +++ b/arch/x86/kernel/cpu/sgx/main.c
> @@ -43,6 +43,45 @@ static struct sgx_numa_node *sgx_numa_nodes;
>
>  static LIST_HEAD(sgx_dirty_page_list);
>
> +/*
> + * Limits the amount of normal RAM that SGX can consume for EPC
> + * overcommit to the total EPC pages * sgx_overcommit_percent / 100
> + */
> +static int sgx_overcommit_percent = 150;
> +
> +/* The number of pages that can be allocated globally for backing storage. */
> +static atomic_long_t sgx_nr_available_backing_pages;
> +
> +/**
> + * sgx_charge_mem() - charge for a page used for backing storage
> + *
> + * Backing storage usage is capped by the sgx_nr_available_backing_pages.
> + * If the backing storage usage is over the overcommit limit,
> + * return an error.
> + *
> + * Return:
> + * 0: The page requested does not exceed the limit
> + * -ENOMEM: The page requested exceeds the overcommit limit
> + */
> +int sgx_charge_mem(void)
> +{
> +        if (!atomic_long_add_unless(&sgx_nr_available_backing_pages, -1, 0))
> +                return -ENOMEM;
> +
> +        return 0;
> +}
> +
> +/**
> + * sgx_uncharge_mem() - uncharge a page previously used for backing storage
> + *
> + * When backing storage is no longer in use, increment the
> + * sgx_nr_available_backing_pages counter.
> + */
> +void sgx_uncharge_mem(void)
> +{
> +        atomic_long_inc(&sgx_nr_available_backing_pages);
> +}
> +
>  /*
>   * Reset post-kexec EPC pages to the uninitialized state. The pages are removed
>   * from the input list, and made available for the page allocator. SECS pages
> @@ -786,6 +825,8 @@ static bool __init sgx_page_cache_init(void)
>          u64 pa, size;
>          int nid;
>          int i;
> +        u64 total_epc_bytes = 0;
> +        u64 available_backing_bytes;

Use reverse-christmas-tree declaration order here. This was required for
the original patch set, so I expect this to still hold.

>
>          sgx_numa_nodes = kmalloc_array(num_possible_nodes(), sizeof(*sgx_numa_nodes), GFP_KERNEL);
>          if (!sgx_numa_nodes)
> @@ -830,6 +871,7 @@ static bool __init sgx_page_cache_init(void)
>
>                  sgx_epc_sections[i].node = &sgx_numa_nodes[nid];
>                  sgx_numa_nodes[nid].size += size;
> +                total_epc_bytes += size;
>
>                  sgx_nr_epc_sections++;
>          }
> @@ -839,6 +881,10 @@ static bool __init sgx_page_cache_init(void)
>                  return false;
>          }
>
> +        available_backing_bytes = total_epc_bytes * (sgx_overcommit_percent / 100);
> +

IMHO this empty line could be removed, and group these statements together.

> +        atomic_long_set(&sgx_nr_available_backing_pages, available_backing_bytes >> PAGE_SHIFT);
> +
>          return true;
>  }
>
> diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
> index 0f17def9fe6f..3507a9983fc1 100644
> --- a/arch/x86/kernel/cpu/sgx/sgx.h
> +++ b/arch/x86/kernel/cpu/sgx/sgx.h
> @@ -89,6 +89,8 @@ void sgx_free_epc_page(struct sgx_epc_page *page);
>  void sgx_mark_page_reclaimable(struct sgx_epc_page *page);
>  int sgx_unmark_page_reclaimable(struct sgx_epc_page *page);
>  struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim);
> +int sgx_charge_mem(void);
> +void sgx_uncharge_mem(void);
>
>  #ifdef CONFIG_X86_SGX_KVM
>  int __init sgx_vepc_init(void);
> --
> 2.20.1
>

Other than that, looks good to me.

BR, Jarkko
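For reference, "reverse christmas tree" orders local variable declarations
from the longest line down to the shortest. A sketch of the reordered
declarations from the hunk above (illustrative only, not from the patch):

        /* Reverse christmas tree: longest declaration line first. */
        u64 available_backing_bytes;
        u64 total_epc_bytes = 0;
        u64 pa, size;
        int nid;
        int i;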
On Fri, 07 Jan 2022 13:16:12 -0600, Kristen Carlson Accardi
<kristen@linux.intel.com> wrote:
> On Fri, 2022-01-07 at 10:46 -0800, Dave Hansen wrote:
>> On 1/7/22 10:16, Kristen Carlson Accardi wrote:
>> > The overcommit percentage value is 150, which limits the total number of
>> > shared memory pages that may be consumed by all enclaves as backing pages
>> > to 1.5X of EPC pages on the system.
>>
>> Hi Kristen,
>>
>> Could you give some background on how this value was chosen and how it
>> might impact userspace?
>
> Yes. The value of 1.5x the number of EPC pages was chosen because it
> handles the most common case of a few enclaves that don't need much
> overcommit without any impact to user space. In the less common case
> where there are many enclaves, or a few large enclaves with big EPC
> memory requirements that need a lot of overcommit, the reclaimer may
> fail to allocate a backing page for swapping once the limit has been
> reached. In that case the page cannot be reclaimed, and the system will
> not be able to allocate any new EPC pages. Any ioctl or call that adds
> EPC pages will get -ENOMEM: for example, new enclaves will fail to load,
> and new EPC pages cannot be added to existing enclaves.
>
> Does that make sense?

If the system has a ton of RAM but limited EPC, I think it makes sense
to allow more EPC swapping. Can we do min(0.5*RAM, 2*EPC)? I suppose if
the system is used for a heavy enclave load, the user would be willing
to use at least half of RAM.

Thanks
Haitao
On 1/11/22 06:20, Haitao Huang wrote:
> If the system has a ton of RAM but limited EPC, I think it makes sense
> to allow more EPC swapping. Can we do min(0.5*RAM, 2*EPC)?
> I suppose if the system is used for a heavy enclave load, the user
> would be willing to use at least half of RAM.

If I have 100GB of RAM and 100MB of EPC, can I really *meaningfully* run
50GB of enclaves? In that case, if everything was swapped out evenly, a
given page reference would fault 499 times out of 500.

This isn't about a "heavy enclave load". If there is *that* much
swapped-out enclave memory, will an enclave even make meaningful forward
progress?
On Tue, 11 Jan 2022 09:43:35 -0600, Dave Hansen <dave.hansen@intel.com>
wrote:
> On 1/11/22 06:20, Haitao Huang wrote:
>> If the system has a ton of RAM but limited EPC, I think it makes sense
>> to allow more EPC swapping. Can we do min(0.5*RAM, 2*EPC)?
>> I suppose if the system is used for a heavy enclave load, the user
>> would be willing to use at least half of RAM.
>
> If I have 100GB of RAM and 100MB of EPC, can I really *meaningfully* run
> 50GB of enclaves? In that case, if everything was swapped out evenly, a
> given page reference would fault 499 times out of 500.

The formula would cap swapping at 2*EPC, so only 200MB would be swapped
out, and at least 1/3 of enclave pages would stay resident in EPC.
The original hard-coded cap of 1.5*EPC may still consume too much RAM
if RAM < 1.5*EPC.

> This isn't about a "heavy enclave load". If there is *that* much
> swapped-out enclave memory, will an enclave even make meaningful
> forward progress?
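Read concretely, Haitao's proposal might look something like the sketch
below at init time. The helper name is hypothetical; totalram_pages() is
the kernel's count of usable RAM pages, and the 0.5x/2x factors are the
suggested values, not anything from the patch:

        /*
         * Sketch of the proposed cap, in pages: min(50% of RAM, 200% of
         * EPC). The patch under review instead uses a flat 150% of EPC.
         */
        static unsigned long sgx_backing_page_cap(u64 total_epc_bytes)
        {
                unsigned long epc_pages = total_epc_bytes >> PAGE_SHIFT;

                return min(totalram_pages() / 2, 2 * epc_pages);
        }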
On 1/11/22 08:33, Haitao Huang wrote:
> On Tue, 11 Jan 2022 09:43:35 -0600, Dave Hansen <dave.hansen@intel.com>
> wrote:
>> On 1/11/22 06:20, Haitao Huang wrote:
>>> If the system has a ton of RAM but limited EPC, I think it makes
>>> sense to allow more EPC swapping. Can we do min(0.5*RAM, 2*EPC)?
>>> I suppose if the system is used for a heavy enclave load, the user
>>> would be willing to use at least half of RAM.
>>
>> If I have 100GB of RAM and 100MB of EPC, can I really *meaningfully*
>> run 50GB of enclaves? In that case, if everything was swapped out
>> evenly, a given page reference would fault 499 times out of 500.
>
> The formula would cap swapping at 2*EPC, so only 200MB would be swapped
> out, and at least 1/3 of enclave pages would stay resident in EPC.
> The original hard-coded cap of 1.5*EPC may still consume too much RAM
> if RAM < 1.5*EPC.

Oh, sorry, I read that backwards.

Basing it on the amount of RAM is a bit nasty. You might either overly
restrict the amount of allowed EPC, or you have to handle hotplug.
On Tue, 2022-01-11 at 09:39 -0800, Dave Hansen wrote:
> On 1/11/22 08:33, Haitao Huang wrote:
> > On Tue, 11 Jan 2022 09:43:35 -0600, Dave Hansen
> > <dave.hansen@intel.com> wrote:
> > > On 1/11/22 06:20, Haitao Huang wrote:
> > > > If the system has a ton of RAM but limited EPC, I think it makes
> > > > sense to allow more EPC swapping. Can we do min(0.5*RAM, 2*EPC)?
> > > > I suppose if the system is used for a heavy enclave load, the
> > > > user would be willing to use at least half of RAM.
> > >
> > > If I have 100GB of RAM and 100MB of EPC, can I really
> > > *meaningfully* run 50GB of enclaves? In that case, if everything
> > > was swapped out evenly, a given page reference would fault 499
> > > times out of 500.
> >
> > The formula would cap swapping at 2*EPC, so only 200MB would be
> > swapped out, and at least 1/3 of enclave pages would stay resident
> > in EPC. The original hard-coded cap of 1.5*EPC may still consume
> > too much RAM if RAM < 1.5*EPC.
>
> Oh, sorry, I read that backwards.
>
> Basing it on the amount of RAM is a bit nasty. You might either overly
> restrict the amount of allowed EPC, or you have to handle hotplug.

My opinion is that we should keep the current algorithm for now, as it
is pretty straightforward, and cgroups will eventually allow for more
control.
On Sat, 2022-01-08 at 17:59 +0200, Jarkko Sakkinen wrote:
> On Fri, Jan 07, 2022 at 10:16:16AM -0800, Kristen Carlson Accardi wrote:
> > When the system runs out of enclave memory, SGX can reclaim EPC pages
> > by swapping to normal RAM. This normal RAM is allocated via a
> > per-enclave shared memory area. The shared memory area is not mapped
> > into the enclave or the task mapping it, which makes its memory use
> > opaque (including to the OOM killer). Having lots of hard to find
> > memory around is problematic, especially when there is no limit.

[...]

> > @@ -786,6 +825,8 @@ static bool __init sgx_page_cache_init(void)
> >          u64 pa, size;
> >          int nid;
> >          int i;
> > +        u64 total_epc_bytes = 0;
> > +        u64 available_backing_bytes;
>
> Use reverse-christmas-tree declaration order here. This was required
> for the original patch set, so I expect this to still hold.

Sure.

[...]

> > +        available_backing_bytes = total_epc_bytes * (sgx_overcommit_percent / 100);
> > +
>
> IMHO this empty line could be removed, and group these statements
> together.

OK.

> > +        atomic_long_set(&sgx_nr_available_backing_pages, available_backing_bytes >> PAGE_SHIFT);
> > +
> >          return true;
> >  }

[...]

> Other than that, looks good to me.

Thanks for taking another look. I will incorporate your suggestions into
the next revision.
diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index 2857a49f2335..8a7bfe0d9023 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -43,6 +43,45 @@ static struct sgx_numa_node *sgx_numa_nodes;
 
 static LIST_HEAD(sgx_dirty_page_list);
 
+/*
+ * Limits the amount of normal RAM that SGX can consume for EPC
+ * overcommit to the total EPC pages * sgx_overcommit_percent / 100
+ */
+static int sgx_overcommit_percent = 150;
+
+/* The number of pages that can be allocated globally for backing storage. */
+static atomic_long_t sgx_nr_available_backing_pages;
+
+/**
+ * sgx_charge_mem() - charge for a page used for backing storage
+ *
+ * Backing storage usage is capped by the sgx_nr_available_backing_pages.
+ * If the backing storage usage is over the overcommit limit,
+ * return an error.
+ *
+ * Return:
+ * 0: The page requested does not exceed the limit
+ * -ENOMEM: The page requested exceeds the overcommit limit
+ */
+int sgx_charge_mem(void)
+{
+        if (!atomic_long_add_unless(&sgx_nr_available_backing_pages, -1, 0))
+                return -ENOMEM;
+
+        return 0;
+}
+
+/**
+ * sgx_uncharge_mem() - uncharge a page previously used for backing storage
+ *
+ * When backing storage is no longer in use, increment the
+ * sgx_nr_available_backing_pages counter.
+ */
+void sgx_uncharge_mem(void)
+{
+        atomic_long_inc(&sgx_nr_available_backing_pages);
+}
+
 /*
  * Reset post-kexec EPC pages to the uninitialized state. The pages are removed
  * from the input list, and made available for the page allocator. SECS pages
@@ -786,6 +825,8 @@ static bool __init sgx_page_cache_init(void)
         u64 pa, size;
         int nid;
         int i;
+        u64 total_epc_bytes = 0;
+        u64 available_backing_bytes;
 
         sgx_numa_nodes = kmalloc_array(num_possible_nodes(), sizeof(*sgx_numa_nodes), GFP_KERNEL);
         if (!sgx_numa_nodes)
@@ -830,6 +871,7 @@ static bool __init sgx_page_cache_init(void)
 
                 sgx_epc_sections[i].node = &sgx_numa_nodes[nid];
                 sgx_numa_nodes[nid].size += size;
+                total_epc_bytes += size;
 
                 sgx_nr_epc_sections++;
         }
@@ -839,6 +881,10 @@ static bool __init sgx_page_cache_init(void)
                 return false;
         }
 
+        available_backing_bytes = total_epc_bytes * (sgx_overcommit_percent / 100);
+
+        atomic_long_set(&sgx_nr_available_backing_pages, available_backing_bytes >> PAGE_SHIFT);
+
         return true;
 }
 
diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
index 0f17def9fe6f..3507a9983fc1 100644
--- a/arch/x86/kernel/cpu/sgx/sgx.h
+++ b/arch/x86/kernel/cpu/sgx/sgx.h
@@ -89,6 +89,8 @@ void sgx_free_epc_page(struct sgx_epc_page *page);
 void sgx_mark_page_reclaimable(struct sgx_epc_page *page);
 int sgx_unmark_page_reclaimable(struct sgx_epc_page *page);
 struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim);
+int sgx_charge_mem(void);
+void sgx_uncharge_mem(void);
 
 #ifdef CONFIG_X86_SGX_KVM
 int __init sgx_vepc_init(void);
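One observation on the cap computation above: sgx_overcommit_percent is
an int, so the parenthesized (sgx_overcommit_percent / 100) truncates
150 / 100 to 1 under C integer division, which yields a 1.0x cap rather
than the stated 1.5x of EPC. A sketch that preserves the fraction
multiplies before dividing:

        /*
         * Sketch: compute the product first so 150 / 100 does not
         * truncate to 1. total_epc_bytes * 150 cannot overflow a u64
         * for any realistic EPC size.
         */
        available_backing_bytes = total_epc_bytes * sgx_overcommit_percent / 100;

With 128MB of EPC, this evaluates to the 192MB cap that the commit
message below describes.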
When the system runs out of enclave memory, SGX can reclaim EPC pages
by swapping to normal RAM. This normal RAM is allocated via a
per-enclave shared memory area. The shared memory area is not mapped
into the enclave or the task mapping it, which makes its memory use
opaque (including to the OOM killer). Having lots of hard to find
memory around is problematic, especially when there is no limit.

Introduce a global counter that can be used to limit the number of pages
that enclaves are able to consume for backing storage. This parameter
is a percentage value that is used in conjunction with the number of
EPC pages in the system to set a cap on the amount of backing RAM that
can be consumed.

The overcommit percentage value is 150, which limits the total number of
shared memory pages that may be consumed by all enclaves as backing pages
to 1.5X of EPC pages on the system.

For example, on an SGX system that has 128MB of EPC, this would
cap the amount of normal RAM that SGX consumes for its shared memory
areas at 192MB.

The SGX overcommit percent works differently than the core VM overcommit
limit. Enclaves request backing pages one page at a time, and the number
of in use backing pages that are allowed is a global resource that is
limited for all enclaves.

Introduce a pair of functions which can be used by callers when requesting
backing RAM pages. These functions are responsible for accounting the
page charges. A request may return an error if the request will cause the
counter to exceed the backing page cap.

Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
---
 arch/x86/kernel/cpu/sgx/main.c | 46 ++++++++++++++++++++++++++++++++++
 arch/x86/kernel/cpu/sgx/sgx.h  |  2 ++
 2 files changed, 48 insertions(+)