Message ID | 20240617125321.36658-9-haitao.huang@linux.intel.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | Add Cgroup support for SGX EPC memory | expand |
On 18/06/2024 12:53 am, Huang, Haitao wrote: > From: Kristen Carlson Accardi <kristen@linux.intel.com> > > Currently in the EPC page allocation, the kernel simply fails the > allocation when the current EPC cgroup fails to charge due to its usage > reaching limit. This is not ideal. When that happens, a better way is > to reclaim EPC page(s) from the current EPC cgroup (and/or its > descendants) to reduce its usage so the new allocation can succeed. > > Add the basic building blocks to support per-cgroup reclamation. > > Currently the kernel only has one place to reclaim EPC pages: the global > EPC LRU list. To support the "per-cgroup" EPC reclaim, maintain an LRU > list for each EPC cgroup, and introduce a "cgroup" variant function to > reclaim EPC pages from a given EPC cgroup and its descendants. > > Currently the kernel does the global EPC reclaim in sgx_reclaim_page(). > It always tries to reclaim EPC pages in batch of SGX_NR_TO_SCAN (16) > pages. Specifically, it always "scans", or "isolates" SGX_NR_TO_SCAN > pages from the global LRU, and then tries to reclaim these pages at once > for better performance. > > Implement the "cgroup" variant EPC reclaim in a similar way, but keep > the implementation simple: 1) change sgx_reclaim_pages() to take an LRU > as input, and return the pages that are "scanned" and attempted for > reclamation (but not necessarily reclaimed successfully); 2) loop the > given EPC cgroup and its descendants and do the new sgx_reclaim_pages() > until SGX_NR_TO_SCAN pages are "scanned". > > This implementation, encapsulated in sgx_cgroup_reclaim_pages(), always > tries to reclaim SGX_NR_TO_SCAN pages from the LRU of the given EPC > cgroup, and only moves to its descendants when there's no enough > reclaimable EPC pages to "scan" in its LRU. It should be enough for > most cases. [...] > In other cases, the caller may invoke this function in a > loop to ensure enough pages reclaimed for its usage. To ensure all > descendant groups scanned in a round-robin fashion in those cases, > sgx_cgroup_reclaim_pages() takes in a starting cgroup and returns the > next cgroup that the caller can pass in as the new starting cgroup for a > subsequent call. AFAICT this part is new, and I believe this "round-robin" thing is just for the "global reclaim"? Or is it also for per-cgroup reclaim where more than SGX_NR_TO_SCAN pages needs to be reclaimed? I wish the changelog should just point out what consumers will use this new sgx_cgroup_reclaim_pages(), like: The sgx_cgroup_reclaim_pages() will be used in three cases: 1) direct/sync per-cgroup reclaim in try_charge() 2) indirect/async per-cgroup reclaim triggered in try_charge() 3) global reclaim And then describe how will they use sgx_cgroup_reclaim_pages(): Both 1) and 2) can result in needing to reclaim more than SGX_NR_TO_SCAN pages, in which case we should <fill in how to reclaim>. For 3), the new global reclaim should try tot match the existing global reclaim behaviour, that is to try to treat all EPC pages equally. <continue to explain how can sgx_cgroup_reclaim_pages() achieve this.> With above context, we can justify why to make sgx_cgroup_reclaim_pages() in this form. > > Note, this simple implementation doesn't _exactly_ mimic the current > global EPC reclaim (which always tries to do the actual reclaim in batch > of SGX_NR_TO_SCAN pages): when LRUs have less than SGX_NR_TO_SCAN > reclaimable pages, the actual reclaim of EPC pages will be split into > smaller batches _across_ multiple LRUs with each being smaller than > SGX_NR_TO_SCAN pages. > > A more precise way to mimic the current global EPC reclaim would be to > have a new function to only "scan" (or "isolate") SGX_NR_TO_SCAN pages > _across_ the given EPC cgroup _AND_ its descendants, and then do the > actual reclaim in one batch. But this is unnecessarily complicated at > this stage. > > Alternatively, the current sgx_reclaim_pages() could be changed to > return the actual "reclaimed" pages, but not "scanned" pages. However, > the reclamation is a lengthy process, forcing a successful reclamation > of predetermined number of pages may block the caller for too long. And > that may not be acceptable in some synchronous contexts, e.g., in > serving an ioctl(). > > With this building block in place, add synchronous reclamation support > in sgx_cgroup_try_charge(): trigger a call to > sgx_cgroup_reclaim_pages() if the cgroup reaches its limit and the > caller allows synchronous reclaim as indicated by s newly added > parameter. > > A later patch will add support for asynchronous reclamation reusing > sgx_cgroup_reclaim_pages(). It seems you also should mention the new global reclaim will also use this sgx_cgroup_reclaim_pages()? [...] > +/** > + * sgx_cgroup_reclaim_pages() - reclaim EPC from a cgroup tree > + * @root: The root of cgroup tree to reclaim from. > + * @start: The descendant cgroup from which to start the tree walking. > + * > + * This function performs a pre-order walk in the cgroup tree under the given > + * root, starting from the node %start, or from the root if %start is NULL. The > + * function will attempt to reclaim pages at each node until a fixed number of > + * pages (%SGX_NR_TO_SCAN) are attempted for reclamation. No guarantee of > + * success on the actual reclamation process. In extreme cases, if all pages in > + * front of the LRUs are recently accessed, i.e., considered "too young" to > + * reclaim, no page will actually be reclaimed after walking the whole tree. > + * > + * In some cases, a caller may want to ensure enough reclamation until its > + * specific need is met. In those cases, the caller should invoke this function > + * in a loop, and at each iteration passes in the same root and the next node > + * returned from the previous call as the new %start. > + * > + * Return: The next misc cgroup in the subtree to continue the scanning and > + * attempt for more reclamation from this subtree if needed. > [...] > Caller must > + * release the reference if the returned is not used as %start for a subsequent > + * call. > This sentence isn't clear to me. First of all, release the reference "of what"? The %start, or the one returned by this function? And is it because of ... > + */ > +static struct misc_cg *sgx_cgroup_reclaim_pages(struct misc_cg *root, struct misc_cg *start) > +{ > + struct cgroup_subsys_state *css_root, *pos; > + struct cgroup_subsys_state *next = NULL; > + struct sgx_cgroup *sgx_cg; > + unsigned int cnt = 0; > + > + /* Caller must ensure css_root and start ref's acquired */ ... the caller must acquire the ref of both @css_root and @css_start, and ... > + css_root = &root->css; > + if (start) > + pos = &start->css; > + else > + pos = css_root; > + > + while (cnt < SGX_NR_TO_SCAN) { > + sgx_cg = sgx_cgroup_from_misc_cg(css_misc(pos)); > + cnt += sgx_reclaim_pages(&sgx_cg->lru); > + > + rcu_read_lock(); > + > + next = css_next_descendant_pre(pos, css_root); > + > + if (pos != css_root) > + css_put(pos); ... the ref is decreased internally? > + > + if (!next || !css_tryget(next)) { > + /* We are done if next is NULL or not safe to continue > + * the walk if next is dead. Return NULL and the caller > + * determines whether to restart from root. > + */ Incorrect comment style. > + rcu_read_unlock(); > + return NULL; > + } > + > + rcu_read_unlock(); > + pos = next; There's no ref grab here, wouldn't the above ... if (pos != css_root) css_put(pos); ... decrease the ref w/o having it been increased? > + } > + > + return css_misc(next); Here AFAICT the ref isn't increased, but ... [...] > +/** > + * sgx_cgroup_try_charge() - try to charge cgroup for a single EPC page > * @sgx_cg: The EPC cgroup to be charged for the page. > + * @reclaim: Whether or not synchronous EPC reclaim is allowed. > * Return: > * * %0 - If successfully charged. > * * -errno - for failures. > */ > -int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg) > +int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg, enum sgx_reclaim reclaim) > { > - return misc_cg_try_charge(MISC_CG_RES_SGX_EPC, sgx_cg->cg, PAGE_SIZE); > + int ret; > + struct misc_cg *cg_next = NULL; > + > + for (;;) { > + ret = __sgx_cgroup_try_charge(sgx_cg); > + > + if (ret != -EBUSY) > + goto out; > + > + if (reclaim == SGX_NO_RECLAIM) { > + ret = -ENOMEM; > + goto out; > + } > + > + cg_next = sgx_cgroup_reclaim_pages(sgx_cg->cg, cg_next); > + cond_resched(); > + } > + > +out: > + if (cg_next != sgx_cg->cg) > + put_misc_cg(cg_next); ... if I am reading correctly, here you does the put anyway. > + return ret; > } > And when there are more than SGX_NR_TO_SCAN pages that need to reclaim, the above ... for (;;) { cg_next = sgx_cgroup_reclaim_pages(sgx_cg->cg, cg_next); } ... actually tries to reclaim those pages from @sgx_cg _AND_ it's descendants, and tries to do it _EQUALLY_. Is this desired, or should we always try to reclaim from the @sgx_cg first, but only moves to the desendants when the @sgx_cg shouldn't be reclaimed anymore? Anyway, it's different from the previous behaviour. [...] > -static bool sgx_should_reclaim(unsigned long watermark) > +static bool sgx_should_reclaim_global(unsigned long watermark) > { > return atomic_long_read(&sgx_nr_free_pages) < watermark && > !list_empty(&sgx_global_lru.reclaimable); > } > > +static void sgx_reclaim_pages_global(void) > +{ > + sgx_reclaim_pages(&sgx_global_lru); > +} > + > /* > * sgx_reclaim_direct() should be called (without enclave's mutex held) > * in locations where SGX memory resources might be low and might be > @@ -394,8 +405,8 @@ static bool sgx_should_reclaim(unsigned long watermark) > */ > void sgx_reclaim_direct(void) > { > - if (sgx_should_reclaim(SGX_NR_LOW_PAGES)) > - sgx_reclaim_pages(); > + if (sgx_should_reclaim_global(SGX_NR_LOW_PAGES)) > + sgx_reclaim_pages_global(); > } > I wish the rename was mentioned in the changelog too.
On Thu, 20 Jun 2024 08:28:57 -0500, Huang, Kai <kai.huang@intel.com> wrote: > > On 18/06/2024 12:53 am, Huang, Haitao wrote: >> From: Kristen Carlson Accardi <kristen@linux.intel.com> >> >> Currently in the EPC page allocation, the kernel simply fails the >> allocation when the current EPC cgroup fails to charge due to its usage >> reaching limit. This is not ideal. When that happens, a better way is >> to reclaim EPC page(s) from the current EPC cgroup (and/or its >> descendants) to reduce its usage so the new allocation can succeed. >> >> Add the basic building blocks to support per-cgroup reclamation. >> >> Currently the kernel only has one place to reclaim EPC pages: the global >> EPC LRU list. To support the "per-cgroup" EPC reclaim, maintain an LRU >> list for each EPC cgroup, and introduce a "cgroup" variant function to >> reclaim EPC pages from a given EPC cgroup and its descendants. >> >> Currently the kernel does the global EPC reclaim in sgx_reclaim_page(). >> It always tries to reclaim EPC pages in batch of SGX_NR_TO_SCAN (16) >> pages. Specifically, it always "scans", or "isolates" SGX_NR_TO_SCAN >> pages from the global LRU, and then tries to reclaim these pages at once >> for better performance. >> >> Implement the "cgroup" variant EPC reclaim in a similar way, but keep >> the implementation simple: 1) change sgx_reclaim_pages() to take an LRU >> as input, and return the pages that are "scanned" and attempted for >> reclamation (but not necessarily reclaimed successfully); 2) loop the >> given EPC cgroup and its descendants and do the new sgx_reclaim_pages() >> until SGX_NR_TO_SCAN pages are "scanned". >> This implementation, encapsulated in sgx_cgroup_reclaim_pages(), always >> tries to reclaim SGX_NR_TO_SCAN pages from the LRU of the given EPC >> cgroup, and only moves to its descendants when there's no enough >> reclaimable EPC pages to "scan" in its LRU. It should be enough for >> most cases. > > [...] > >> In other cases, the caller may invoke this function in a >> loop to ensure enough pages reclaimed for its usage. To ensure all >> descendant groups scanned in a round-robin fashion in those cases, >> sgx_cgroup_reclaim_pages() takes in a starting cgroup and returns the >> next cgroup that the caller can pass in as the new starting cgroup for a >> subsequent call. > > > AFAICT this part is new, and I believe this "round-robin" thing is just > for the "global reclaim"? Or is it also for per-cgroup reclaim where > more > than SGX_NR_TO_SCAN pages needs to be reclaimed? > > I wish the changelog should just point out what consumers will use this > new sgx_cgroup_reclaim_pages(), like: > > The sgx_cgroup_reclaim_pages() will be used in three cases: > > 1) direct/sync per-cgroup reclaim in try_charge() > 2) indirect/async per-cgroup reclaim triggered in try_charge() > 3) global reclaim > > And then describe how will they use sgx_cgroup_reclaim_pages(): > > Both 1) and 2) can result in needing to reclaim more than SGX_NR_TO_SCAN > pages, in which case we should <fill in how to reclaim>. > > For 3), the new global reclaim should try tot match the existing global > reclaim behaviour, that is to try to treat all EPC pages equally. > <continue to explain how can sgx_cgroup_reclaim_pages() achieve this.> > > With above context, we can justify why to make sgx_cgroup_reclaim_pages() > in this form. > This new part is only to address the issue you raised in this thread: https://lore.kernel.org/lkml/op.2ndsydgywjvjmi@hhuan26-mobl.amr.corp.intel.com/ Really it has nothing to do whether global, direct/async, per-cgroup contexts. They all should use the function the same way. This paragraph describes the design and I thought the above new statements justify the reason we return 'next' so it can reclaim into descedant in round-robin fashion? No sure we need get into details of different usages of the functions which are in code actually? I'll address the new global reclamation behavior later when this function was actually used for that. And the difference really exists before and not because of this change. >> >> Note, this simple implementation doesn't _exactly_ mimic the current >> global EPC reclaim (which always tries to do the actual reclaim in batch >> of SGX_NR_TO_SCAN pages): when LRUs have less than SGX_NR_TO_SCAN >> reclaimable pages, the actual reclaim of EPC pages will be split into >> smaller batches _across_ multiple LRUs with each being smaller than >> SGX_NR_TO_SCAN pages. >> >> A more precise way to mimic the current global EPC reclaim would be to >> have a new function to only "scan" (or "isolate") SGX_NR_TO_SCAN pages >> _across_ the given EPC cgroup _AND_ its descendants, and then do the >> actual reclaim in one batch. But this is unnecessarily complicated at >> this stage. >> >> Alternatively, the current sgx_reclaim_pages() could be changed to >> return the actual "reclaimed" pages, but not "scanned" pages. However, >> the reclamation is a lengthy process, forcing a successful reclamation >> of predetermined number of pages may block the caller for too long. And >> that may not be acceptable in some synchronous contexts, e.g., in >> serving an ioctl(). >> >> With this building block in place, add synchronous reclamation support >> in sgx_cgroup_try_charge(): trigger a call to >> sgx_cgroup_reclaim_pages() if the cgroup reaches its limit and the >> caller allows synchronous reclaim as indicated by s newly added >> parameter. >> >> A later patch will add support for asynchronous reclamation reusing >> sgx_cgroup_reclaim_pages(). > > It seems you also should mention the new global reclaim will also use > this sgx_cgroup_reclaim_pages()? > > [...] > >> +/** >> + * sgx_cgroup_reclaim_pages() - reclaim EPC from a cgroup tree >> + * @root: The root of cgroup tree to reclaim from. >> + * @start: The descendant cgroup from which to start the tree walking. >> + * >> + * This function performs a pre-order walk in the cgroup tree under >> the given >> + * root, starting from the node %start, or from the root if %start is >> NULL. The >> + * function will attempt to reclaim pages at each node until a fixed >> number of >> + * pages (%SGX_NR_TO_SCAN) are attempted for reclamation. No guarantee >> of >> + * success on the actual reclamation process. In extreme cases, if all >> pages in >> + * front of the LRUs are recently accessed, i.e., considered "too >> young" to >> + * reclaim, no page will actually be reclaimed after walking the whole >> tree. >> + * >> + * In some cases, a caller may want to ensure enough reclamation until >> its >> + * specific need is met. In those cases, the caller should invoke this >> function >> + * in a loop, and at each iteration passes in the same root and the >> next node >> + * returned from the previous call as the new %start. >> + * >> + * Return: The next misc cgroup in the subtree to continue the >> scanning and >> + * attempt for more reclamation from this subtree if needed. > > [...] > >> Caller must >> + * release the reference if the returned is not used as %start for a >> subsequent >> + * call. >> > > This sentence isn't clear to me. > > First of all, release the reference "of what"? The %start, or the one > returned by this function? > will this be better? * Return: A reference to next misc cgroup in the subtree to continue the * scanning and attempt for more reclamation from this subtree if needed. * Caller must release the returned reference if the returned is not used as * %start for a subsequent call. > And is it because of ... > >> + */ >> +static struct misc_cg *sgx_cgroup_reclaim_pages(struct misc_cg *root, >> struct misc_cg *start) >> +{ >> + struct cgroup_subsys_state *css_root, *pos; >> + struct cgroup_subsys_state *next = NULL; >> + struct sgx_cgroup *sgx_cg; >> + unsigned int cnt = 0; >> + >> + /* Caller must ensure css_root and start ref's acquired */ > > ... the caller must acquire the ref of both @css_root and @css_start, and > ... yes > >> + css_root = &root->css; >> + if (start) >> + pos = &start->css; >> + else >> + pos = css_root; >> + >> + while (cnt < SGX_NR_TO_SCAN) { >> + sgx_cg = sgx_cgroup_from_misc_cg(css_misc(pos)); >> + cnt += sgx_reclaim_pages(&sgx_cg->lru); >> + >> + rcu_read_lock(); >> + >> + next = css_next_descendant_pre(pos, css_root); >> + >> + if (pos != css_root) >> + css_put(pos); > > ... the ref is decreased internally? > Right. >> + >> + if (!next || !css_tryget(next)) { >> + /* We are done if next is NULL or not safe to continue >> + * the walk if next is dead. Return NULL and the caller >> + * determines whether to restart from root. >> + */ > > Incorrect comment style. > Will fix >> + rcu_read_unlock(); >> + return NULL; >> + } >> + >> + rcu_read_unlock(); >> + pos = next; > > There's no ref grab here, wouldn't the above ... > tryget() done to next > if (pos != css_root) > css_put(pos); > > ... decrease the ref w/o having it been increased? > >> + } >> + >> + return css_misc(next); > > Here AFAICT the ref isn't increased, but ... > We only return next when next && css_tryget(next), otherwise NULL. > [...] > > >> +/** >> + * sgx_cgroup_try_charge() - try to charge cgroup for a single EPC page >> * @sgx_cg: The EPC cgroup to be charged for the page. >> + * @reclaim: Whether or not synchronous EPC reclaim is allowed. >> * Return: >> * * %0 - If successfully charged. >> * * -errno - for failures. >> */ >> -int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg) >> +int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg, enum sgx_reclaim >> reclaim) >> { >> - return misc_cg_try_charge(MISC_CG_RES_SGX_EPC, sgx_cg->cg, PAGE_SIZE); >> + int ret; >> + struct misc_cg *cg_next = NULL; >> + >> + for (;;) { >> + ret = __sgx_cgroup_try_charge(sgx_cg); >> + >> + if (ret != -EBUSY) >> + goto out; >> + >> + if (reclaim == SGX_NO_RECLAIM) { >> + ret = -ENOMEM; >> + goto out; >> + } >> + >> + cg_next = sgx_cgroup_reclaim_pages(sgx_cg->cg, cg_next); >> + cond_resched(); >> + } >> + >> +out: >> + if (cg_next != sgx_cg->cg) >> + put_misc_cg(cg_next); > > ... if I am reading correctly, here you does the put anyway. > cg_next ref is increased or it is NULL. Only puts the last one no longer to be passed back in to sgx_cgroup_reclaim_page() >> + return ret; >> } >> > > And when there are more than SGX_NR_TO_SCAN pages that need to reclaim, > the above ... Note, all sgx_cgroup_reclaim_pages() guarantees is scanning SGX_NR_TO_SCAN pages. > > for (;;) { > cg_next = sgx_cgroup_reclaim_pages(sgx_cg->cg, cg_next); > } > > ... actually tries to reclaim those pages from @sgx_cg _AND_ it's > descendants, and tries to do it _EQUALLY_. > > Is this desired, or should we always try to reclaim from the @sgx_cg > first, but only moves to the desendants when the @sgx_cg shouldn't be > reclaimed anymore? > we still reclaim in sgx_cg in first scan and attempt of reclaiming for SGX_NR_TOS_CAN pages, but if it turns out that did not satisfy caller needs, then caller goes on to reclaim from descendants by passing in 'next' as starting point. > Anyway, it's different from the previous behaviour. > Again, this is to fix the issue you raised. I consider it improved behavior :-) > [...] > >> -static bool sgx_should_reclaim(unsigned long watermark) >> +static bool sgx_should_reclaim_global(unsigned long watermark) >> { >> return atomic_long_read(&sgx_nr_free_pages) < watermark && >> !list_empty(&sgx_global_lru.reclaimable); >> } >> >> +static void sgx_reclaim_pages_global(void) >> +{ >> + sgx_reclaim_pages(&sgx_global_lru); >> +} >> + >> /* >> * sgx_reclaim_direct() should be called (without enclave's mutex >> held) >> * in locations where SGX memory resources might be low and might be >> @@ -394,8 +405,8 @@ static bool sgx_should_reclaim(unsigned long >> watermark) >> */ >> void sgx_reclaim_direct(void) >> { >> - if (sgx_should_reclaim(SGX_NR_LOW_PAGES)) >> - sgx_reclaim_pages(); >> + if (sgx_should_reclaim_global(SGX_NR_LOW_PAGES)) >> + sgx_reclaim_pages_global(); >> } >> > > I wish the rename was mentioned in the changelog too. > OK
> > > > > In other cases, the caller may invoke this function in a > > > loop to ensure enough pages reclaimed for its usage. To ensure all > > > descendant groups scanned in a round-robin fashion in those cases, > > > sgx_cgroup_reclaim_pages() takes in a starting cgroup and returns the > > > next cgroup that the caller can pass in as the new starting cgroup for a > > > subsequent call. > > > > > > AFAICT this part is new, and I believe this "round-robin" thing is just > > for the "global reclaim"? Or is it also for per-cgroup reclaim where > > more > > than SGX_NR_TO_SCAN pages needs to be reclaimed? > > > > I wish the changelog should just point out what consumers will use this > > new sgx_cgroup_reclaim_pages(), like: > > > > The sgx_cgroup_reclaim_pages() will be used in three cases: > > > > 1) direct/sync per-cgroup reclaim in try_charge() > > 2) indirect/async per-cgroup reclaim triggered in try_charge() > > 3) global reclaim > > > > And then describe how will they use sgx_cgroup_reclaim_pages(): > > > > Both 1) and 2) can result in needing to reclaim more than SGX_NR_TO_SCAN > > pages, in which case we should <fill in how to reclaim>. > > > > For 3), the new global reclaim should try tot match the existing global > > reclaim behaviour, that is to try to treat all EPC pages equally. > > <continue to explain how can sgx_cgroup_reclaim_pages() achieve this.> > > > > With above context, we can justify why to make sgx_cgroup_reclaim_pages() > > in this form. > > > This new part is only to address the issue you raised in this thread: > https://lore.kernel.org/lkml/op.2ndsydgywjvjmi@hhuan26-mobl.amr.corp.intel.com/ > > Really it has nothing to do whether global, direct/async, per-cgroup > contexts. They all should use the function the same way. This paragraph > describes the design and > I thought the above new statements justify the reason we return 'next' so > it can reclaim into descedant in round-robin fashion? No sure we need get > into details of different usages of the functions which are in code > actually? Please clearly define the behaviour of "per-cgroup reclaim" first. I can understand "global reclaim" means we essentially want to treats all EPC pages equally. But it's not obvious to me what is the desired behaviour of "per-cgroup reclaim", especially when the behaviour is different between this version and the previous versions (see below). > [...] > > And when there are more than SGX_NR_TO_SCAN pages that need to reclaim, > > the above ... > > Note, all sgx_cgroup_reclaim_pages() guarantees is scanning SGX_NR_TO_SCAN > pages. > > > > for (;;) { > > cg_next = sgx_cgroup_reclaim_pages(sgx_cg->cg, cg_next); > > } > > > > ... actually tries to reclaim those pages from @sgx_cg _AND_ it's > > descendants, and tries to do it _EQUALLY_. > > > > Is this desired, or should we always try to reclaim from the @sgx_cg > > first, but only moves to the desendants when the @sgx_cg shouldn't be > > reclaimed anymore? > > > > we still reclaim in sgx_cg in first scan and attempt of reclaiming for > SGX_NR_TOS_CAN pages, but if it turns out that did not satisfy caller > needs, then caller goes on to reclaim from descendants by passing in > 'next' as starting point. But why? > > > Anyway, it's different from the previous behaviour. > > > Again, this is to fix the issue you raised. I consider it improved > behavior :-) Please clearly define the _EXPECTED_ hebaviour of "per-cgroup reclaim" first. > We have two choices: 1) Always try to reclaim desired number of pages from the given cgroup, but only moves to reclaim from descendants when there's less than SGX_NR_TO_SCAN pages left; 2) Always try to reclaim desired number of pages _EQUALLY_ from the given cgroup _AND_ its descendants (in granularity of reclaiming SGX_NR_TO_SCAN pages each time). The 1) was the old behavour in the previous versions, 2) is the new behaviour in this version. I am not against any option, but the patch needs to be clear on which option to choose and why it is desired/better.
diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c b/arch/x86/kernel/cpu/sgx/epc_cgroup.c index 73e4c0304b77..507926f37abf 100644 --- a/arch/x86/kernel/cpu/sgx/epc_cgroup.c +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c @@ -8,16 +8,158 @@ static struct sgx_cgroup sgx_cg_root; /** - * sgx_cgroup_try_charge() - try to charge cgroup for a single EPC page + * sgx_cgroup_lru_empty() - check if a cgroup tree has no pages on its LRUs + * @root: Root of the tree to check * + * Return: %true if all cgroups under the specified root have empty LRU lists. + */ +static bool sgx_cgroup_lru_empty(struct misc_cg *root) +{ + struct cgroup_subsys_state *css_root; + struct cgroup_subsys_state *pos; + struct sgx_cgroup *sgx_cg; + bool ret = true; + + /* + * Caller must ensure css_root ref acquired + */ + css_root = &root->css; + + rcu_read_lock(); + css_for_each_descendant_pre(pos, css_root) { + if (!css_tryget(pos)) + break; + + rcu_read_unlock(); + + sgx_cg = sgx_cgroup_from_misc_cg(css_misc(pos)); + + spin_lock(&sgx_cg->lru.lock); + ret = list_empty(&sgx_cg->lru.reclaimable); + spin_unlock(&sgx_cg->lru.lock); + + rcu_read_lock(); + css_put(pos); + if (!ret) + break; + } + + rcu_read_unlock(); + + return ret; +} + +/** + * sgx_cgroup_reclaim_pages() - reclaim EPC from a cgroup tree + * @root: The root of cgroup tree to reclaim from. + * @start: The descendant cgroup from which to start the tree walking. + * + * This function performs a pre-order walk in the cgroup tree under the given + * root, starting from the node %start, or from the root if %start is NULL. The + * function will attempt to reclaim pages at each node until a fixed number of + * pages (%SGX_NR_TO_SCAN) are attempted for reclamation. No guarantee of + * success on the actual reclamation process. In extreme cases, if all pages in + * front of the LRUs are recently accessed, i.e., considered "too young" to + * reclaim, no page will actually be reclaimed after walking the whole tree. + * + * In some cases, a caller may want to ensure enough reclamation until its + * specific need is met. In those cases, the caller should invoke this function + * in a loop, and at each iteration passes in the same root and the next node + * returned from the previous call as the new %start. + * + * Return: The next misc cgroup in the subtree to continue the scanning and + * attempt for more reclamation from this subtree if needed. Caller must + * release the reference if the returned is not used as %start for a subsequent + * call. + */ +static struct misc_cg *sgx_cgroup_reclaim_pages(struct misc_cg *root, struct misc_cg *start) +{ + struct cgroup_subsys_state *css_root, *pos; + struct cgroup_subsys_state *next = NULL; + struct sgx_cgroup *sgx_cg; + unsigned int cnt = 0; + + /* Caller must ensure css_root and start ref's acquired */ + css_root = &root->css; + if (start) + pos = &start->css; + else + pos = css_root; + + while (cnt < SGX_NR_TO_SCAN) { + sgx_cg = sgx_cgroup_from_misc_cg(css_misc(pos)); + cnt += sgx_reclaim_pages(&sgx_cg->lru); + + rcu_read_lock(); + + next = css_next_descendant_pre(pos, css_root); + + if (pos != css_root) + css_put(pos); + + if (!next || !css_tryget(next)) { + /* We are done if next is NULL or not safe to continue + * the walk if next is dead. Return NULL and the caller + * determines whether to restart from root. + */ + rcu_read_unlock(); + return NULL; + } + + rcu_read_unlock(); + pos = next; + } + + return css_misc(next); +} + +static int __sgx_cgroup_try_charge(struct sgx_cgroup *epc_cg) +{ + if (!misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg, PAGE_SIZE)) + return 0; + + /* No reclaimable pages left in the cgroup */ + if (sgx_cgroup_lru_empty(epc_cg->cg)) + return -ENOMEM; + + if (signal_pending(current)) + return -ERESTARTSYS; + + return -EBUSY; +} + +/** + * sgx_cgroup_try_charge() - try to charge cgroup for a single EPC page * @sgx_cg: The EPC cgroup to be charged for the page. + * @reclaim: Whether or not synchronous EPC reclaim is allowed. * Return: * * %0 - If successfully charged. * * -errno - for failures. */ -int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg) +int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg, enum sgx_reclaim reclaim) { - return misc_cg_try_charge(MISC_CG_RES_SGX_EPC, sgx_cg->cg, PAGE_SIZE); + int ret; + struct misc_cg *cg_next = NULL; + + for (;;) { + ret = __sgx_cgroup_try_charge(sgx_cg); + + if (ret != -EBUSY) + goto out; + + if (reclaim == SGX_NO_RECLAIM) { + ret = -ENOMEM; + goto out; + } + + cg_next = sgx_cgroup_reclaim_pages(sgx_cg->cg, cg_next); + cond_resched(); + } + +out: + if (cg_next != sgx_cg->cg) + put_misc_cg(cg_next); + return ret; } /** @@ -42,6 +184,7 @@ static void sgx_cgroup_free(struct misc_cg *cg) static void sgx_cgroup_misc_init(struct misc_cg *cg, struct sgx_cgroup *sgx_cg) { + sgx_lru_init(&sgx_cg->lru); cg->res[MISC_CG_RES_SGX_EPC].priv = sgx_cg; sgx_cg->cg = cg; } diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.h b/arch/x86/kernel/cpu/sgx/epc_cgroup.h index 5750eebc2f3f..98af9bde6087 100644 --- a/arch/x86/kernel/cpu/sgx/epc_cgroup.h +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.h @@ -20,7 +20,7 @@ static inline struct sgx_cgroup *sgx_get_current_cg(void) static inline void sgx_put_cg(struct sgx_cgroup *sgx_cg) { } -static inline int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg) +static inline int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg, enum sgx_reclaim reclaim) { return 0; } @@ -33,6 +33,7 @@ static inline void sgx_cgroup_init(void) { } struct sgx_cgroup { struct misc_cg *cg; + struct sgx_epc_lru_list lru; }; static inline struct sgx_cgroup *sgx_cgroup_from_misc_cg(struct misc_cg *cg) @@ -63,7 +64,7 @@ static inline void sgx_put_cg(struct sgx_cgroup *sgx_cg) put_misc_cg(sgx_cg->cg); } -int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg); +int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg, enum sgx_reclaim reclaim); void sgx_cgroup_uncharge(struct sgx_cgroup *sgx_cg); int __init sgx_cgroup_init(void); diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c index 1eadf4c39678..d272488c53bf 100644 --- a/arch/x86/kernel/cpu/sgx/main.c +++ b/arch/x86/kernel/cpu/sgx/main.c @@ -286,11 +286,14 @@ static void sgx_reclaimer_write(struct sgx_epc_page *epc_page, mutex_unlock(&encl->lock); } -/* - * Take a fixed number of pages from the head of the active page pool and - * reclaim them to the enclave's private shmem files. Skip the pages, which have - * been accessed since the last scan. Move those pages to the tail of active - * page pool so that the pages get scanned in LRU like fashion. +/** + * sgx_reclaim_pages() - Attempt to reclaim a fixed number of pages from an LRU + * @lru: The LRU from which pages are reclaimed. + * + * Take a fixed number of pages from the head of a given LRU and reclaim them to + * the enclave's private shmem files. Skip the pages, which have been accessed + * since the last scan. Move those pages to the tail of the list so that the + * pages get scanned in LRU like fashion. * * Batch process a chunk of pages (at the moment 16) in order to degrade amount * of IPI's and ETRACK's potentially required. sgx_encl_ewb() does degrade a bit @@ -298,8 +301,10 @@ static void sgx_reclaimer_write(struct sgx_epc_page *epc_page, * + EWB) but not sufficiently. Reclaiming one page at a time would also be * problematic as it would increase the lock contention too much, which would * halt forward progress. + * + * Return: Number of pages attempted for reclamation. */ -static void sgx_reclaim_pages(void) +unsigned int sgx_reclaim_pages(struct sgx_epc_lru_list *lru) { struct sgx_epc_page *chunk[SGX_NR_TO_SCAN]; struct sgx_backing backing[SGX_NR_TO_SCAN]; @@ -310,10 +315,9 @@ static void sgx_reclaim_pages(void) int ret; int i; - spin_lock(&sgx_global_lru.lock); + spin_lock(&lru->lock); for (i = 0; i < SGX_NR_TO_SCAN; i++) { - epc_page = list_first_entry_or_null(&sgx_global_lru.reclaimable, - struct sgx_epc_page, list); + epc_page = list_first_entry_or_null(&lru->reclaimable, struct sgx_epc_page, list); if (!epc_page) break; @@ -328,7 +332,7 @@ static void sgx_reclaim_pages(void) */ epc_page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED; } - spin_unlock(&sgx_global_lru.lock); + spin_unlock(&lru->lock); for (i = 0; i < cnt; i++) { epc_page = chunk[i]; @@ -351,9 +355,9 @@ static void sgx_reclaim_pages(void) continue; skip: - spin_lock(&sgx_global_lru.lock); - list_add_tail(&epc_page->list, &sgx_global_lru.reclaimable); - spin_unlock(&sgx_global_lru.lock); + spin_lock(&lru->lock); + list_add_tail(&epc_page->list, &lru->reclaimable); + spin_unlock(&lru->lock); kref_put(&encl_page->encl->refcount, sgx_encl_release); @@ -379,14 +383,21 @@ static void sgx_reclaim_pages(void) sgx_free_epc_page(epc_page); } + + return cnt; } -static bool sgx_should_reclaim(unsigned long watermark) +static bool sgx_should_reclaim_global(unsigned long watermark) { return atomic_long_read(&sgx_nr_free_pages) < watermark && !list_empty(&sgx_global_lru.reclaimable); } +static void sgx_reclaim_pages_global(void) +{ + sgx_reclaim_pages(&sgx_global_lru); +} + /* * sgx_reclaim_direct() should be called (without enclave's mutex held) * in locations where SGX memory resources might be low and might be @@ -394,8 +405,8 @@ static bool sgx_should_reclaim(unsigned long watermark) */ void sgx_reclaim_direct(void) { - if (sgx_should_reclaim(SGX_NR_LOW_PAGES)) - sgx_reclaim_pages(); + if (sgx_should_reclaim_global(SGX_NR_LOW_PAGES)) + sgx_reclaim_pages_global(); } static int ksgxd(void *p) @@ -415,10 +426,10 @@ static int ksgxd(void *p) wait_event_freezable(ksgxd_waitq, kthread_should_stop() || - sgx_should_reclaim(SGX_NR_HIGH_PAGES)); + sgx_should_reclaim_global(SGX_NR_HIGH_PAGES)); - if (sgx_should_reclaim(SGX_NR_HIGH_PAGES)) - sgx_reclaim_pages(); + if (sgx_should_reclaim_global(SGX_NR_HIGH_PAGES)) + sgx_reclaim_pages_global(); cond_resched(); } @@ -572,7 +583,7 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, enum sgx_reclaim reclaim) int ret; sgx_cg = sgx_get_current_cg(); - ret = sgx_cgroup_try_charge(sgx_cg); + ret = sgx_cgroup_try_charge(sgx_cg, reclaim); if (ret) { sgx_put_cg(sgx_cg); return ERR_PTR(ret); @@ -600,7 +611,7 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, enum sgx_reclaim reclaim) break; } - sgx_reclaim_pages(); + sgx_reclaim_pages_global(); cond_resched(); } @@ -613,7 +624,7 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, enum sgx_reclaim reclaim) sgx_put_cg(sgx_cg); } - if (sgx_should_reclaim(SGX_NR_LOW_PAGES)) + if (sgx_should_reclaim_global(SGX_NR_LOW_PAGES)) wake_up(&ksgxd_waitq); return page; diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h index e21c5a134a5e..4dc04f4d046c 100644 --- a/arch/x86/kernel/cpu/sgx/sgx.h +++ b/arch/x86/kernel/cpu/sgx/sgx.h @@ -136,6 +136,7 @@ void sgx_reclaim_direct(void); void sgx_mark_page_reclaimable(struct sgx_epc_page *page); int sgx_unmark_page_reclaimable(struct sgx_epc_page *page); struct sgx_epc_page *sgx_alloc_epc_page(void *owner, enum sgx_reclaim reclaim); +unsigned int sgx_reclaim_pages(struct sgx_epc_lru_list *lru); void sgx_ipi_cb(void *info);