Message ID | 20210513064837.3949064-1-ying.huang@intel.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | mm, swap: Remove unnecessary smp_rmb() in swap_type_to_swap_info() | expand |
On 2021/5/13 14:48, Huang Ying wrote: > Before commit c10d38cc8d3e ("mm, swap: bounds check swap_info array > accesses to avoid NULL derefs"), the typical code to reference the > swap_info[] is as follows, > > type = swp_type(swp_entry); > if (type >= nr_swapfiles) > /* handle invalid swp_entry */; > p = swap_info[type]; > /* access fields of *p. OOPS! p may be NULL! */ > > Because the ordering isn't guaranteed, it's possible that "p" is read > before checking "type". And that may result in NULL pointer > dereference. > > So in commit c10d38cc8d3e, the code becomes, > > struct swap_info_struct *swap_type_to_swap_info(int type) > { > if (type >= READ_ONCE(nr_swapfiles)) > return NULL; > smp_rmb(); > return READ_ONCE(swap_info[type]); > } > > /* users */ > type = swp_type(swp_entry); > p = swap_type_to_swap_info(type); > if (!p) > /* handle invalid swp_entry */; > /* access fields of *p */ > > Because "p" is checked to be non-zero before dereference, smp_rmb() > isn't needed anymore. > > We still need to guarantee swap_info[type] is read before dereference. > That can be satisfied via the data dependency ordering of > READ_ONCE(swap_info[type]). The corresponding smp_wmb() is adjusted > in alloc_swap_info() too. > > And, we don't need to read "nr_swapfiles" too. Because if > "type >= nr_swapfiles", swap_info[type] will be NULL. We just need > to make sure we will not access out of the boundary of the array. > With that change, nr_swapfiles will only be accessed with swap_lock > held, except in swapcache_free_entries(). Where the absolute > correctness of the value isn't needed, as described in the comments. > > Signed-off-by: "Huang, Ying" <ying.huang@intel.com> > Cc: Daniel Jordan <daniel.m.jordan@oracle.com> > Cc: Dan Carpenter <dan.carpenter@oracle.com> > Cc: Andrea Parri <andrea.parri@amarulasolutions.com> > Cc: Peter Zijlstra (Intel) <peterz@infradead.org> > Cc: Andi Kleen <ak@linux.intel.com> > Cc: Dave Hansen <dave.hansen@linux.intel.com> > Cc: Omar Sandoval <osandov@fb.com> > Cc: Paul McKenney <paulmck@kernel.org> > Cc: Tejun Heo <tj@kernel.org> > Cc: Will Deacon <will.deacon@arm.com> > Cc: Miaohe Lin <linmiaohe@huawei.com> > --- > mm/swapfile.c | 18 +++++++++--------- > 1 file changed, 9 insertions(+), 9 deletions(-) > > diff --git a/mm/swapfile.c b/mm/swapfile.c > index 2aad85751991..4c1fb28bbe0e 100644 > --- a/mm/swapfile.c > +++ b/mm/swapfile.c > @@ -100,10 +100,14 @@ atomic_t nr_rotate_swap = ATOMIC_INIT(0); > > static struct swap_info_struct *swap_type_to_swap_info(int type) > { > - if (type >= READ_ONCE(nr_swapfiles)) > + if (type >= MAX_SWAPFILES) > return NULL; > > - smp_rmb(); /* Pairs with smp_wmb in alloc_swap_info. */ > + /* > + * The data dependency ordering from the READ_ONCE() pairs > + * with smp_wmb() in alloc_swap_info() to guarantee the > + * swap_info_struct fields are read after swap_info[type]. > + */ > return READ_ONCE(swap_info[type]); > } > > @@ -2884,14 +2888,10 @@ static struct swap_info_struct *alloc_swap_info(void) > } > if (type >= nr_swapfiles) { > p->type = type; > - WRITE_ONCE(swap_info[type], p); > - /* > - * Write swap_info[type] before nr_swapfiles, in case a > - * racing procfs swap_start() or swap_next() is reading them. > - * (We never shrink nr_swapfiles, we never free this entry.) > - */ > + /* Paired with READ_ONCE() in swap_type_to_swap_info() */ > smp_wmb(); Many thank for your patch. The patch looks fine to me. There is one question: There is no smp_rmb() paired with above smp_wmb(). What is this smp_wmb() used for ? Could you please have a explanation ? Thanks again! > - WRITE_ONCE(nr_swapfiles, nr_swapfiles + 1); > + WRITE_ONCE(swap_info[type], p); > + nr_swapfiles++; > } else { > defer = p; > p = swap_info[type]; >
On Thu, May 13, 2021 at 5:11 PM Miaohe Lin <linmiaohe@huawei.com> wrote: > > On 2021/5/13 14:48, Huang Ying wrote: > > Before commit c10d38cc8d3e ("mm, swap: bounds check swap_info array > > accesses to avoid NULL derefs"), the typical code to reference the > > swap_info[] is as follows, > > > > type = swp_type(swp_entry); > > if (type >= nr_swapfiles) > > /* handle invalid swp_entry */; > > p = swap_info[type]; > > /* access fields of *p. OOPS! p may be NULL! */ > > > > Because the ordering isn't guaranteed, it's possible that "p" is read > > before checking "type". And that may result in NULL pointer > > dereference. > > > > So in commit c10d38cc8d3e, the code becomes, > > > > struct swap_info_struct *swap_type_to_swap_info(int type) > > { > > if (type >= READ_ONCE(nr_swapfiles)) > > return NULL; > > smp_rmb(); > > return READ_ONCE(swap_info[type]); > > } > > > > /* users */ > > type = swp_type(swp_entry); > > p = swap_type_to_swap_info(type); > > if (!p) > > /* handle invalid swp_entry */; > > /* access fields of *p */ > > > > Because "p" is checked to be non-zero before dereference, smp_rmb() > > isn't needed anymore. > > > > We still need to guarantee swap_info[type] is read before dereference. > > That can be satisfied via the data dependency ordering of > > READ_ONCE(swap_info[type]). The corresponding smp_wmb() is adjusted > > in alloc_swap_info() too. > > > > And, we don't need to read "nr_swapfiles" too. Because if > > "type >= nr_swapfiles", swap_info[type] will be NULL. We just need > > to make sure we will not access out of the boundary of the array. > > With that change, nr_swapfiles will only be accessed with swap_lock > > held, except in swapcache_free_entries(). Where the absolute > > correctness of the value isn't needed, as described in the comments. > > > > Signed-off-by: "Huang, Ying" <ying.huang@intel.com> > > Cc: Daniel Jordan <daniel.m.jordan@oracle.com> > > Cc: Dan Carpenter <dan.carpenter@oracle.com> > > Cc: Andrea Parri <andrea.parri@amarulasolutions.com> > > Cc: Peter Zijlstra (Intel) <peterz@infradead.org> > > Cc: Andi Kleen <ak@linux.intel.com> > > Cc: Dave Hansen <dave.hansen@linux.intel.com> > > Cc: Omar Sandoval <osandov@fb.com> > > Cc: Paul McKenney <paulmck@kernel.org> > > Cc: Tejun Heo <tj@kernel.org> > > Cc: Will Deacon <will.deacon@arm.com> > > Cc: Miaohe Lin <linmiaohe@huawei.com> > > --- > > mm/swapfile.c | 18 +++++++++--------- > > 1 file changed, 9 insertions(+), 9 deletions(-) > > > > diff --git a/mm/swapfile.c b/mm/swapfile.c > > index 2aad85751991..4c1fb28bbe0e 100644 > > --- a/mm/swapfile.c > > +++ b/mm/swapfile.c > > @@ -100,10 +100,14 @@ atomic_t nr_rotate_swap = ATOMIC_INIT(0); > > > > static struct swap_info_struct *swap_type_to_swap_info(int type) > > { > > - if (type >= READ_ONCE(nr_swapfiles)) > > + if (type >= MAX_SWAPFILES) > > return NULL; > > > > - smp_rmb(); /* Pairs with smp_wmb in alloc_swap_info. */ > > + /* > > + * The data dependency ordering from the READ_ONCE() pairs > > + * with smp_wmb() in alloc_swap_info() to guarantee the > > + * swap_info_struct fields are read after swap_info[type]. > > + */ > > return READ_ONCE(swap_info[type]); > > } > > > > @@ -2884,14 +2888,10 @@ static struct swap_info_struct *alloc_swap_info(void) > > } > > if (type >= nr_swapfiles) { > > p->type = type; > > - WRITE_ONCE(swap_info[type], p); > > - /* > > - * Write swap_info[type] before nr_swapfiles, in case a > > - * racing procfs swap_start() or swap_next() is reading them. > > - * (We never shrink nr_swapfiles, we never free this entry.) > > - */ > > + /* Paired with READ_ONCE() in swap_type_to_swap_info() */ > > smp_wmb(); > > Many thank for your patch. The patch looks fine to me. There is one question: > > There is no smp_rmb() paired with above smp_wmb(). What is this smp_wmb() used for ? > Could you please have a explanation ? The comment is very clear, it matches READ_ONCE() which implies a data dependence barrier on some archs. Thanks. > > Thanks again! > > > - WRITE_ONCE(nr_swapfiles, nr_swapfiles + 1); > > + WRITE_ONCE(swap_info[type], p); > > + nr_swapfiles++; > > } else { > > defer = p; > > p = swap_info[type]; > > >
On 2021/5/13 17:54, Muchun Song wrote: > On Thu, May 13, 2021 at 5:11 PM Miaohe Lin <linmiaohe@huawei.com> wrote: >> >> On 2021/5/13 14:48, Huang Ying wrote: >>> Before commit c10d38cc8d3e ("mm, swap: bounds check swap_info array >>> accesses to avoid NULL derefs"), the typical code to reference the >>> swap_info[] is as follows, >>> >>> type = swp_type(swp_entry); >>> if (type >= nr_swapfiles) >>> /* handle invalid swp_entry */; >>> p = swap_info[type]; >>> /* access fields of *p. OOPS! p may be NULL! */ >>> >>> Because the ordering isn't guaranteed, it's possible that "p" is read >>> before checking "type". And that may result in NULL pointer >>> dereference. >>> >>> So in commit c10d38cc8d3e, the code becomes, >>> >>> struct swap_info_struct *swap_type_to_swap_info(int type) >>> { >>> if (type >= READ_ONCE(nr_swapfiles)) >>> return NULL; >>> smp_rmb(); >>> return READ_ONCE(swap_info[type]); >>> } >>> >>> /* users */ >>> type = swp_type(swp_entry); >>> p = swap_type_to_swap_info(type); >>> if (!p) >>> /* handle invalid swp_entry */; >>> /* access fields of *p */ >>> >>> Because "p" is checked to be non-zero before dereference, smp_rmb() >>> isn't needed anymore. >>> >>> We still need to guarantee swap_info[type] is read before dereference. >>> That can be satisfied via the data dependency ordering of >>> READ_ONCE(swap_info[type]). The corresponding smp_wmb() is adjusted >>> in alloc_swap_info() too. >>> >>> And, we don't need to read "nr_swapfiles" too. Because if >>> "type >= nr_swapfiles", swap_info[type] will be NULL. We just need >>> to make sure we will not access out of the boundary of the array. >>> With that change, nr_swapfiles will only be accessed with swap_lock >>> held, except in swapcache_free_entries(). Where the absolute >>> correctness of the value isn't needed, as described in the comments. >>> >>> Signed-off-by: "Huang, Ying" <ying.huang@intel.com> >>> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> >>> Cc: Dan Carpenter <dan.carpenter@oracle.com> >>> Cc: Andrea Parri <andrea.parri@amarulasolutions.com> >>> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> >>> Cc: Andi Kleen <ak@linux.intel.com> >>> Cc: Dave Hansen <dave.hansen@linux.intel.com> >>> Cc: Omar Sandoval <osandov@fb.com> >>> Cc: Paul McKenney <paulmck@kernel.org> >>> Cc: Tejun Heo <tj@kernel.org> >>> Cc: Will Deacon <will.deacon@arm.com> >>> Cc: Miaohe Lin <linmiaohe@huawei.com> >>> --- >>> mm/swapfile.c | 18 +++++++++--------- >>> 1 file changed, 9 insertions(+), 9 deletions(-) >>> >>> diff --git a/mm/swapfile.c b/mm/swapfile.c >>> index 2aad85751991..4c1fb28bbe0e 100644 >>> --- a/mm/swapfile.c >>> +++ b/mm/swapfile.c >>> @@ -100,10 +100,14 @@ atomic_t nr_rotate_swap = ATOMIC_INIT(0); >>> >>> static struct swap_info_struct *swap_type_to_swap_info(int type) >>> { >>> - if (type >= READ_ONCE(nr_swapfiles)) >>> + if (type >= MAX_SWAPFILES) >>> return NULL; >>> >>> - smp_rmb(); /* Pairs with smp_wmb in alloc_swap_info. */ >>> + /* >>> + * The data dependency ordering from the READ_ONCE() pairs >>> + * with smp_wmb() in alloc_swap_info() to guarantee the >>> + * swap_info_struct fields are read after swap_info[type]. >>> + */ >>> return READ_ONCE(swap_info[type]); >>> } >>> >>> @@ -2884,14 +2888,10 @@ static struct swap_info_struct *alloc_swap_info(void) >>> } >>> if (type >= nr_swapfiles) { >>> p->type = type; >>> - WRITE_ONCE(swap_info[type], p); >>> - /* >>> - * Write swap_info[type] before nr_swapfiles, in case a >>> - * racing procfs swap_start() or swap_next() is reading them. >>> - * (We never shrink nr_swapfiles, we never free this entry.) >>> - */ >>> + /* Paired with READ_ONCE() in swap_type_to_swap_info() */ >>> smp_wmb(); >> >> Many thank for your patch. The patch looks fine to me. There is one question: >> >> There is no smp_rmb() paired with above smp_wmb(). What is this smp_wmb() used for ? >> Could you please have a explanation ? > > The comment is very clear, it matches READ_ONCE() which implies a > data dependence barrier on some archs. > > Thanks. Got it. I misunderstood it... Thanks! > >> >> Thanks again! >> >>> - WRITE_ONCE(nr_swapfiles, nr_swapfiles + 1); >>> + WRITE_ONCE(swap_info[type], p); >>> + nr_swapfiles++; >>> } else { >>> defer = p; >>> p = swap_info[type]; >>> >> > . >
On Thu, May 13, 2021 at 05:54:42PM +0800, Muchun Song wrote: > On Thu, May 13, 2021 at 5:11 PM Miaohe Lin <linmiaohe@huawei.com> wrote: > > On 2021/5/13 14:48, Huang Ying wrote: > > > mm/swapfile.c | 18 +++++++++--------- > > > 1 file changed, 9 insertions(+), 9 deletions(-) > > > > > > diff --git a/mm/swapfile.c b/mm/swapfile.c > > > index 2aad85751991..4c1fb28bbe0e 100644 > > > --- a/mm/swapfile.c > > > +++ b/mm/swapfile.c > > > @@ -100,10 +100,14 @@ atomic_t nr_rotate_swap = ATOMIC_INIT(0); > > > > > > static struct swap_info_struct *swap_type_to_swap_info(int type) > > > { > > > - if (type >= READ_ONCE(nr_swapfiles)) > > > + if (type >= MAX_SWAPFILES) > > > return NULL; > > > > > > - smp_rmb(); /* Pairs with smp_wmb in alloc_swap_info. */ > > > + /* > > > + * The data dependency ordering from the READ_ONCE() pairs > > > + * with smp_wmb() in alloc_swap_info() to guarantee the > > > + * swap_info_struct fields are read after swap_info[type]. > > > + */ > > > return READ_ONCE(swap_info[type]); > > > } > > > > > > @@ -2884,14 +2888,10 @@ static struct swap_info_struct *alloc_swap_info(void) > > > } > > > if (type >= nr_swapfiles) { > > > p->type = type; > > > - WRITE_ONCE(swap_info[type], p); > > > - /* > > > - * Write swap_info[type] before nr_swapfiles, in case a > > > - * racing procfs swap_start() or swap_next() is reading them. > > > - * (We never shrink nr_swapfiles, we never free this entry.) > > > - */ > > > + /* Paired with READ_ONCE() in swap_type_to_swap_info() */ > > > smp_wmb(); > > > > Many thank for your patch. The patch looks fine to me. There is one question: > > > > There is no smp_rmb() paired with above smp_wmb(). What is this smp_wmb() used for ? > > Could you please have a explanation ? > > The comment is very clear, it matches READ_ONCE() which implies a > data dependence barrier on some archs. This statement doesn't make sense; this isn't code that needs to be correct on 'some' archs, it needs to be unconditionally correct. Also, you cannot pair with a single memop, there is no order in a set of one element. And if you depend on a data dependency, you need a store order; but you just removed the store order. in which case the data dependency is also moot. All of this is utter confusion. Possibly correct, but a complete trainwreck non-the-less. Either you say ordering is irrelevant, because we only ever increase the number of swapfiles and therefore any load is either NULL or the correct pointer, as guaranteed by WRITE_ONCE()/READ_ONCE() avoiding load/store tearing. Or you need the data dependency, but then you also need the store order like: CPU0 CPU1 if (type >= READ_ONCE(nr_swapfiles)) WRITE_ONCE(swap_info[type], p); return NULL; /* data-dependency on type */ smp_wmb(); return READ_ONCE(swap_info[type]); WRITE_ONCE(nr_swapfiles, nr_swapfiles+1); But you cannot have half of both and expect any of it to make sense.
On Thu, May 13, 2021 at 02:48:37PM +0800, Huang Ying wrote: > mm/swapfile.c | 18 +++++++++--------- > 1 file changed, 9 insertions(+), 9 deletions(-) > > diff --git a/mm/swapfile.c b/mm/swapfile.c > index 2aad85751991..4c1fb28bbe0e 100644 > --- a/mm/swapfile.c > +++ b/mm/swapfile.c > @@ -100,10 +100,14 @@ atomic_t nr_rotate_swap = ATOMIC_INIT(0); > > static struct swap_info_struct *swap_type_to_swap_info(int type) > { > - if (type >= READ_ONCE(nr_swapfiles)) > + if (type >= MAX_SWAPFILES) > return NULL; > > - smp_rmb(); /* Pairs with smp_wmb in alloc_swap_info. */ > + /* > + * The data dependency ordering from the READ_ONCE() pairs > + * with smp_wmb() in alloc_swap_info() to guarantee the > + * swap_info_struct fields are read after swap_info[type]. > + */ > return READ_ONCE(swap_info[type]); > } > > @@ -2884,14 +2888,10 @@ static struct swap_info_struct *alloc_swap_info(void) > } > if (type >= nr_swapfiles) { > p->type = type; > - WRITE_ONCE(swap_info[type], p); > - /* > - * Write swap_info[type] before nr_swapfiles, in case a > - * racing procfs swap_start() or swap_next() is reading them. > - * (We never shrink nr_swapfiles, we never free this entry.) > - */ > + /* Paired with READ_ONCE() in swap_type_to_swap_info() */ > smp_wmb(); > - WRITE_ONCE(nr_swapfiles, nr_swapfiles + 1); > + WRITE_ONCE(swap_info[type], p); > + nr_swapfiles++; Ah, I think I see what you meant to say, it would perhaps help if you write it like so: diff --git a/mm/swapfile.c b/mm/swapfile.c index 149e77454e3c..94735248dcd2 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -99,11 +99,10 @@ atomic_t nr_rotate_swap = ATOMIC_INIT(0); static struct swap_info_struct *swap_type_to_swap_info(int type) { - if (type >= READ_ONCE(nr_swapfiles)) + if (type >= MAX_SWAPFILES) return NULL; - smp_rmb(); /* Pairs with smp_wmb in alloc_swap_info. */ - return READ_ONCE(swap_info[type]); + return READ_ONCE(swap_info[type]); /* rcu_dereference() */ } static inline unsigned char swap_count(unsigned char ent) @@ -2869,14 +2868,11 @@ static struct swap_info_struct *alloc_swap_info(void) } if (type >= nr_swapfiles) { p->type = type; - WRITE_ONCE(swap_info[type], p); /* - * Write swap_info[type] before nr_swapfiles, in case a - * racing procfs swap_start() or swap_next() is reading them. - * (We never shrink nr_swapfiles, we never free this entry.) + * Publish the swap_info_struct. */ - smp_wmb(); - WRITE_ONCE(nr_swapfiles, nr_swapfiles + 1); + smp_store_release(&swap_info[type], p); /* rcu_assign_pointer() */ + nr_swapfiles++; } else { defer = p; p = swap_info[type];
On Thu, May 13, 2021 at 02:46:10PM +0200, Peter Zijlstra wrote: > Ah, I think I see what you meant to say, it would perhaps help if you > write it like so: > > > diff --git a/mm/swapfile.c b/mm/swapfile.c > index 149e77454e3c..94735248dcd2 100644 > --- a/mm/swapfile.c > +++ b/mm/swapfile.c > @@ -99,11 +99,10 @@ atomic_t nr_rotate_swap = ATOMIC_INIT(0); > > static struct swap_info_struct *swap_type_to_swap_info(int type) > { > - if (type >= READ_ONCE(nr_swapfiles)) > + if (type >= MAX_SWAPFILES) > return NULL; > > - smp_rmb(); /* Pairs with smp_wmb in alloc_swap_info. */ > - return READ_ONCE(swap_info[type]); > + return READ_ONCE(swap_info[type]); /* rcu_dereference() */ > } > > static inline unsigned char swap_count(unsigned char ent) > @@ -2869,14 +2868,11 @@ static struct swap_info_struct *alloc_swap_info(void) > } > if (type >= nr_swapfiles) { > p->type = type; > - WRITE_ONCE(swap_info[type], p); > /* > - * Write swap_info[type] before nr_swapfiles, in case a > - * racing procfs swap_start() or swap_next() is reading them. > - * (We never shrink nr_swapfiles, we never free this entry.) > + * Publish the swap_info_struct. > */ > - smp_wmb(); > - WRITE_ONCE(nr_swapfiles, nr_swapfiles + 1); > + smp_store_release(&swap_info[type], p); /* rcu_assign_pointer() */ > + nr_swapfiles++; Yes, this does help, I didn't understand why smp_wmb stayed around in the original post. I think the only access smp_store_release() orders is p->type. Wouldn't it be kinda inconsistent to only initialize that one field before publishing when many others would be done at the end of alloc_swap_info() after the fact? p->type doesn't seem special. For instance, get_swap_page_of_type() touches si->lock soon after it calls swap_type_to_swap_info(), so there could be a small window where there's a non-NULL si with an uninitialized lock. It's not as if this is likely to be a problem in practice, it would just make it harder to understand why smp_store_release is there. Maybe all we need is a WRITE_ONCE, or if it's really necessary for certain fields to be set before publication then move them up and explain?
Peter Zijlstra <peterz@infradead.org> writes: > On Thu, May 13, 2021 at 02:48:37PM +0800, Huang Ying wrote: >> mm/swapfile.c | 18 +++++++++--------- >> 1 file changed, 9 insertions(+), 9 deletions(-) >> >> diff --git a/mm/swapfile.c b/mm/swapfile.c >> index 2aad85751991..4c1fb28bbe0e 100644 >> --- a/mm/swapfile.c >> +++ b/mm/swapfile.c >> @@ -100,10 +100,14 @@ atomic_t nr_rotate_swap = ATOMIC_INIT(0); >> >> static struct swap_info_struct *swap_type_to_swap_info(int type) >> { >> - if (type >= READ_ONCE(nr_swapfiles)) >> + if (type >= MAX_SWAPFILES) >> return NULL; >> >> - smp_rmb(); /* Pairs with smp_wmb in alloc_swap_info. */ >> + /* >> + * The data dependency ordering from the READ_ONCE() pairs >> + * with smp_wmb() in alloc_swap_info() to guarantee the >> + * swap_info_struct fields are read after swap_info[type]. >> + */ >> return READ_ONCE(swap_info[type]); >> } >> >> @@ -2884,14 +2888,10 @@ static struct swap_info_struct *alloc_swap_info(void) >> } >> if (type >= nr_swapfiles) { >> p->type = type; >> - WRITE_ONCE(swap_info[type], p); >> - /* >> - * Write swap_info[type] before nr_swapfiles, in case a >> - * racing procfs swap_start() or swap_next() is reading them. >> - * (We never shrink nr_swapfiles, we never free this entry.) >> - */ >> + /* Paired with READ_ONCE() in swap_type_to_swap_info() */ >> smp_wmb(); >> - WRITE_ONCE(nr_swapfiles, nr_swapfiles + 1); >> + WRITE_ONCE(swap_info[type], p); >> + nr_swapfiles++; > > Ah, I think I see what you meant to say, it would perhaps help if you > write it like so: > > > diff --git a/mm/swapfile.c b/mm/swapfile.c > index 149e77454e3c..94735248dcd2 100644 > --- a/mm/swapfile.c > +++ b/mm/swapfile.c > @@ -99,11 +99,10 @@ atomic_t nr_rotate_swap = ATOMIC_INIT(0); > > static struct swap_info_struct *swap_type_to_swap_info(int type) > { > - if (type >= READ_ONCE(nr_swapfiles)) > + if (type >= MAX_SWAPFILES) > return NULL; > > - smp_rmb(); /* Pairs with smp_wmb in alloc_swap_info. */ > - return READ_ONCE(swap_info[type]); > + return READ_ONCE(swap_info[type]); /* rcu_dereference() */ > } > > static inline unsigned char swap_count(unsigned char ent) > @@ -2869,14 +2868,11 @@ static struct swap_info_struct *alloc_swap_info(void) > } > if (type >= nr_swapfiles) { > p->type = type; > - WRITE_ONCE(swap_info[type], p); > /* > - * Write swap_info[type] before nr_swapfiles, in case a > - * racing procfs swap_start() or swap_next() is reading them. > - * (We never shrink nr_swapfiles, we never free this entry.) > + * Publish the swap_info_struct. > */ > - smp_wmb(); > - WRITE_ONCE(nr_swapfiles, nr_swapfiles + 1); > + smp_store_release(&swap_info[type], p); /* rcu_assign_pointer() */ > + nr_swapfiles++; > } else { > defer = p; > p = swap_info[type]; OK. It seems that this helps people to understand. I will use this in the next version. Best Regards, Huang, Ying
Daniel Jordan <daniel.m.jordan@oracle.com> writes: > On Thu, May 13, 2021 at 02:46:10PM +0200, Peter Zijlstra wrote: >> Ah, I think I see what you meant to say, it would perhaps help if you >> write it like so: >> >> >> diff --git a/mm/swapfile.c b/mm/swapfile.c >> index 149e77454e3c..94735248dcd2 100644 >> --- a/mm/swapfile.c >> +++ b/mm/swapfile.c >> @@ -99,11 +99,10 @@ atomic_t nr_rotate_swap = ATOMIC_INIT(0); >> >> static struct swap_info_struct *swap_type_to_swap_info(int type) >> { >> - if (type >= READ_ONCE(nr_swapfiles)) >> + if (type >= MAX_SWAPFILES) >> return NULL; >> >> - smp_rmb(); /* Pairs with smp_wmb in alloc_swap_info. */ >> - return READ_ONCE(swap_info[type]); >> + return READ_ONCE(swap_info[type]); /* rcu_dereference() */ >> } >> >> static inline unsigned char swap_count(unsigned char ent) >> @@ -2869,14 +2868,11 @@ static struct swap_info_struct *alloc_swap_info(void) >> } >> if (type >= nr_swapfiles) { >> p->type = type; >> - WRITE_ONCE(swap_info[type], p); >> /* >> - * Write swap_info[type] before nr_swapfiles, in case a >> - * racing procfs swap_start() or swap_next() is reading them. >> - * (We never shrink nr_swapfiles, we never free this entry.) >> + * Publish the swap_info_struct. >> */ >> - smp_wmb(); >> - WRITE_ONCE(nr_swapfiles, nr_swapfiles + 1); >> + smp_store_release(&swap_info[type], p); /* rcu_assign_pointer() */ >> + nr_swapfiles++; > > Yes, this does help, I didn't understand why smp_wmb stayed around in > the original post. > > I think the only access smp_store_release() orders is p->type. Wouldn't > it be kinda inconsistent to only initialize that one field before > publishing when many others would be done at the end of > alloc_swap_info() after the fact? In addition to p->type, *p is zeroed via kvzalloc(). > p->type doesn't seem special. For > instance, get_swap_page_of_type() touches si->lock soon after it calls > swap_type_to_swap_info(), so there could be a small window where there's > a non-NULL si with an uninitialized lock. We usually check the state of swap_info_struct before other operations. For example, we check si->swap_map in swap_start(). > It's not as if this is likely to be a problem in practice, it would just > make it harder to understand why smp_store_release is there. Maybe all > we need is a WRITE_ONCE, or if it's really necessary for certain fields > to be set before publication then move them up and explain? I think we have initialized all fields before publication :-). Best Regards, Huang, Ying
On Thu, May 13, 2021 at 09:59:46PM -0400, Daniel Jordan wrote: > On Thu, May 13, 2021 at 02:46:10PM +0200, Peter Zijlstra wrote: > > Ah, I think I see what you meant to say, it would perhaps help if you > > write it like so: > > > > > > diff --git a/mm/swapfile.c b/mm/swapfile.c > > index 149e77454e3c..94735248dcd2 100644 > > --- a/mm/swapfile.c > > +++ b/mm/swapfile.c > > @@ -99,11 +99,10 @@ atomic_t nr_rotate_swap = ATOMIC_INIT(0); > > > > static struct swap_info_struct *swap_type_to_swap_info(int type) > > { > > - if (type >= READ_ONCE(nr_swapfiles)) > > + if (type >= MAX_SWAPFILES) > > return NULL; > > > > - smp_rmb(); /* Pairs with smp_wmb in alloc_swap_info. */ > > - return READ_ONCE(swap_info[type]); > > + return READ_ONCE(swap_info[type]); /* rcu_dereference() */ > > } > > > > static inline unsigned char swap_count(unsigned char ent) > > @@ -2869,14 +2868,11 @@ static struct swap_info_struct *alloc_swap_info(void) > > } > > if (type >= nr_swapfiles) { > > p->type = type; > > - WRITE_ONCE(swap_info[type], p); > > /* > > - * Write swap_info[type] before nr_swapfiles, in case a > > - * racing procfs swap_start() or swap_next() is reading them. > > - * (We never shrink nr_swapfiles, we never free this entry.) > > + * Publish the swap_info_struct. > > */ > > - smp_wmb(); > > - WRITE_ONCE(nr_swapfiles, nr_swapfiles + 1); > > + smp_store_release(&swap_info[type], p); /* rcu_assign_pointer() */ > > + nr_swapfiles++; > > Yes, this does help, I didn't understand why smp_wmb stayed around in > the original post. > > I think the only access smp_store_release() orders is p->type. Wouldn't > it be kinda inconsistent to only initialize that one field before > publishing when many others would be done at the end of > alloc_swap_info() after the fact? p->type doesn't seem special. For > instance, get_swap_page_of_type() touches si->lock soon after it calls > swap_type_to_swap_info(), so there could be a small window where there's > a non-NULL si with an uninitialized lock. > > It's not as if this is likely to be a problem in practice, it would just > make it harder to understand why smp_store_release is there. Maybe all > we need is a WRITE_ONCE, or if it's really necessary for certain fields > to be set before publication then move them up and explain? You also care about the zero fill from kvzalloc(). Without the smp_store_release() the zero-fill from the memset() might only be visible 'late'. Unless that also isn't a problem?
On Fri, May 14, 2021 at 12:02:05PM +0800, Huang, Ying wrote: > Daniel Jordan <daniel.m.jordan@oracle.com> writes: > > Yes, this does help, I didn't understand why smp_wmb stayed around in > > the original post. > > > > I think the only access smp_store_release() orders is p->type. Wouldn't > > it be kinda inconsistent to only initialize that one field before > > publishing when many others would be done at the end of > > alloc_swap_info() after the fact? > > In addition to p->type, *p is zeroed via kvzalloc(). So it is, good point. > > p->type doesn't seem special. For > > instance, get_swap_page_of_type() touches si->lock soon after it calls > > swap_type_to_swap_info(), so there could be a small window where there's > > a non-NULL si with an uninitialized lock. > > We usually check the state of swap_info_struct before other operations. > For example, we check si->swap_map in swap_start(). Yes, we usually do. > > It's not as if this is likely to be a problem in practice, it would just > > make it harder to understand why smp_store_release is there. Maybe all > > we need is a WRITE_ONCE, or if it's really necessary for certain fields > > to be set before publication then move them up and explain? > > I think we have initialized all fields before publication :-). Probably all the ones that matter in practice, yes :-) Still feeling slightly uneasy about the theoretical p->lock, but that was possible before this change too so it's out of scope. A comment explaining the pairing and that we care mostly about the zero init would be nice.
On Fri, May 14, 2021 at 02:04:14PM +0200, Peter Zijlstra wrote: > On Thu, May 13, 2021 at 09:59:46PM -0400, Daniel Jordan wrote: > > Yes, this does help, I didn't understand why smp_wmb stayed around in > > the original post. > > > > I think the only access smp_store_release() orders is p->type. Wouldn't > > it be kinda inconsistent to only initialize that one field before > > publishing when many others would be done at the end of > > alloc_swap_info() after the fact? p->type doesn't seem special. For > > instance, get_swap_page_of_type() touches si->lock soon after it calls > > swap_type_to_swap_info(), so there could be a small window where there's > > a non-NULL si with an uninitialized lock. > > > > It's not as if this is likely to be a problem in practice, it would just > > make it harder to understand why smp_store_release is there. Maybe all > > we need is a WRITE_ONCE, or if it's really necessary for certain fields > > to be set before publication then move them up and explain? > > You also care about the zero fill from kvzalloc(). Without the > smp_store_release() the zero-fill from the memset() might only be > visible 'late'. Aha, yes, didn't consider that! > Unless that also isn't a problem? No, you're right, we need that for p->flags at least.
diff --git a/mm/swapfile.c b/mm/swapfile.c index 2aad85751991..4c1fb28bbe0e 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -100,10 +100,14 @@ atomic_t nr_rotate_swap = ATOMIC_INIT(0); static struct swap_info_struct *swap_type_to_swap_info(int type) { - if (type >= READ_ONCE(nr_swapfiles)) + if (type >= MAX_SWAPFILES) return NULL; - smp_rmb(); /* Pairs with smp_wmb in alloc_swap_info. */ + /* + * The data dependency ordering from the READ_ONCE() pairs + * with smp_wmb() in alloc_swap_info() to guarantee the + * swap_info_struct fields are read after swap_info[type]. + */ return READ_ONCE(swap_info[type]); } @@ -2884,14 +2888,10 @@ static struct swap_info_struct *alloc_swap_info(void) } if (type >= nr_swapfiles) { p->type = type; - WRITE_ONCE(swap_info[type], p); - /* - * Write swap_info[type] before nr_swapfiles, in case a - * racing procfs swap_start() or swap_next() is reading them. - * (We never shrink nr_swapfiles, we never free this entry.) - */ + /* Paired with READ_ONCE() in swap_type_to_swap_info() */ smp_wmb(); - WRITE_ONCE(nr_swapfiles, nr_swapfiles + 1); + WRITE_ONCE(swap_info[type], p); + nr_swapfiles++; } else { defer = p; p = swap_info[type];
Before commit c10d38cc8d3e ("mm, swap: bounds check swap_info array accesses to avoid NULL derefs"), the typical code to reference the swap_info[] is as follows, type = swp_type(swp_entry); if (type >= nr_swapfiles) /* handle invalid swp_entry */; p = swap_info[type]; /* access fields of *p. OOPS! p may be NULL! */ Because the ordering isn't guaranteed, it's possible that "p" is read before checking "type". And that may result in NULL pointer dereference. So in commit c10d38cc8d3e, the code becomes, struct swap_info_struct *swap_type_to_swap_info(int type) { if (type >= READ_ONCE(nr_swapfiles)) return NULL; smp_rmb(); return READ_ONCE(swap_info[type]); } /* users */ type = swp_type(swp_entry); p = swap_type_to_swap_info(type); if (!p) /* handle invalid swp_entry */; /* access fields of *p */ Because "p" is checked to be non-zero before dereference, smp_rmb() isn't needed anymore. We still need to guarantee swap_info[type] is read before dereference. That can be satisfied via the data dependency ordering of READ_ONCE(swap_info[type]). The corresponding smp_wmb() is adjusted in alloc_swap_info() too. And, we don't need to read "nr_swapfiles" too. Because if "type >= nr_swapfiles", swap_info[type] will be NULL. We just need to make sure we will not access out of the boundary of the array. With that change, nr_swapfiles will only be accessed with swap_lock held, except in swapcache_free_entries(). Where the absolute correctness of the value isn't needed, as described in the comments. Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: Andrea Parri <andrea.parri@amarulasolutions.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andi Kleen <ak@linux.intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Omar Sandoval <osandov@fb.com> Cc: Paul McKenney <paulmck@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Will Deacon <will.deacon@arm.com> Cc: Miaohe Lin <linmiaohe@huawei.com> --- mm/swapfile.c | 18 +++++++++--------- 1 file changed, 9 insertions(+), 9 deletions(-)