Message ID | 20220126183429.1840447-2-pasha.tatashin@soleen.com (mailing list archive)
---|---
State | New
Series | Hardening page _refcount
On Wed, Jan 26, 2022 at 06:34:21PM +0000, Pasha Tatashin wrote:
> The problems with page->_refcount are hard to debug, because usually
> when they are detected, the damage has occurred a long time ago. Yet,
> the problems with invalid page refcount may be catastrophic and lead to
> memory corruptions.
>
> Reduce the scope of when the _refcount problems manifest themselves by
> adding checks for underflows and overflows into functions that modify
> _refcount.

If you're chasing a bug like this, presumably you turn on page
tracepoints. So could we reduce the cost of this by putting the
VM_BUG_ON_PAGE parts into __page_ref_mod() et al? Yes, we'd need to
change the arguments to those functions to pass in old & new, but that
should be a cheap change compared to embedding the VM_BUG_ON_PAGE.

> static inline void page_ref_add(struct page *page, int nr)
> {
> -        atomic_add(nr, &page->_refcount);
> +        int old_val = atomic_fetch_add(nr, &page->_refcount);
> +        int new_val = old_val + nr;
> +
> +        VM_BUG_ON_PAGE((unsigned int)new_val < (unsigned int)old_val, page);
>          if (page_ref_tracepoint_active(page_ref_mod))
>                  __page_ref_mod(page, nr);
> }
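For illustration only, a rough sketch of the alternative Matthew describes could
look like the code below. The three-argument __page_ref_mod() is hypothetical
(today's helper takes just the page and a delta), and the sub/dec helpers would
need the opposite comparison; the point is that the check is only paid for when
the tracepoint path is already being taken:

/*
 * Hypothetical sketch only: a three-argument __page_ref_mod() does not
 * exist upstream.  The out-of-line tracepoint helper receives the old and
 * new counter values, does the sanity check, and fires the existing
 * tracepoint, so the inline fast path pays nothing when tracepoints are
 * disabled.
 */
void __page_ref_mod(struct page *page, int old_val, int new_val)
{
        /* The sub/dec variants would need the opposite comparison. */
        VM_BUG_ON_PAGE((unsigned int)new_val < (unsigned int)old_val, page);
        trace_page_ref_mod(page, new_val - old_val);
}

static inline void page_ref_add(struct page *page, int nr)
{
        int old_val = atomic_fetch_add(nr, &page->_refcount);

        if (page_ref_tracepoint_active(page_ref_mod))
                __page_ref_mod(page, old_val, old_val + nr);
}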
On Wed, Jan 26, 2022 at 1:59 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Wed, Jan 26, 2022 at 06:34:21PM +0000, Pasha Tatashin wrote:
> > The problems with page->_refcount are hard to debug, because usually
> > when they are detected, the damage has occurred a long time ago. Yet,
> > the problems with invalid page refcount may be catastrophic and lead to
> > memory corruptions.
> >
> > Reduce the scope of when the _refcount problems manifest themselves by
> > adding checks for underflows and overflows into functions that modify
> > _refcount.
>
> If you're chasing a bug like this, presumably you turn on page
> tracepoints. So could we reduce the cost of this by putting the
> VM_BUG_ON_PAGE parts into __page_ref_mod() et al? Yes, we'd need to
> change the arguments to those functions to pass in old & new, but that
> should be a cheap change compared to embedding the VM_BUG_ON_PAGE.

This is not only about chasing a bug. This is also about preventing
the memory corruption and information leaks that are caused by
ref_count bugs.

Several months ago a memory corruption bug was discovered by accident:
an engineer was studying a process core from a production system and
noticed that some memory does not look like it belongs to the original
process. We tried to manually reproduce that bug but failed. However,
later analysis by our team explained that the problem occurred due to a
ref_count bug in Linux, and the bug itself was root-caused and fixed
(mentioned in the cover letter). This work would have prevented similar
ref_count bugs from leading to memory corruption.

Pasha
On Wed, Jan 26, 2022 at 02:22:26PM -0500, Pasha Tatashin wrote:
> On Wed, Jan 26, 2022 at 1:59 PM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Wed, Jan 26, 2022 at 06:34:21PM +0000, Pasha Tatashin wrote:
> > > The problems with page->_refcount are hard to debug, because usually
> > > when they are detected, the damage has occurred a long time ago. Yet,
> > > the problems with invalid page refcount may be catastrophic and lead to
> > > memory corruptions.
> > >
> > > Reduce the scope of when the _refcount problems manifest themselves by
> > > adding checks for underflows and overflows into functions that modify
> > > _refcount.
> >
> > If you're chasing a bug like this, presumably you turn on page
> > tracepoints. So could we reduce the cost of this by putting the
> > VM_BUG_ON_PAGE parts into __page_ref_mod() et al? Yes, we'd need to
> > change the arguments to those functions to pass in old & new, but that
> > should be a cheap change compared to embedding the VM_BUG_ON_PAGE.
>
> This is not only about chasing a bug. This is also about preventing
> the memory corruption and information leaks that are caused by
> ref_count bugs.
>
> Several months ago a memory corruption bug was discovered by accident:
> an engineer was studying a process core from a production system and
> noticed that some memory does not look like it belongs to the original
> process. We tried to manually reproduce that bug but failed. However,
> later analysis by our team explained that the problem occurred due to a
> ref_count bug in Linux, and the bug itself was root-caused and fixed
> (mentioned in the cover letter). This work would have prevented similar
> ref_count bugs from leading to memory corruption.

But the VM_BUG_ON_PAGE tells us next to nothing useful. To take
your first example [1] as the kind of thing you say this is going to
help fix:

1. Page p is allocated by thread a (refcount 1)
2. Thread b gets mistaken pointer to p
3. Thread b calls put_page(), __put_page(), page goes to memory
   allocator.
4. Thread c calls alloc_page(), also gets page p (refcount 1 again).
5. Thread a calls put_page(), __put_page()
6. Thread c calls put_page() and gets a VM_BUG_ON_PAGE.

How do we find thread b's involvement? I don't think we can even see
thread a's involvement in all of this! All we know is a backtrace
pointing to thread c, who is a completely innocent bystander. I think
you have to enable page tracepoints to have any shot at finding thread
b's involvement.

[1] https://lore.kernel.org/stable/20211122171825.1582436-1-gthelen@google.com/
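For readers following along, the six steps compress into roughly the sequence
below. The helpers are real kernel APIs, but the listing merges three
concurrent threads into one flow purely for illustration and is not meant to
be compiled:

/* Threads a, b and c run concurrently; merged here for illustration only. */
struct page *p = alloc_page(GFP_KERNEL);  /* thread a: refcount 1            */
put_page(p);                              /* thread b: drops a reference it
                                             never took; refcount 0, page
                                             goes back to the allocator      */
struct page *q = alloc_page(GFP_KERNEL);  /* thread c: may receive the same
                                             page again, refcount 1          */
put_page(p);                              /* thread a: its legitimate put now
                                             frees thread c's page           */
put_page(q);                              /* thread c: 0 -> -1, the new
                                             VM_BUG_ON_PAGE() fires here     */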
On Wed, Jan 26, 2022 at 2:45 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Wed, Jan 26, 2022 at 02:22:26PM -0500, Pasha Tatashin wrote:
> > On Wed, Jan 26, 2022 at 1:59 PM Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > On Wed, Jan 26, 2022 at 06:34:21PM +0000, Pasha Tatashin wrote:
> > > > The problems with page->_refcount are hard to debug, because usually
> > > > when they are detected, the damage has occurred a long time ago. Yet,
> > > > the problems with invalid page refcount may be catastrophic and lead to
> > > > memory corruptions.
> > > >
> > > > Reduce the scope of when the _refcount problems manifest themselves by
> > > > adding checks for underflows and overflows into functions that modify
> > > > _refcount.
> > >
> > > If you're chasing a bug like this, presumably you turn on page
> > > tracepoints. So could we reduce the cost of this by putting the
> > > VM_BUG_ON_PAGE parts into __page_ref_mod() et al? Yes, we'd need to
> > > change the arguments to those functions to pass in old & new, but that
> > > should be a cheap change compared to embedding the VM_BUG_ON_PAGE.
> >
> > This is not only about chasing a bug. This is also about preventing
> > the memory corruption and information leaks that are caused by
> > ref_count bugs.
> >
> > Several months ago a memory corruption bug was discovered by accident:
> > an engineer was studying a process core from a production system and
> > noticed that some memory does not look like it belongs to the original
> > process. We tried to manually reproduce that bug but failed. However,
> > later analysis by our team explained that the problem occurred due to a
> > ref_count bug in Linux, and the bug itself was root-caused and fixed
> > (mentioned in the cover letter). This work would have prevented similar
> > ref_count bugs from leading to memory corruption.
>
> But the VM_BUG_ON_PAGE tells us next to nothing useful. To take
> your first example [1] as the kind of thing you say this is going to
> help fix:
>
> 1. Page p is allocated by thread a (refcount 1)
> 2. Thread b gets mistaken pointer to p

Thread b gets a mistaken pointer to p because of a bug in the kernel.
Different types of bugs can lead to such scenarios, and it is probably
not feasible to prevent all of them. However, one such scenario is that
we lost control of ref_count, and the page was then incorrectly
remapped or even copied (perhaps migrated) into another address space.

While studying the logs of the machine on which the double mapping
occurred, we noticed that ref_count had underflowed. This was the
smoking gun for the problem, and that is why we concentrated our search
for the root cause of the memory leak around places where ref_count can
be incorrectly modified.

This patch series ensures that once ref_count for some reason becomes
negative, we panic immediately, as there is a possibility that a leak
can occur.

The second benefit of this series is that it makes the ref_count
history contiguous: with this series we never reset the value to 0;
instead, we only operate using offsets and add/sub operations. This
helps with tracing the history of ref_count via tracepoints.

> 3. Thread b calls put_page(), __put_page(), page goes to memory
>    allocator.
> 4. Thread c calls alloc_page(), also gets page p (refcount 1 again).
> 5. Thread a calls put_page(), __put_page()
> 6. Thread c calls put_page() and gets a VM_BUG_ON_PAGE.
>
> How do we find thread b's involvement? I don't think we can even see
> thread a's involvement in all of this! All we know is a backtrace
> pointing to thread c, who is a completely innocent bystander. I think
> you have to enable page tracepoints to have any shot at finding thread
> b's involvement.

You are right, we cannot see thread b's involvement; we only get a
panic closer to the damage and hopefully before a leak occurs. Again,
this is just one of the mitigation techniques. Another one is the page
table check [2].

[2] https://lore.kernel.org/all/20211221154650.1047963-1-pasha.tatashin@soleen.com

> [1] https://lore.kernel.org/stable/20211122171825.1582436-1-gthelen@google.com/
On 1/26/22 20:22, Pasha Tatashin wrote:
> On Wed, Jan 26, 2022 at 1:59 PM Matthew Wilcox <willy@infradead.org> wrote:
>>
>> On Wed, Jan 26, 2022 at 06:34:21PM +0000, Pasha Tatashin wrote:
>> > The problems with page->_refcount are hard to debug, because usually
>> > when they are detected, the damage has occurred a long time ago. Yet,
>> > the problems with invalid page refcount may be catastrophic and lead to
>> > memory corruptions.
>> >
>> > Reduce the scope of when the _refcount problems manifest themselves by
>> > adding checks for underflows and overflows into functions that modify
>> > _refcount.
>>
>> If you're chasing a bug like this, presumably you turn on page
>> tracepoints. So could we reduce the cost of this by putting the
>> VM_BUG_ON_PAGE parts into __page_ref_mod() et al? Yes, we'd need to
>> change the arguments to those functions to pass in old & new, but that
>> should be a cheap change compared to embedding the VM_BUG_ON_PAGE.
>
> This is not only about chasing a bug. This is also about preventing
> the memory corruption and information leaks that are caused by
> ref_count bugs.

So you mean it as a security hardening feature, not just debugging? To me
it's dubious to put security hardening under CONFIG_DEBUG_VM. I think it's
just Fedora that uses DEBUG_VM in general production kernels?

> Several months ago a memory corruption bug was discovered by accident:
> an engineer was studying a process core from a production system and
> noticed that some memory does not look like it belongs to the original
> process. We tried to manually reproduce that bug but failed. However,
> later analysis by our team explained that the problem occurred due to a
> ref_count bug in Linux, and the bug itself was root-caused and fixed
> (mentioned in the cover letter). This work would have prevented similar
> ref_count bugs from leading to memory corruption.
>
> Pasha
On 1/26/22 19:34, Pasha Tatashin wrote:
> The problems with page->_refcount are hard to debug, because usually
> when they are detected, the damage has occurred a long time ago. Yet,
> the problems with invalid page refcount may be catastrophic and lead to
> memory corruptions.
>
> Reduce the scope of when the _refcount problems manifest themselves by
> adding checks for underflows and overflows into functions that modify
> _refcount.
>
> Use atomic_fetch_* functions to get the old values of the _refcount,
> and use it to check for overflow/underflow.
>
> Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
> ---
>  include/linux/page_ref.h | 59 +++++++++++++++++++++++++++++-----------
>  1 file changed, 43 insertions(+), 16 deletions(-)
>
> diff --git a/include/linux/page_ref.h b/include/linux/page_ref.h
> index 2e677e6ad09f..fe4864f7f69c 100644
> --- a/include/linux/page_ref.h
> +++ b/include/linux/page_ref.h
> @@ -117,7 +117,10 @@ static inline void init_page_count(struct page *page)
>
>  static inline void page_ref_add(struct page *page, int nr)
>  {
> -        atomic_add(nr, &page->_refcount);
> +        int old_val = atomic_fetch_add(nr, &page->_refcount);
> +        int new_val = old_val + nr;
> +
> +        VM_BUG_ON_PAGE((unsigned int)new_val < (unsigned int)old_val, page);

This seems somewhat weird, as it will trigger not just on overflow, but also
if nr is negative. Which I think is valid usage, even though the function
has 'add' in name, because 'nr' is signed?
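Concretely, the unsigned comparison flags any case where the counter, viewed as
unsigned, did not increase, not only a wrapped addition. A small userspace
illustration of the same expression (plain C, not kernel code):

#include <assert.h>

/* Mimics the expression in page_ref_add(): it fires whenever the counter,
 * viewed as unsigned, did not increase, which covers a wrap past 0xffffffff
 * but also a perfectly ordinary negative 'nr'.
 */
static int would_trigger(int old_val, int nr)
{
        int new_val = old_val + nr;

        return (unsigned int)new_val < (unsigned int)old_val;
}

int main(void)
{
        assert(!would_trigger(1, 1));   /* normal increment: no trigger          */
        assert(would_trigger(-1, 1));   /* counter was already negative: trigger */
        assert(would_trigger(2, -1));   /* negative nr: also triggers            */
        return 0;
}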
> > This is not only about chasing a bug. This is also about preventing
> > the memory corruption and information leaks that are caused by
> > ref_count bugs.
>
> So you mean it as a security hardening feature, not just debugging? To me
> it's dubious to put security hardening under CONFIG_DEBUG_VM. I think it's
> just Fedora that uses DEBUG_VM in general production kernels?

In our (Google) internal kernel, I added another macro, PAGE_REF_BUG(cond,
page), to replace VM_BUG_ON_PAGE() in page_ref.h. The new macro keeps the
asserts always enabled. I was thinking of adding something like this to the
upstream kernel as well, but I am worried about the performance implications
of having extra conditions in these routines, so I think we would need yet
another config option which decouples DEBUG_VM from some security-crucial VM
asserts. To avoid controversial discussions, I decided not to do this as part
of this series, and perhaps do it as follow-up work.

Pasha
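The internal macro itself was not posted; assuming it simply mirrors
VM_BUG_ON_PAGE() without the CONFIG_DEBUG_VM dependency, one possible shape
(with a made-up CONFIG_PAGE_REF_HARDEN option standing in for the extra config
mentioned above) would be:

/*
 * Hypothetical sketch, not the internal implementation.  CONFIG_PAGE_REF_HARDEN
 * is a made-up option used only to show how the assert could stay enabled
 * independently of CONFIG_DEBUG_VM.
 */
#ifdef CONFIG_PAGE_REF_HARDEN
#define PAGE_REF_BUG(cond, page)                                        \
do {                                                                    \
        if (unlikely(cond)) {                                           \
                dump_page(page, "page_ref overflow/underflow");        \
                BUG();                                                  \
        }                                                               \
} while (0)
#else
#define PAGE_REF_BUG(cond, page)        VM_BUG_ON_PAGE(cond, page)
#endif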
> > diff --git a/include/linux/page_ref.h b/include/linux/page_ref.h
> > index 2e677e6ad09f..fe4864f7f69c 100644
> > --- a/include/linux/page_ref.h
> > +++ b/include/linux/page_ref.h
> > @@ -117,7 +117,10 @@ static inline void init_page_count(struct page *page)
> >
> >  static inline void page_ref_add(struct page *page, int nr)
> >  {
> > -        atomic_add(nr, &page->_refcount);
> > +        int old_val = atomic_fetch_add(nr, &page->_refcount);
> > +        int new_val = old_val + nr;
> > +
> > +        VM_BUG_ON_PAGE((unsigned int)new_val < (unsigned int)old_val, page);
>
> This seems somewhat weird, as it will trigger not just on overflow, but also
> if nr is negative. Which I think is valid usage, even though the function
> has 'add' in name, because 'nr' is signed?

I have not found any places in the mainline kernel where nr is negative
in page_ref_add(). I think that by adding this assert we ensure that when
'add' shows up in backtraces it can be assured that the ref count has
increased, and when page_ref_sub() shows up it means it decreased. It is
strange to have both functions and yet allow them to do the opposite. We
can also change the type to unsigned.

Pasha
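If the type were changed as suggested, the helper might look roughly like the
sketch below. This is not part of the posted patch; the wraparound check keeps
the same form as in the patch, and the unsigned parameter only makes the
"this can only increase" intent explicit in the signature:

/* Sketch only, not part of the posted patch. */
static inline void page_ref_add(struct page *page, unsigned int nr)
{
        int old_val = atomic_fetch_add(nr, &page->_refcount);
        int new_val = old_val + nr;

        VM_BUG_ON_PAGE((unsigned int)new_val < (unsigned int)old_val, page);
        if (page_ref_tracepoint_active(page_ref_mod))
                __page_ref_mod(page, nr);
}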
diff --git a/include/linux/page_ref.h b/include/linux/page_ref.h
index 2e677e6ad09f..fe4864f7f69c 100644
--- a/include/linux/page_ref.h
+++ b/include/linux/page_ref.h
@@ -117,7 +117,10 @@ static inline void init_page_count(struct page *page)

 static inline void page_ref_add(struct page *page, int nr)
 {
-        atomic_add(nr, &page->_refcount);
+        int old_val = atomic_fetch_add(nr, &page->_refcount);
+        int new_val = old_val + nr;
+
+        VM_BUG_ON_PAGE((unsigned int)new_val < (unsigned int)old_val, page);
         if (page_ref_tracepoint_active(page_ref_mod))
                 __page_ref_mod(page, nr);
 }
@@ -129,7 +132,10 @@ static inline void folio_ref_add(struct folio *folio, int nr)

 static inline void page_ref_sub(struct page *page, int nr)
 {
-        atomic_sub(nr, &page->_refcount);
+        int old_val = atomic_fetch_sub(nr, &page->_refcount);
+        int new_val = old_val - nr;
+
+        VM_BUG_ON_PAGE((unsigned int)new_val > (unsigned int)old_val, page);
         if (page_ref_tracepoint_active(page_ref_mod))
                 __page_ref_mod(page, -nr);
 }
@@ -141,11 +147,13 @@ static inline void folio_ref_sub(struct folio *folio, int nr)

 static inline int page_ref_sub_return(struct page *page, int nr)
 {
-        int ret = atomic_sub_return(nr, &page->_refcount);
+        int old_val = atomic_fetch_sub(nr, &page->_refcount);
+        int new_val = old_val - nr;

+        VM_BUG_ON_PAGE((unsigned int)new_val > (unsigned int)old_val, page);
         if (page_ref_tracepoint_active(page_ref_mod_and_return))
-                __page_ref_mod_and_return(page, -nr, ret);
-        return ret;
+                __page_ref_mod_and_return(page, -nr, new_val);
+        return new_val;
 }

 static inline int folio_ref_sub_return(struct folio *folio, int nr)
@@ -155,7 +163,10 @@ static inline int folio_ref_sub_return(struct folio *folio, int nr)

 static inline void page_ref_inc(struct page *page)
 {
-        atomic_inc(&page->_refcount);
+        int old_val = atomic_fetch_inc(&page->_refcount);
+        int new_val = old_val + 1;
+
+        VM_BUG_ON_PAGE((unsigned int)new_val < (unsigned int)old_val, page);
         if (page_ref_tracepoint_active(page_ref_mod))
                 __page_ref_mod(page, 1);
 }
@@ -167,7 +178,10 @@ static inline void folio_ref_inc(struct folio *folio)

 static inline void page_ref_dec(struct page *page)
 {
-        atomic_dec(&page->_refcount);
+        int old_val = atomic_fetch_dec(&page->_refcount);
+        int new_val = old_val - 1;
+
+        VM_BUG_ON_PAGE((unsigned int)new_val > (unsigned int)old_val, page);
         if (page_ref_tracepoint_active(page_ref_mod))
                 __page_ref_mod(page, -1);
 }
@@ -179,8 +193,11 @@ static inline void folio_ref_dec(struct folio *folio)

 static inline int page_ref_sub_and_test(struct page *page, int nr)
 {
-        int ret = atomic_sub_and_test(nr, &page->_refcount);
+        int old_val = atomic_fetch_sub(nr, &page->_refcount);
+        int new_val = old_val - nr;
+        int ret = new_val == 0;

+        VM_BUG_ON_PAGE((unsigned int)new_val > (unsigned int)old_val, page);
         if (page_ref_tracepoint_active(page_ref_mod_and_test))
                 __page_ref_mod_and_test(page, -nr, ret);
         return ret;
@@ -193,11 +210,13 @@ static inline int folio_ref_sub_and_test(struct folio *folio, int nr)

 static inline int page_ref_inc_return(struct page *page)
 {
-        int ret = atomic_inc_return(&page->_refcount);
+        int old_val = atomic_fetch_inc(&page->_refcount);
+        int new_val = old_val + 1;

+        VM_BUG_ON_PAGE((unsigned int)new_val < (unsigned int)old_val, page);
         if (page_ref_tracepoint_active(page_ref_mod_and_return))
-                __page_ref_mod_and_return(page, 1, ret);
-        return ret;
+                __page_ref_mod_and_return(page, 1, new_val);
+        return new_val;
 }

 static inline int folio_ref_inc_return(struct folio *folio)
@@ -207,8 +226,11 @@ static inline int folio_ref_inc_return(struct folio *folio)

 static inline int page_ref_dec_and_test(struct page *page)
 {
-        int ret = atomic_dec_and_test(&page->_refcount);
+        int old_val = atomic_fetch_dec(&page->_refcount);
+        int new_val = old_val - 1;
+        int ret = new_val == 0;

+        VM_BUG_ON_PAGE((unsigned int)new_val > (unsigned int)old_val, page);
         if (page_ref_tracepoint_active(page_ref_mod_and_test))
                 __page_ref_mod_and_test(page, -1, ret);
         return ret;
@@ -221,11 +243,13 @@ static inline int folio_ref_dec_and_test(struct folio *folio)

 static inline int page_ref_dec_return(struct page *page)
 {
-        int ret = atomic_dec_return(&page->_refcount);
+        int old_val = atomic_fetch_dec(&page->_refcount);
+        int new_val = old_val - 1;

+        VM_BUG_ON_PAGE((unsigned int)new_val > (unsigned int)old_val, page);
         if (page_ref_tracepoint_active(page_ref_mod_and_return))
-                __page_ref_mod_and_return(page, -1, ret);
-        return ret;
+                __page_ref_mod_and_return(page, -1, new_val);
+        return new_val;
 }

 static inline int folio_ref_dec_return(struct folio *folio)
@@ -235,8 +259,11 @@ static inline int folio_ref_dec_return(struct folio *folio)

 static inline bool page_ref_add_unless(struct page *page, int nr, int u)
 {
-        bool ret = atomic_add_unless(&page->_refcount, nr, u);
+        int old_val = atomic_fetch_add_unless(&page->_refcount, nr, u);
+        int new_val = old_val + nr;
+        int ret = old_val != u;

+        VM_BUG_ON_PAGE(ret && (unsigned int)new_val < (unsigned int)old_val, page);
         if (page_ref_tracepoint_active(page_ref_mod_unless))
                 __page_ref_mod_unless(page, nr, ret);
         return ret;
The problems with page->_refcount are hard to debug, because usually
when they are detected, the damage has occurred a long time ago. Yet,
the problems with invalid page refcount may be catastrophic and lead to
memory corruptions.

Reduce the scope of when the _refcount problems manifest themselves by
adding checks for underflows and overflows into functions that modify
_refcount.

Use atomic_fetch_* functions to get the old values of the _refcount,
and use it to check for overflow/underflow.

Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
---
 include/linux/page_ref.h | 59 +++++++++++++++++++++++++++++-----------
 1 file changed, 43 insertions(+), 16 deletions(-)
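To see the underflow side of the check in isolation, here is a small userspace
analogue of the pattern used throughout the patch; GCC/Clang __atomic builtins
stand in for the kernel's atomic_fetch_sub(), and fprintf()+abort() stands in
for VM_BUG_ON_PAGE():

#include <stdio.h>
#include <stdlib.h>

/* Userspace stand-in for page_ref_sub(): fetch the old value, compute the
 * new one, and treat any increase of the unsigned interpretation (i.e. a
 * wrap below zero) as a fatal bug, right at the modification site.
 */
static void ref_sub(int *refcount, int nr)
{
        int old_val = __atomic_fetch_sub(refcount, nr, __ATOMIC_RELAXED);
        int new_val = old_val - nr;

        if ((unsigned int)new_val > (unsigned int)old_val) {
                fprintf(stderr, "refcount underflow: %d -> %d\n", old_val, new_val);
                abort();
        }
}

int main(void)
{
        int refcount = 1;

        ref_sub(&refcount, 1);  /* legitimate put: 1 -> 0                        */
        ref_sub(&refcount, 1);  /* buggy extra put: 0 -> -1, caught immediately  */
        return 0;
}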