Message ID | 20230329151121.949896-1-jiaqiyan@google.com (mailing list archive) |
---|---|
Series | Memory poison recovery in khugepaged collapsing |
Friendly ping for review :)

On Wed, Mar 29, 2023 at 8:11 AM Jiaqi Yan <jiaqiyan@google.com> wrote:
> Problem
> =======
> Memory DIMMs are subject to multi-bit flips, i.e. memory errors.
> As memory size and density increase, the chances of and number of
> memory errors increase. The increasing size and density of server
> RAM in the data center and cloud have shown increased uncorrectable
> memory errors. There are already mechanisms in the kernel to recover
> from uncorrectable memory errors. This series of patches provides
> the recovery mechanism for the particular kernel agent khugepaged
> when it collapses memory pages.
>
> Impact
> ======
> The main reason we chose to make khugepaged collapsing tolerant of
> memory failures was its high possibility of accessing poisoned memory
> while performing functionally optional compaction actions.
> Standard applications typically don't have strict requirements on
> the size of their pages. So they are given 4K pages by the kernel.
> The kernel is able to improve application performance by either
>
>   1) giving applications 2M pages to begin with, or
>   2) collapsing 4K pages into 2M pages when possible.
>
> This collapsing operation is done by khugepaged, a kernel agent that
> is constantly scanning memory. When collapsing 4K pages into a 2M page,
> it must copy the data from the 4K pages into a physically contiguous
> 2M page. Therefore, as long as there exists one poisoned cache line in
> collapsible 4K pages, khugepaged will eventually access it. The current
> impact to users is a kernel panic triggered by a machine check exception.
> However, khugepaged’s compaction operations are not functionally required
> kernel actions. Therefore making khugepaged tolerant of poisoned memory
> will greatly improve user experience.
>
> This patch series is for cases where khugepaged is the first guy
> that detects the memory errors on the poisoned pages. IOW, the pages
> are not known to have memory errors when khugepaged collapsing gets to
> them. In our observation, this happens frequently when the huge page
> ratio of the system is relatively low, which is fairly common in
> virtual machines running in the cloud.
>
> Solution
> ========
> As stated before, it is less desirable to crash the system only because
> khugepaged accesses poisoned pages while it is collapsing 4K pages.
> The high level idea of this patch series is to skip the group of pages
> (usually 512 4K-size pages) once khugepaged finds one of them is poisoned,
> as these pages have become ineligible to be collapsed.
>
> We are also careful to unwind operations khugepaged has performed before
> it detects memory failures. For example, before copying and collapsing
> a group of anonymous pages into a huge page, the source pages will be
> isolated and their page table is unlinked from their PMD. These operations
> need to be undone in order to ensure these pages are not changed/lost from
> the perspective of other threads (both user and kernel space). As for
> file backed memory pages, there already exists a rollback case. This
> patch just extends it so that khugepaged also correctly rolls back when
> it fails to copy poisoned 4K pages.
>
> Changelog
> =========
> v12 changes
> - Incorporate feedback from Yang Shi <shy828301@gmail.com>.
> - Drop unused pmd from __collapse_huge_page_copy_succeeded.
> - Drop unused address from __collapse_huge_page_copy_failed.
> - smp_mb() should be after filemap_nr_thps_dec.
> - This revision is rebased to mm-unstable at commit 9b175ce664d33
>   ("mm: move free_area_empty() to mm/internal.h")
>
> v11 changes
> - Incorporate feedback from Yang Shi <shy828301@gmail.com> and Hugh
>   Dickins <hughd@google.com>
> - Replace releasing pages for-loop with release_pte_pages in
>   __collapse_huge_page_copy_failed.
> - Rename pte_ptl to ptl in __collapse_huge_page_copy_succeeded.
> - Fix a bug in __collapse_huge_page_copy_succeeded: ptep_clear should be
>   used instead of pte_clear.
> - Drop _address in __collapse_huge_page_copy_succeeded.
> - Add smp_mb() before updating filemap_nr_thps_dec.
> - Move `nr = thp_nr_pages()` closer to its references.
> - Remove an unnecessary goto statement.
> - This revision is rebased to mm-unstable at commit b4e1277ee31db
>   ("xtensa: reword ARCH_FORCE_MAX_ORDER prompt and help text")
>
> v10 changes
> - Incorporate feedback from Kirill A. Shutemov
>   <kirill.shutemov@linux.intel.com>
> - Refactor the 2nd loop (after the loop for copying memory) into 2 helper
>   functions, one for actions to take when copying succeeded, one for when
>   copying failed due to #MC.
> - Use copy_mc_user_highpage for anonymous memory.
> - Introduce copy_mc_highpage and use it for file-backed memory.
> - Rename the original PMD from `rollback` to `orig_pmd`.
> - Some minor changes in comments, e.g. `normal page` to `raw page`.
> - This revision is rebased to mm-unstable at commit df3ae4347aff9
>   ("dma-buf: system_heap: avoid reclaim for order 4")
>
> v9 changes
> - Incorporate feedback from Andrew Morton <akpm@linux-foundation.org>
> - Move copy_mc_highpage into khugepaged.c as a static out-of-line
>   function copy_mc_page.
>
> v8 changes
> - Incorporate feedback from Tony Luck <tony.luck@intel.com>
> - Rename copy_highpage_mc to copy_mc_highpage.
> - Update copy_mc_highpage with kmsan changes.
> - Code style changes:
>   1) copy_mc_highpage returns int as "copy" is an action and is consistent
>      with copy_mc_user_highpage.
>   2) __collapse_huge_page_copy returns scan_result(int) and is consistent
>      with __collapse_huge_page_isolate/swapin.
>   3) variables are declared on separate lines in collapse_file.
>
> v7 changes
> - Fix a bug "KASAN: stack-out-of-bounds Read in collapse_file". After
>   copying all pages into the huge page, clear_highpage should use index
>   instead of page->index.
>
> v6 changes
> - Address comments from Kirill Shutemov <kirill@shutemov.name>
> - Rewrite __collapse_huge_page_copy to make rollback operations more
>   clear to its reader.
> - Add detailed test steps in each commit message.
>
> v5 changes
> - Rebase patches to mm-unstable at
>   commit ffb39098bf87 ("Merge tag 'linux-kselftest-kunit-6.1-rc1' of
>   git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest").
> - Resolves conflicts with:
>   commit 2f55f070e5b8 ("mm/khugepaged: minor cleanup for collapse_file")
>   commit 1baec203b77c ("mm/khugepaged: try to free transhuge swapcache
>   when possible")
>
> v4 changes
> - Incorporate feedback from Yang Shi <shy828301@gmail.com>
> - Remove tracepoint for __collapse_huge_page_copy, just keep SCAN_COPY_MC
>   and let trace_mm_collapse_huge_page handle it
> - Remove unnecessary comments
>
> v3 changes
> - Incorporate feedback from Yang Shi <shy828301@gmail.com>
> - Add tracepoint for __collapse_huge_page_copy
> - Restore PMD in collapse_huge_page
> - Correct comment about mmap_read_lock
>
> v2 changes
> - Incorporate feedback from Yang Shi <shy828301@gmail.com>
> - Only keep copy_highpage_mc
> - Adding new scan_result SCAN_COPY_MC
> - Defer NR_FILE_THPS update until copying succeeded
>
> Jiaqi Yan (3):
>   mm/khugepaged: recover from poisoned anonymous memory
>   mm/hwpoison: introduce copy_mc_highpage
>   mm/khugepaged: recover from poisoned file-backed memory
>
>  include/linux/highmem.h            |  54 ++++++--
>  include/trace/events/huge_memory.h |   3 +-
>  mm/khugepaged.c                    | 200 ++++++++++++++++++++++-------
>  3 files changed, 198 insertions(+), 59 deletions(-)
>
> --
> 2.40.0.348.gf938b09366-goog
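For readers who want a concrete picture of the "skip the group and unwind" idea described in the Solution section of the cover letter, here is a minimal user-space sketch in C. It is not the kernel code from this series: `struct small_page`, `copy_page_checked`, `collapse_pages`, and the `COLLAPSE_*` result codes are hypothetical names invented for illustration, and the `poisoned` flag merely simulates what a machine-check-safe copy would report. The sketch only shows the control flow: isolate the source pages, copy them one by one into the contiguous destination, and on the first failed copy undo the isolation so the source pages are left as they were.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SMALL_PAGE_SIZE 4096u /* one 4K page */
#define PAGES_PER_HUGE  512u  /* 512 * 4K = 2M */

/* Hypothetical result codes, loosely modelled on a "scan result". */
enum collapse_result { COLLAPSE_OK, COLLAPSE_COPY_MC };

/* Hypothetical stand-in for a 4K source page. */
struct small_page {
	unsigned char data[SMALL_PAGE_SIZE];
	bool poisoned; /* simulates an uncorrectable memory error */
	bool isolated; /* simulates "taken away from other users" */
};

/*
 * Stand-in for a machine-check-safe copy: returns 0 on success and -1
 * when the source is poisoned, instead of crashing on the bad read.
 */
static int copy_page_checked(unsigned char *dst, const struct small_page *src)
{
	if (src->poisoned)
		return -1;
	memcpy(dst, src->data, SMALL_PAGE_SIZE);
	return 0;
}

/*
 * Copy n small pages into the contiguous buffer `huge`. If any copy
 * fails, undo the isolation of every source page and report the
 * failure; the partially filled destination is simply abandoned.
 */
enum collapse_result collapse_pages(struct small_page *pages, size_t n,
				    unsigned char *huge)
{
	size_t i, j;

	assert(n > 0 && n <= PAGES_PER_HUGE);

	/* Step 1: isolate all source pages before touching their data. */
	for (i = 0; i < n; i++)
		pages[i].isolated = true;

	/* Step 2: copy page by page, stopping at the first failure. */
	for (i = 0; i < n; i++) {
		if (copy_page_checked(huge + i * SMALL_PAGE_SIZE, &pages[i]) != 0) {
			/* Rollback: the whole group is skipped, not just one page. */
			for (j = 0; j < n; j++)
				pages[j].isolated = false;
			return COLLAPSE_COPY_MC;
		}
	}

	/* Success: source pages stay isolated; a real caller would free them. */
	return COLLAPSE_OK;
}
```

A caller would invoke `collapse_pages(pages, n, huge)` and, on `COLLAPSE_COPY_MC`, leave the small pages in place and move on to the next candidate range. The real series additionally has to restore page table state (the PMD) and handle the file-backed case, which this sketch does not attempt to model.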
On Tue, 4 Apr 2023 11:44:16 -0700 Jiaqi Yan <jiaqiyan@google.com> wrote:
> Friendly ping for review :)
That would be nice. This series has been under test for over a month
without incident. I'll be moving it into mm-stable soon, unless
someone has a reason for not doing that.
On Tue, Apr 4, 2023 at 11:44 AM Jiaqi Yan <jiaqiyan@google.com> wrote:
>
> Friendly ping for review :)

Both Hugh and I already gave reviewed/acked for the previous version.
Since there were just some minor changes, you could keep the
reviewed/acked from the previous version.
On Tue, Apr 4, 2023 at 8:57 PM Yang Shi <shy828301@gmail.com> wrote:
>
> On Tue, Apr 4, 2023 at 11:44 AM Jiaqi Yan <jiaqiyan@google.com> wrote:
> >
> > Friendly ping for review :)
>
> Both Hugh and I already gave reviewed/acked for the previous version.
> Since there were just some minor changes, you could keep the
> reviewed/acked from the previous version.

Thanks Yang!
Andrew, is there still anything I need to do at this point (e.g.
resend V12 with reviewed/acked tags in commits)?
Or are you fine with this V12 to be merged?
On Thu, 6 Apr 2023 11:12:14 -0700 Jiaqi Yan <jiaqiyan@google.com> wrote:

> On Tue, Apr 4, 2023 at 8:57 PM Yang Shi <shy828301@gmail.com> wrote:
> >
> > On Tue, Apr 4, 2023 at 11:44 AM Jiaqi Yan <jiaqiyan@google.com> wrote:
> > >
> > > Friendly ping for review :)
> >
> > Both Hugh and I already gave reviewed/acked for the previous version.
> > Since there were just some minor changes, you could keep the
> > reviewed/acked from the previous version.
>
> Thanks Yang!
> Andrew, is there still anything I need to do at this point (e.g.
> resend V12 with reviewed/acked tags in commits)?
> Or are you fine with this V12 to be merged?

We're good. I'll move this series from mm-unstable into mm-stable next
week.