Message ID | 20241116175922.3265872-1-pasha.tatashin@soleen.com (mailing list archive) |
---|---|
Headers | show |
Series | Page Detective | expand |
On Sat, Nov 16, 2024 at 05:59:16PM +0000, Pasha Tatashin wrote: > Page Detective is a new kernel debugging tool that provides detailed > information about the usage and mapping of physical memory pages. > > It is often known that a particular page is corrupted, but it is hard to > extract more information about such a page from live system. Examples > are: > > - Checksum failure during live migration > - Filesystem journal failure > - dump_page warnings on the console log > - Unexcpected segfaults > > Page Detective helps to extract more information from the kernel, so it > can be used by developers to root cause the associated problem. I like the _concept_ of providing more information like this. But you've bizarrely gone to great lengths to expose mm internal implementation details to drivers in order to implement this as a driver. This is _very clearly_ an mm thing, and _very clearly_ subject to change depending on how mm changes. It should live under mm/ and not be a loadable driver. I am also very very much not in favour of re-implementing yet another page table walker, this time in driver code (!). Please no. So NACK in its current form. This has to be implemented within mm if we are to take it. I'm also concerned about its scalability and impact on the system, as it takes every single mm lock in the system at once, which seems kinda unwise or at least problematic, and not something we want happening outside of mm, at any rate. > > It operates through the Linux debugfs interface, with two files: "virt" > and "phys". > > The "virt" file takes a virtual address and PID and outputs information > about the corresponding page. > > The "phys" file takes a physical address and outputs information about > that page. > > The output is presented via kernel log messages (can be accessed with > dmesg), and includes information such as the page's reference count, > mapping, flags, and memory cgroup. It also shows whether the page is > mapped in the kernel page table, and if so, how many times. I mean, even though I'm not a huge fan of kernel pointer hashing etc. this is obviously leaking as much information as you might want about kernel internal state to the point of maybe making the whole kernel pointer hashing thing moot. I know this requires CAP_SYS_ADMIN, but there are things that also require that which _still_ obscure kernel pointers. And you're outputting it all to dmesg. So yeah, a security person (Jann?) would be better placed to comment on this than me, but are we sure we want to do this when not in a CONFIG_DEBUG_VM* kernel? > > Pasha Tatashin (6): > mm: Make get_vma_name() function public > pagewalk: Add a page table walker for init_mm page table > mm: Add a dump_page variant that accept log level argument > misc/page_detective: Introduce Page Detective > misc/page_detective: enable loadable module > selftests/page_detective: Introduce self tests for Page Detective > > Documentation/misc-devices/index.rst | 1 + > Documentation/misc-devices/page_detective.rst | 78 ++ > MAINTAINERS | 8 + > drivers/misc/Kconfig | 11 + > drivers/misc/Makefile | 1 + > drivers/misc/page_detective.c | 808 ++++++++++++++++++ > fs/inode.c | 18 +- > fs/kernfs/dir.c | 1 + > fs/proc/task_mmu.c | 61 -- > include/linux/fs.h | 5 +- > include/linux/mmdebug.h | 1 + > include/linux/pagewalk.h | 2 + > kernel/pid.c | 1 + > mm/debug.c | 53 +- > mm/memcontrol.c | 1 + > mm/oom_kill.c | 1 + > mm/pagewalk.c | 32 + > mm/vma.c | 60 ++ > tools/testing/selftests/Makefile | 1 + > .../selftests/page_detective/.gitignore | 1 + > .../testing/selftests/page_detective/Makefile | 7 + > tools/testing/selftests/page_detective/config | 4 + > .../page_detective/page_detective_test.c | 727 ++++++++++++++++ > 23 files changed, 1787 insertions(+), 96 deletions(-) > create mode 100644 Documentation/misc-devices/page_detective.rst > create mode 100644 drivers/misc/page_detective.c > create mode 100644 tools/testing/selftests/page_detective/.gitignore > create mode 100644 tools/testing/selftests/page_detective/Makefile > create mode 100644 tools/testing/selftests/page_detective/config > create mode 100644 tools/testing/selftests/page_detective/page_detective_test.c > > -- > 2.47.0.338.g60cca15819-goog >
On Mon, Nov 18, 2024 at 12:17 PM Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote: > On Sat, Nov 16, 2024 at 05:59:16PM +0000, Pasha Tatashin wrote: > > It operates through the Linux debugfs interface, with two files: "virt" > > and "phys". > > > > The "virt" file takes a virtual address and PID and outputs information > > about the corresponding page. > > > > The "phys" file takes a physical address and outputs information about > > that page. > > > > The output is presented via kernel log messages (can be accessed with > > dmesg), and includes information such as the page's reference count, > > mapping, flags, and memory cgroup. It also shows whether the page is > > mapped in the kernel page table, and if so, how many times. > > I mean, even though I'm not a huge fan of kernel pointer hashing etc. this > is obviously leaking as much information as you might want about kernel > internal state to the point of maybe making the whole kernel pointer > hashing thing moot. > > I know this requires CAP_SYS_ADMIN, but there are things that also require > that which _still_ obscure kernel pointers. > > And you're outputting it all to dmesg. > > So yeah, a security person (Jann?) would be better placed to comment on > this than me, but are we sure we want to do this when not in a > CONFIG_DEBUG_VM* kernel? I guess there are two parts to this - what root is allowed to do, and what information we're fine with exposing to dmesg. If the lockdown LSM is not set to LOCKDOWN_CONFIDENTIALITY_MAX, the kernel allows root to read kernel memory through some interfaces - in particular, BPF allows reading arbitrary kernel memory, and perf allows reading at least some stuff (like kernel register states). With lockdown in the most restrictive mode, the kernel tries to prevent root from reading arbitrary kernel memory, but we don't really change how much information goes into dmesg. (And I imagine you could probably still get kernel pointers out of BPF somehow even in the most restrictive lockdown mode, but that's probably not relevant.) The main issue with dmesg is that some systems make its contents available to code that is not running with root privileges; and I think it is also sometimes stored persistently in unencrypted form (like in EFI pstore) even when everything else on the system is encrypted. So on one hand, we definitely shouldn't print the contents of random chunks of memory into dmesg without a good reason; on the other hand, for example we do already print kernel register state on WARN() (which often includes kernel pointers and could theoretically include more sensitive data too). So I think showing page metadata to root when requested is probably okay as a tradeoff? And dumping that data into dmesg is maybe not great, but acceptable as long as only root can actually trigger this? I don't really have a strong opinion on this... To me, a bigger issue is that dump_page() looks like it might be racy, which is maybe not terrible in debugging code that only runs when something has already gone wrong, but bad if it is in code that root can trigger on demand? __dump_page() copies the given page with memcpy(), which I don't think guarantees enough atomicity with concurrent updates of page->mapping or such, so dump_mapping() could probably run on a bogus pointer. Even without torn pointers, I think there could be a UAF if the page's mapping is destroyed while we're going through dump_page(), since the page might not be locked. And in dump_mapping(), the strncpy_from_kernel_nofault() also doesn't guard against concurrent renaming of the dentry, which I think again would probably result in UAF. So I think dump_page() in its current form is not something we should expose to a userspace-reachable API.
On Sat, Nov 16, 2024 at 05:59:16PM +0000, Pasha Tatashin wrote: > Page Detective is a new kernel debugging tool that provides detailed > information about the usage and mapping of physical memory pages. > > It is often known that a particular page is corrupted, but it is hard to > extract more information about such a page from live system. Examples > are: > > - Checksum failure during live migration > - Filesystem journal failure > - dump_page warnings on the console log > - Unexcpected segfaults > > Page Detective helps to extract more information from the kernel, so it > can be used by developers to root cause the associated problem. > > It operates through the Linux debugfs interface, with two files: "virt" > and "phys". > > The "virt" file takes a virtual address and PID and outputs information > about the corresponding page. > > The "phys" file takes a physical address and outputs information about > that page. > > The output is presented via kernel log messages (can be accessed with > dmesg), and includes information such as the page's reference count, > mapping, flags, and memory cgroup. It also shows whether the page is > mapped in the kernel page table, and if so, how many times. This looks questionable both from the security and convenience points of view. Given the request-response nature of the interface, the output can be provided using a "normal" seq-based pseudo-file. But I have a more generic question: doesn't it make sense to implement it as a set of drgn scripts instead of kernel code? This provides more flexibility, is safer (even if it's buggy, you won't crash the host) and should be at least in theory equally powerful. Thanks!
On Mon, Nov 18, 2024 at 2:11 PM Roman Gushchin <roman.gushchin@linux.dev> wrote: > > On Sat, Nov 16, 2024 at 05:59:16PM +0000, Pasha Tatashin wrote: > > Page Detective is a new kernel debugging tool that provides detailed > > information about the usage and mapping of physical memory pages. > > > > It is often known that a particular page is corrupted, but it is hard to > > extract more information about such a page from live system. Examples > > are: > > > > - Checksum failure during live migration > > - Filesystem journal failure > > - dump_page warnings on the console log > > - Unexcpected segfaults > > > > Page Detective helps to extract more information from the kernel, so it > > can be used by developers to root cause the associated problem. > > > > It operates through the Linux debugfs interface, with two files: "virt" > > and "phys". > > > > The "virt" file takes a virtual address and PID and outputs information > > about the corresponding page. > > > > The "phys" file takes a physical address and outputs information about > > that page. > > > > The output is presented via kernel log messages (can be accessed with > > dmesg), and includes information such as the page's reference count, > > mapping, flags, and memory cgroup. It also shows whether the page is > > mapped in the kernel page table, and if so, how many times. > > This looks questionable both from the security and convenience points of view. > Given the request-response nature of the interface, the output can be > provided using a "normal" seq-based pseudo-file. We opted to use dmesg for output because it's the standard method for capturing kernel information and is commonly included in bug reports. Introducing a new file would require modifying existing data collection scripts used for reporting, so this approach minimizes disruption to existing workflows. > But I have a more generic question: > doesn't it make sense to implement it as a set of drgn scripts instead > of kernel code? This provides more flexibility, is safer (even if it's buggy, > you won't crash the host) and should be at least in theory equally > powerful. Regarding your suggestion, our plan is to perform reverse lookups in all page tables: kernel, user, IOMMU, and KVM. Currently, we only traverse the kernel and user page tables, but we intend to extend this functionality to IOMMU and KVM tables in future updates, I am not sure if drgn can provide this level of details within a reasonable amount of time. This approach will be helpful for debugging memory corruption scenarios. Often, external mechanisms detect corruption but require kernel-level information for root cause analysis. In our experience, invalid mappings persist in page tables for a period after corruption, providing a window to identify other users of the corrupted page via timely reverse lookup. Additionally, using crash/drgn is not feasible for us at this time, it requires keeping external tools on our hosts, also it requires approval and a security review for each script before deployment in our fleet. Thanks, Pasha
On Mon, Nov 18, 2024 at 7:54 AM Jann Horn <jannh@google.com> wrote: > > On Mon, Nov 18, 2024 at 12:17 PM Lorenzo Stoakes > <lorenzo.stoakes@oracle.com> wrote: > > On Sat, Nov 16, 2024 at 05:59:16PM +0000, Pasha Tatashin wrote: > > > It operates through the Linux debugfs interface, with two files: "virt" > > > and "phys". > > > > > > The "virt" file takes a virtual address and PID and outputs information > > > about the corresponding page. > > > > > > The "phys" file takes a physical address and outputs information about > > > that page. > > > > > > The output is presented via kernel log messages (can be accessed with > > > dmesg), and includes information such as the page's reference count, > > > mapping, flags, and memory cgroup. It also shows whether the page is > > > mapped in the kernel page table, and if so, how many times. > > > > I mean, even though I'm not a huge fan of kernel pointer hashing etc. this > > is obviously leaking as much information as you might want about kernel > > internal state to the point of maybe making the whole kernel pointer > > hashing thing moot. > > > > I know this requires CAP_SYS_ADMIN, but there are things that also require > > that which _still_ obscure kernel pointers. > > > > And you're outputting it all to dmesg. > > > > So yeah, a security person (Jann?) would be better placed to comment on > > this than me, but are we sure we want to do this when not in a > > CONFIG_DEBUG_VM* kernel? > > I guess there are two parts to this - what root is allowed to do, and > what information we're fine with exposing to dmesg. > > If the lockdown LSM is not set to LOCKDOWN_CONFIDENTIALITY_MAX, the > kernel allows root to read kernel memory through some interfaces - in > particular, BPF allows reading arbitrary kernel memory, and perf > allows reading at least some stuff (like kernel register states). With > lockdown in the most restrictive mode, the kernel tries to prevent > root from reading arbitrary kernel memory, but we don't really change > how much information goes into dmesg. (And I imagine you could > probably still get kernel pointers out of BPF somehow even in the most > restrictive lockdown mode, but that's probably not relevant.) > > The main issue with dmesg is that some systems make its contents > available to code that is not running with root privileges; and I > think it is also sometimes stored persistently in unencrypted form > (like in EFI pstore) even when everything else on the system is > encrypted. > So on one hand, we definitely shouldn't print the contents of random > chunks of memory into dmesg without a good reason; on the other hand, > for example we do already print kernel register state on WARN() (which > often includes kernel pointers and could theoretically include more > sensitive data too). > > So I think showing page metadata to root when requested is probably > okay as a tradeoff? And dumping that data into dmesg is maybe not > great, but acceptable as long as only root can actually trigger this? > > I don't really have a strong opinion on this... > > > To me, a bigger issue is that dump_page() looks like it might be racy, > which is maybe not terrible in debugging code that only runs when > something has already gone wrong, but bad if it is in code that root > can trigger on demand? Hi Jann, thank you for reviewing this proposal. Presumably, the interface should be used only when something has gone wrong but has not been noticed by the kernel. That something is usually checksums failures that are outside of the kernel: i.e. during live migration, snapshotting, filesystem journaling, etc. We already have interfaces that provide data from the live kernel that could be racy, i.e. crash utility. > __dump_page() copies the given page with > memcpy(), which I don't think guarantees enough atomicity with > concurrent updates of page->mapping or such, so dump_mapping() could > probably run on a bogus pointer. Even without torn pointers, I think > there could be a UAF if the page's mapping is destroyed while we're > going through dump_page(), since the page might not be locked. And in > dump_mapping(), the strncpy_from_kernel_nofault() also doesn't guard > against concurrent renaming of the dentry, which I think again would > probably result in UAF. Since we are holding a reference on the page at the time of dump_page(), the identity of the page should not really change, but dentry can be renamed. > So I think dump_page() in its current form is not something we should > expose to a userspace-reachable API. We use dump_page() all over WARN_ONs in MM code where pages might not be locked, but this is a good point, that while even the existing usage might be racy, providing a user-reachable API potentially makes it worse. I will see if I could add some locking before dump_page(), or make a dump_page variant that does not do dump_mapping().
On Mon, Nov 18, 2024 at 11:24 PM Pasha Tatashin <pasha.tatashin@soleen.com> wrote: > On Mon, Nov 18, 2024 at 7:54 AM Jann Horn <jannh@google.com> wrote: > > > > On Mon, Nov 18, 2024 at 12:17 PM Lorenzo Stoakes > > <lorenzo.stoakes@oracle.com> wrote: > > > On Sat, Nov 16, 2024 at 05:59:16PM +0000, Pasha Tatashin wrote: > > > > It operates through the Linux debugfs interface, with two files: "virt" > > > > and "phys". > > > > > > > > The "virt" file takes a virtual address and PID and outputs information > > > > about the corresponding page. > > > > > > > > The "phys" file takes a physical address and outputs information about > > > > that page. > > > > > > > > The output is presented via kernel log messages (can be accessed with > > > > dmesg), and includes information such as the page's reference count, > > > > mapping, flags, and memory cgroup. It also shows whether the page is > > > > mapped in the kernel page table, and if so, how many times. > > > > > > I mean, even though I'm not a huge fan of kernel pointer hashing etc. this > > > is obviously leaking as much information as you might want about kernel > > > internal state to the point of maybe making the whole kernel pointer > > > hashing thing moot. > > > > > > I know this requires CAP_SYS_ADMIN, but there are things that also require > > > that which _still_ obscure kernel pointers. > > > > > > And you're outputting it all to dmesg. > > > > > > So yeah, a security person (Jann?) would be better placed to comment on > > > this than me, but are we sure we want to do this when not in a > > > CONFIG_DEBUG_VM* kernel? > > > > I guess there are two parts to this - what root is allowed to do, and > > what information we're fine with exposing to dmesg. > > > > If the lockdown LSM is not set to LOCKDOWN_CONFIDENTIALITY_MAX, the > > kernel allows root to read kernel memory through some interfaces - in > > particular, BPF allows reading arbitrary kernel memory, and perf > > allows reading at least some stuff (like kernel register states). With > > lockdown in the most restrictive mode, the kernel tries to prevent > > root from reading arbitrary kernel memory, but we don't really change > > how much information goes into dmesg. (And I imagine you could > > probably still get kernel pointers out of BPF somehow even in the most > > restrictive lockdown mode, but that's probably not relevant.) > > > > The main issue with dmesg is that some systems make its contents > > available to code that is not running with root privileges; and I > > think it is also sometimes stored persistently in unencrypted form > > (like in EFI pstore) even when everything else on the system is > > encrypted. > > So on one hand, we definitely shouldn't print the contents of random > > chunks of memory into dmesg without a good reason; on the other hand, > > for example we do already print kernel register state on WARN() (which > > often includes kernel pointers and could theoretically include more > > sensitive data too). > > > > So I think showing page metadata to root when requested is probably > > okay as a tradeoff? And dumping that data into dmesg is maybe not > > great, but acceptable as long as only root can actually trigger this? > > > > I don't really have a strong opinion on this... > > > > > > To me, a bigger issue is that dump_page() looks like it might be racy, > > which is maybe not terrible in debugging code that only runs when > > something has already gone wrong, but bad if it is in code that root > > can trigger on demand? > > Hi Jann, thank you for reviewing this proposal. > > Presumably, the interface should be used only when something has gone > wrong but has not been noticed by the kernel. That something is > usually checksums failures that are outside of the kernel: i.e. during > live migration, snapshotting, filesystem journaling, etc. We already > have interfaces that provide data from the live kernel that could be > racy, i.e. crash utility. Ah, yes, I'm drawing a distinction here between "something has gone wrong internally in the kernel and the kernel does some kinda-broken best-effort self-diagnostics" and "userspace thinks something is broken and asks the kernel". > > __dump_page() copies the given page with > > memcpy(), which I don't think guarantees enough atomicity with > > concurrent updates of page->mapping or such, so dump_mapping() could > > probably run on a bogus pointer. Even without torn pointers, I think > > there could be a UAF if the page's mapping is destroyed while we're > > going through dump_page(), since the page might not be locked. And in > > dump_mapping(), the strncpy_from_kernel_nofault() also doesn't guard > > against concurrent renaming of the dentry, which I think again would > > probably result in UAF. > > Since we are holding a reference on the page at the time of > dump_page(), the identity of the page should not really change, but > dentry can be renamed. Can you point me to where a refcounted reference to the page comes from when page_detective_metadata() calls dump_page_lvl()? > > So I think dump_page() in its current form is not something we should > > expose to a userspace-reachable API. > > We use dump_page() all over WARN_ONs in MM code where pages might not > be locked, but this is a good point, that while even the existing > usage might be racy, providing a user-reachable API potentially makes > it worse. I will see if I could add some locking before dump_page(), > or make a dump_page variant that does not do dump_mapping(). To be clear, I am not that strongly opposed to racily reading data such that the data may not be internally consistent or such; but this is a case of racy use-after-free reads that might end up dumping entirely unrelated memory contents into dmesg. I think we should properly protect against that in an API that userspace can invoke. Otherwise, if we race, we might end up writing random memory contents into dmesg; and if we are particularly unlucky, those random memory contents could be PII or authentication tokens or such. I'm not entirely sure what the right approach is here; I guess it makes sense that when the kernel internally detects corruption, dump_page doesn't take references on pages it accesses to avoid corrupting things further. If you are looking at a page based on a userspace request, I guess you could access the page with the necessary locking to access its properties under the normal locking rules? (If anyone else has opinions either way on this line I'm trying to draw between kernel-internal debug paths and userspace-triggerable debugging, feel free to share; I hope my mental model makes sense but I could imagine other folks having a different model of this?)
On Mon, Nov 18, 2024 at 05:08:42PM -0500, Pasha Tatashin wrote: > Additionally, using crash/drgn is not feasible for us at this time, it > requires keeping external tools on our hosts, also it requires > approval and a security review for each script before deployment in > our fleet. So it's ok to add a totally insecure kernel feature to your fleet instead? You might want to reconsider that policy decision :) good luck! greg k-h
> Can you point me to where a refcounted reference to the page comes > from when page_detective_metadata() calls dump_page_lvl()? I am sorry, I remembered incorrectly, we are getting reference right after dump_page_lvl() in page_detective_memcg() -> folio_try_get(); I will move the folio_try_get() to before dump_page_lvl(). > > > So I think dump_page() in its current form is not something we should > > > expose to a userspace-reachable API. > > > > We use dump_page() all over WARN_ONs in MM code where pages might not > > be locked, but this is a good point, that while even the existing > > usage might be racy, providing a user-reachable API potentially makes > > it worse. I will see if I could add some locking before dump_page(), > > or make a dump_page variant that does not do dump_mapping(). > > To be clear, I am not that strongly opposed to racily reading data > such that the data may not be internally consistent or such; but this > is a case of racy use-after-free reads that might end up dumping > entirely unrelated memory contents into dmesg. I think we should > properly protect against that in an API that userspace can invoke. > Otherwise, if we race, we might end up writing random memory contents > into dmesg; and if we are particularly unlucky, those random memory > contents could be PII or authentication tokens or such. > > I'm not entirely sure what the right approach is here; I guess it > makes sense that when the kernel internally detects corruption, > dump_page doesn't take references on pages it accesses to avoid > corrupting things further. If you are looking at a page based on a > userspace request, I guess you could access the page with the > necessary locking to access its properties under the normal locking > rules? I will take reference, as we already do that for memcg purpose, but have not included dump_page(). Thank you, Pasha
On Tue, Nov 19, 2024 at 2:30 AM Pasha Tatashin <pasha.tatashin@soleen.com> wrote: > > Can you point me to where a refcounted reference to the page comes > > from when page_detective_metadata() calls dump_page_lvl()? > > I am sorry, I remembered incorrectly, we are getting reference right > after dump_page_lvl() in page_detective_memcg() -> folio_try_get(); I > will move the folio_try_get() to before dump_page_lvl(). > > > > > So I think dump_page() in its current form is not something we should > > > > expose to a userspace-reachable API. > > > > > > We use dump_page() all over WARN_ONs in MM code where pages might not > > > be locked, but this is a good point, that while even the existing > > > usage might be racy, providing a user-reachable API potentially makes > > > it worse. I will see if I could add some locking before dump_page(), > > > or make a dump_page variant that does not do dump_mapping(). > > > > To be clear, I am not that strongly opposed to racily reading data > > such that the data may not be internally consistent or such; but this > > is a case of racy use-after-free reads that might end up dumping > > entirely unrelated memory contents into dmesg. I think we should > > properly protect against that in an API that userspace can invoke. > > Otherwise, if we race, we might end up writing random memory contents > > into dmesg; and if we are particularly unlucky, those random memory > > contents could be PII or authentication tokens or such. > > > > I'm not entirely sure what the right approach is here; I guess it > > makes sense that when the kernel internally detects corruption, > > dump_page doesn't take references on pages it accesses to avoid > > corrupting things further. If you are looking at a page based on a > > userspace request, I guess you could access the page with the > > necessary locking to access its properties under the normal locking > > rules? > > I will take reference, as we already do that for memcg purpose, but > have not included dump_page(). Note that taking a reference on the page does not make all of dump_page() fine; in particular, my understanding is that folio_mapping() requires that the page is locked in order to return a stable pointer, and some of the code in dump_mapping() would probably also require some other locks - probably at least on the inode and maybe also on the dentry, I think? Otherwise the inode's dentry list can probably change concurrently, and the dentry's name pointer can change too.
On Mon, Nov 18, 2024 at 8:09 PM Greg KH <gregkh@linuxfoundation.org> wrote: > > On Mon, Nov 18, 2024 at 05:08:42PM -0500, Pasha Tatashin wrote: > > Additionally, using crash/drgn is not feasible for us at this time, it > > requires keeping external tools on our hosts, also it requires > > approval and a security review for each script before deployment in > > our fleet. > > So it's ok to add a totally insecure kernel feature to your fleet > instead? You might want to reconsider that policy decision :) Hi Greg, While some risk is inherent, we believe the potential for abuse here is limited, especially given the existing CAP_SYS_ADMIN requirement. But, even with root access compromised, this tool presents a smaller attack surface than alternatives like crash/drgn. It exposes less sensitive information, unlike crash/drgn, which could potentially allow reading all of kernel memory. Pasha
On Tue, Nov 19, 2024 at 7:52 AM Jann Horn <jannh@google.com> wrote: > > On Tue, Nov 19, 2024 at 2:30 AM Pasha Tatashin > <pasha.tatashin@soleen.com> wrote: > > > Can you point me to where a refcounted reference to the page comes > > > from when page_detective_metadata() calls dump_page_lvl()? > > > > I am sorry, I remembered incorrectly, we are getting reference right > > after dump_page_lvl() in page_detective_memcg() -> folio_try_get(); I > > will move the folio_try_get() to before dump_page_lvl(). > > > > > > > So I think dump_page() in its current form is not something we should > > > > > expose to a userspace-reachable API. > > > > > > > > We use dump_page() all over WARN_ONs in MM code where pages might not > > > > be locked, but this is a good point, that while even the existing > > > > usage might be racy, providing a user-reachable API potentially makes > > > > it worse. I will see if I could add some locking before dump_page(), > > > > or make a dump_page variant that does not do dump_mapping(). > > > > > > To be clear, I am not that strongly opposed to racily reading data > > > such that the data may not be internally consistent or such; but this > > > is a case of racy use-after-free reads that might end up dumping > > > entirely unrelated memory contents into dmesg. I think we should > > > properly protect against that in an API that userspace can invoke. > > > Otherwise, if we race, we might end up writing random memory contents > > > into dmesg; and if we are particularly unlucky, those random memory > > > contents could be PII or authentication tokens or such. > > > > > > I'm not entirely sure what the right approach is here; I guess it > > > makes sense that when the kernel internally detects corruption, > > > dump_page doesn't take references on pages it accesses to avoid > > > corrupting things further. If you are looking at a page based on a > > > userspace request, I guess you could access the page with the > > > necessary locking to access its properties under the normal locking > > > rules? > > > > I will take reference, as we already do that for memcg purpose, but > > have not included dump_page(). > > Note that taking a reference on the page does not make all of > dump_page() fine; in particular, my understanding is that > folio_mapping() requires that the page is locked in order to return a > stable pointer, and some of the code in dump_mapping() would probably > also require some other locks - probably at least on the inode and > maybe also on the dentry, I think? Otherwise the inode's dentry list > can probably change concurrently, and the dentry's name pointer can > change too. Agreed, once reference is taken, the page identity cannot change (i.e. if it is a named page it will stay a named page), but dentry can be renamed. I will look into what can be done to guarantee consistency in the next version. There is also a fallback if locking cannot be reliably resolved (i.e. for performance reasons) where we can make dump_mapping() optionally disabled from dump_page_lvl() with a new argument flag. Thank you, Pasha
On Tue, Nov 19, 2024 at 4:14 PM Pasha Tatashin <pasha.tatashin@soleen.com> wrote: > On Tue, Nov 19, 2024 at 7:52 AM Jann Horn <jannh@google.com> wrote: > > On Tue, Nov 19, 2024 at 2:30 AM Pasha Tatashin > > <pasha.tatashin@soleen.com> wrote: > > > > Can you point me to where a refcounted reference to the page comes > > > > from when page_detective_metadata() calls dump_page_lvl()? > > > > > > I am sorry, I remembered incorrectly, we are getting reference right > > > after dump_page_lvl() in page_detective_memcg() -> folio_try_get(); I > > > will move the folio_try_get() to before dump_page_lvl(). > > > > > > > > > So I think dump_page() in its current form is not something we should > > > > > > expose to a userspace-reachable API. > > > > > > > > > > We use dump_page() all over WARN_ONs in MM code where pages might not > > > > > be locked, but this is a good point, that while even the existing > > > > > usage might be racy, providing a user-reachable API potentially makes > > > > > it worse. I will see if I could add some locking before dump_page(), > > > > > or make a dump_page variant that does not do dump_mapping(). > > > > > > > > To be clear, I am not that strongly opposed to racily reading data > > > > such that the data may not be internally consistent or such; but this > > > > is a case of racy use-after-free reads that might end up dumping > > > > entirely unrelated memory contents into dmesg. I think we should > > > > properly protect against that in an API that userspace can invoke. > > > > Otherwise, if we race, we might end up writing random memory contents > > > > into dmesg; and if we are particularly unlucky, those random memory > > > > contents could be PII or authentication tokens or such. > > > > > > > > I'm not entirely sure what the right approach is here; I guess it > > > > makes sense that when the kernel internally detects corruption, > > > > dump_page doesn't take references on pages it accesses to avoid > > > > corrupting things further. If you are looking at a page based on a > > > > userspace request, I guess you could access the page with the > > > > necessary locking to access its properties under the normal locking > > > > rules? > > > > > > I will take reference, as we already do that for memcg purpose, but > > > have not included dump_page(). > > > > Note that taking a reference on the page does not make all of > > dump_page() fine; in particular, my understanding is that > > folio_mapping() requires that the page is locked in order to return a > > stable pointer, and some of the code in dump_mapping() would probably > > also require some other locks - probably at least on the inode and > > maybe also on the dentry, I think? Otherwise the inode's dentry list > > can probably change concurrently, and the dentry's name pointer can > > change too. > > Agreed, once reference is taken, the page identity cannot change (i.e. > if it is a named page it will stay a named page), but dentry can be > renamed. I will look into what can be done to guarantee consistency in > the next version. There is also a fallback if locking cannot be > reliably resolved (i.e. for performance reasons) where we can make > dump_mapping() optionally disabled from dump_page_lvl() with a new > argument flag. Yeah, I think if you don't need the details that dump_mapping() shows, skipping that for user-requested dumps might be a reasonable option.
On Tue, Nov 19, 2024 at 10:08:36AM -0500, Pasha Tatashin wrote: > On Mon, Nov 18, 2024 at 8:09 PM Greg KH <gregkh@linuxfoundation.org> wrote: > > > > On Mon, Nov 18, 2024 at 05:08:42PM -0500, Pasha Tatashin wrote: > > > Additionally, using crash/drgn is not feasible for us at this time, it > > > requires keeping external tools on our hosts, also it requires > > > approval and a security review for each script before deployment in > > > our fleet. > > > > So it's ok to add a totally insecure kernel feature to your fleet > > instead? You might want to reconsider that policy decision :) > > Hi Greg, > > While some risk is inherent, we believe the potential for abuse here > is limited, especially given the existing CAP_SYS_ADMIN requirement. > But, even with root access compromised, this tool presents a smaller > attack surface than alternatives like crash/drgn. It exposes less > sensitive information, unlike crash/drgn, which could potentially > allow reading all of kernel memory. The problem here is with using dmesg for output. No security-sensitive information should go there. Even exposing raw kernel pointers is not considered safe. I'm also not sure about what presents a bigger attack surface. Yes, drgn allows to read more, but it's using /proc/kcore, so the in-kernel code is much simpler. But I don't think it's a relevant discussion, if a malicious user has a root access, there are better options than both drgn and page detective.
On Tue, Nov 19, 2024 at 01:52:00PM +0100, Jann Horn wrote: > > I will take reference, as we already do that for memcg purpose, but > > have not included dump_page(). > > Note that taking a reference on the page does not make all of > dump_page() fine; in particular, my understanding is that > folio_mapping() requires that the page is locked in order to return a > stable pointer, and some of the code in dump_mapping() would probably > also require some other locks - probably at least on the inode and > maybe also on the dentry, I think? Otherwise the inode's dentry list > can probably change concurrently, and the dentry's name pointer can > change too. First important thing is that we snapshot the page. So while we may have a torn snapshot of the page, it can't change under us any more, so we don't have to worry about it being swizzled one way and then swizzled back. Second thing is that I think using folio_mapping() is actually wrong. We don't want the swap mapping if it's an anon page that's in the swapcache. We'd be fine just doing mapping = folio->mapping (we'd need to add a check for movable, but I think that's fine). Anyway, we know the folio isn't ksm or anon at the point that we call dump_mapping() because there's a chain of "else" statements. So I think we're fine because we can't switch between anon & file while holding a refcount. Having a refcount on the folio will prevent the folio from being allocated to anything else again. It will not protect the mapping from being torn down (the folio can be truncated from the mapping, then the mapping can be freed, and the memory reused). As you say, the dentry can be renamed as well. This patch series makes me nervous. I'd rather see it done as a bpf script or drgn script, but if it is going to be done in C, I'd really like to see more auditing of the safety here. It feels like the kind of hack that one deploys internally to debug a hard-to-hit condition, rather than the kind of code that we like to ship upstream.
On Tue, Nov 19, 2024 at 1:23 PM Roman Gushchin <roman.gushchin@linux.dev> wrote: > > On Tue, Nov 19, 2024 at 10:08:36AM -0500, Pasha Tatashin wrote: > > On Mon, Nov 18, 2024 at 8:09 PM Greg KH <gregkh@linuxfoundation.org> wrote: > > > > > > On Mon, Nov 18, 2024 at 05:08:42PM -0500, Pasha Tatashin wrote: > > > > Additionally, using crash/drgn is not feasible for us at this time, it > > > > requires keeping external tools on our hosts, also it requires > > > > approval and a security review for each script before deployment in > > > > our fleet. > > > > > > So it's ok to add a totally insecure kernel feature to your fleet > > > instead? You might want to reconsider that policy decision :) > > > > Hi Greg, > > > > While some risk is inherent, we believe the potential for abuse here > > is limited, especially given the existing CAP_SYS_ADMIN requirement. > > But, even with root access compromised, this tool presents a smaller > > attack surface than alternatives like crash/drgn. It exposes less > > sensitive information, unlike crash/drgn, which could potentially > > allow reading all of kernel memory. > > The problem here is with using dmesg for output. No security-sensitive > information should go there. Even exposing raw kernel pointers is not > considered safe. I am OK in writing the output to a debugfs file in the next version, the only concern I have is that implies that dump_page() would need to be basically duplicated, as it now outputs everything via printk's.
On Tue, Nov 19, 2024 at 11:30 AM Pasha Tatashin <pasha.tatashin@soleen.com> wrote: > > On Tue, Nov 19, 2024 at 1:23 PM Roman Gushchin <roman.gushchin@linux.dev> wrote: > > > > On Tue, Nov 19, 2024 at 10:08:36AM -0500, Pasha Tatashin wrote: > > > On Mon, Nov 18, 2024 at 8:09 PM Greg KH <gregkh@linuxfoundation.org> wrote: > > > > > > > > On Mon, Nov 18, 2024 at 05:08:42PM -0500, Pasha Tatashin wrote: > > > > > Additionally, using crash/drgn is not feasible for us at this time, it > > > > > requires keeping external tools on our hosts, also it requires > > > > > approval and a security review for each script before deployment in > > > > > our fleet. > > > > > > > > So it's ok to add a totally insecure kernel feature to your fleet > > > > instead? You might want to reconsider that policy decision :) > > > > > > Hi Greg, > > > > > > While some risk is inherent, we believe the potential for abuse here > > > is limited, especially given the existing CAP_SYS_ADMIN requirement. > > > But, even with root access compromised, this tool presents a smaller > > > attack surface than alternatives like crash/drgn. It exposes less > > > sensitive information, unlike crash/drgn, which could potentially > > > allow reading all of kernel memory. > > > > The problem here is with using dmesg for output. No security-sensitive > > information should go there. Even exposing raw kernel pointers is not > > considered safe. > > I am OK in writing the output to a debugfs file in the next version, > the only concern I have is that implies that dump_page() would need to > be basically duplicated, as it now outputs everything via printk's. Perhaps you can refactor the code in dump_page() to use a seq_buf, then have dump_page() printk that seq_buf using seq_buf_do_printk(), and have page detective output that seq_buf to the debugfs file? We do something very similar with memory_stat_format(). We use the same function to generate the memcg stats in a seq_buf, then we use that seq_buf to output the stats to memory.stat as well as the OOM log.
On Tue, Nov 19, 2024 at 11:35:47AM -0800, Yosry Ahmed wrote: > On Tue, Nov 19, 2024 at 11:30 AM Pasha Tatashin > <pasha.tatashin@soleen.com> wrote: > > > > On Tue, Nov 19, 2024 at 1:23 PM Roman Gushchin <roman.gushchin@linux.dev> wrote: > > > > > > On Tue, Nov 19, 2024 at 10:08:36AM -0500, Pasha Tatashin wrote: > > > > On Mon, Nov 18, 2024 at 8:09 PM Greg KH <gregkh@linuxfoundation.org> wrote: > > > > > > > > > > On Mon, Nov 18, 2024 at 05:08:42PM -0500, Pasha Tatashin wrote: > > > > > > Additionally, using crash/drgn is not feasible for us at this time, it > > > > > > requires keeping external tools on our hosts, also it requires > > > > > > approval and a security review for each script before deployment in > > > > > > our fleet. > > > > > > > > > > So it's ok to add a totally insecure kernel feature to your fleet > > > > > instead? You might want to reconsider that policy decision :) > > > > > > > > Hi Greg, > > > > > > > > While some risk is inherent, we believe the potential for abuse here > > > > is limited, especially given the existing CAP_SYS_ADMIN requirement. > > > > But, even with root access compromised, this tool presents a smaller > > > > attack surface than alternatives like crash/drgn. It exposes less > > > > sensitive information, unlike crash/drgn, which could potentially > > > > allow reading all of kernel memory. > > > > > > The problem here is with using dmesg for output. No security-sensitive > > > information should go there. Even exposing raw kernel pointers is not > > > considered safe. > > > > I am OK in writing the output to a debugfs file in the next version, > > the only concern I have is that implies that dump_page() would need to > > be basically duplicated, as it now outputs everything via printk's. > > Perhaps you can refactor the code in dump_page() to use a seq_buf, > then have dump_page() printk that seq_buf using seq_buf_do_printk(), > and have page detective output that seq_buf to the debugfs file? > > We do something very similar with memory_stat_format(). We use the > same function to generate the memcg stats in a seq_buf, then we use > that seq_buf to output the stats to memory.stat as well as the OOM > log. +1 Thanks!
Pasha Tatashin <pasha.tatashin@soleen.com> writes: > Page Detective is a new kernel debugging tool that provides detailed > information about the usage and mapping of physical memory pages. > > It is often known that a particular page is corrupted, but it is hard to > extract more information about such a page from live system. Examples > are: > > - Checksum failure during live migration > - Filesystem journal failure > - dump_page warnings on the console log > - Unexcpected segfaults > > Page Detective helps to extract more information from the kernel, so it > can be used by developers to root cause the associated problem. > > It operates through the Linux debugfs interface, with two files: "virt" > and "phys". > > The "virt" file takes a virtual address and PID and outputs information > about the corresponding page. > > The "phys" file takes a physical address and outputs information about > that page. > > The output is presented via kernel log messages (can be accessed with > dmesg), and includes information such as the page's reference count, > mapping, flags, and memory cgroup. It also shows whether the page is > mapped in the kernel page table, and if so, how many times. A lot of all that is already covered in /proc/kpage{flags,cgroup,count) Also we already have /proc/pid/pagemap to resolve virtual addresses. At a minimum you need to discuss why these existing mechanisms are not suitable for you and how your new one is better. If something particular is missing perhaps the existing mechanisms can be extended? Outputting in the dmesg seems rather clumpsy for a production mechanism. I personally would just use live crash or live gdb on /proc/kcore to get extra information, although I can see that might have races. -Andi
On Tue, Nov 19, 2024 at 2:36 PM Yosry Ahmed <yosryahmed@google.com> wrote: > > On Tue, Nov 19, 2024 at 11:30 AM Pasha Tatashin > <pasha.tatashin@soleen.com> wrote: > > > > On Tue, Nov 19, 2024 at 1:23 PM Roman Gushchin <roman.gushchin@linux.dev> wrote: > > > > > > On Tue, Nov 19, 2024 at 10:08:36AM -0500, Pasha Tatashin wrote: > > > > On Mon, Nov 18, 2024 at 8:09 PM Greg KH <gregkh@linuxfoundation.org> wrote: > > > > > > > > > > On Mon, Nov 18, 2024 at 05:08:42PM -0500, Pasha Tatashin wrote: > > > > > > Additionally, using crash/drgn is not feasible for us at this time, it > > > > > > requires keeping external tools on our hosts, also it requires > > > > > > approval and a security review for each script before deployment in > > > > > > our fleet. > > > > > > > > > > So it's ok to add a totally insecure kernel feature to your fleet > > > > > instead? You might want to reconsider that policy decision :) > > > > > > > > Hi Greg, > > > > > > > > While some risk is inherent, we believe the potential for abuse here > > > > is limited, especially given the existing CAP_SYS_ADMIN requirement. > > > > But, even with root access compromised, this tool presents a smaller > > > > attack surface than alternatives like crash/drgn. It exposes less > > > > sensitive information, unlike crash/drgn, which could potentially > > > > allow reading all of kernel memory. > > > > > > The problem here is with using dmesg for output. No security-sensitive > > > information should go there. Even exposing raw kernel pointers is not > > > considered safe. > > > > I am OK in writing the output to a debugfs file in the next version, > > the only concern I have is that implies that dump_page() would need to > > be basically duplicated, as it now outputs everything via printk's. > > Perhaps you can refactor the code in dump_page() to use a seq_buf, > then have dump_page() printk that seq_buf using seq_buf_do_printk(), > and have page detective output that seq_buf to the debugfs file? Good idea, I will look into modifying it this way. > We do something very similar with memory_stat_format(). We use the void mem_cgroup_print_oom_meminfo(struct mem_cgroup *memcg) { /* Use static buffer, for the caller is holding oom_lock. */ static char buf[PAGE_SIZE]; .... seq_buf_init(&s, buf, sizeof(buf)); memory_stat_format(memcg, &s); seq_buf_do_printk(&s, KERN_INFO); } This is a callosal stack allocation, given that our fleet only has 8K stacks. :-) > same function to generate the memcg stats in a seq_buf, then we use > that seq_buf to output the stats to memory.stat as well as the OOM > log.
On Wed, Nov 20, 2024 at 10:29 AM Andi Kleen <ak@linux.intel.com> wrote: > > Pasha Tatashin <pasha.tatashin@soleen.com> writes: > > > Page Detective is a new kernel debugging tool that provides detailed > > information about the usage and mapping of physical memory pages. > > > > It is often known that a particular page is corrupted, but it is hard to > > extract more information about such a page from live system. Examples > > are: > > > > - Checksum failure during live migration > > - Filesystem journal failure > > - dump_page warnings on the console log > > - Unexcpected segfaults > > > > Page Detective helps to extract more information from the kernel, so it > > can be used by developers to root cause the associated problem. > > > > It operates through the Linux debugfs interface, with two files: "virt" > > and "phys". > > > > The "virt" file takes a virtual address and PID and outputs information > > about the corresponding page. > > > > The "phys" file takes a physical address and outputs information about > > that page. > > > > The output is presented via kernel log messages (can be accessed with > > dmesg), and includes information such as the page's reference count, > > mapping, flags, and memory cgroup. It also shows whether the page is > > mapped in the kernel page table, and if so, how many times. > > A lot of all that is already covered in /proc/kpage{flags,cgroup,count) > Also we already have /proc/pid/pagemap to resolve virtual addresses. > > At a minimum you need to discuss why these existing mechanisms are not > suitable for you and how your new one is better. Hi Andi, Thanks for your feedback! I will extend the cover letter in the next version to address your comment about comparing with the existing methods. We periodically receive rare reports of page corruptions detected through various methods (journaling, live migrations, crashes, etc.) from userland. To effectively root cause these corruptions, we need to automatically and quickly gather comprehensive data about the affected pages from the kernel. This includes: - Obtain all metadata associated with a page. - Quickly identify all user processes mapping a given page. - Determine if and where the kernel maps the page, which is also important given the opportunity to remove guest memory from the kernel direct map (as discussed at LPC'24). We also plan to extend this functionality to include KVM and IOMMU page tables in the future. <pagemap> provides an interface to traversing through user page tables, but the other information cannot be extracted using the existing interfaces. To ensure data integrity, even when dealing with potential memory corruptions, Page Detective minimizes reliance on kernel data structures. Instead, it leverages direct access to hardware structures like page tables, providing a more reliable view of page mappings. > If something particular is missing perhaps the existing mechanisms > can be extended? > Outputting in the dmesg seems rather clumpsy for a production mechanism. I am going to change the output to a file in the next version. > I personally would just use live crash or live gdb on /proc/kcore to get > extra information, although I can see that might have races. For security reasons crash is currently not available on our production fleet machines as it potentially provides access to all kernel memory. Thank you, Pasha
On Wed, Nov 20, 2024 at 8:14 AM Pasha Tatashin <pasha.tatashin@soleen.com> wrote: > > On Tue, Nov 19, 2024 at 2:36 PM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > On Tue, Nov 19, 2024 at 11:30 AM Pasha Tatashin > > <pasha.tatashin@soleen.com> wrote: > > > > > > On Tue, Nov 19, 2024 at 1:23 PM Roman Gushchin <roman.gushchin@linux.dev> wrote: > > > > > > > > On Tue, Nov 19, 2024 at 10:08:36AM -0500, Pasha Tatashin wrote: > > > > > On Mon, Nov 18, 2024 at 8:09 PM Greg KH <gregkh@linuxfoundation.org> wrote: > > > > > > > > > > > > On Mon, Nov 18, 2024 at 05:08:42PM -0500, Pasha Tatashin wrote: > > > > > > > Additionally, using crash/drgn is not feasible for us at this time, it > > > > > > > requires keeping external tools on our hosts, also it requires > > > > > > > approval and a security review for each script before deployment in > > > > > > > our fleet. > > > > > > > > > > > > So it's ok to add a totally insecure kernel feature to your fleet > > > > > > instead? You might want to reconsider that policy decision :) > > > > > > > > > > Hi Greg, > > > > > > > > > > While some risk is inherent, we believe the potential for abuse here > > > > > is limited, especially given the existing CAP_SYS_ADMIN requirement. > > > > > But, even with root access compromised, this tool presents a smaller > > > > > attack surface than alternatives like crash/drgn. It exposes less > > > > > sensitive information, unlike crash/drgn, which could potentially > > > > > allow reading all of kernel memory. > > > > > > > > The problem here is with using dmesg for output. No security-sensitive > > > > information should go there. Even exposing raw kernel pointers is not > > > > considered safe. > > > > > > I am OK in writing the output to a debugfs file in the next version, > > > the only concern I have is that implies that dump_page() would need to > > > be basically duplicated, as it now outputs everything via printk's. > > > > Perhaps you can refactor the code in dump_page() to use a seq_buf, > > then have dump_page() printk that seq_buf using seq_buf_do_printk(), > > and have page detective output that seq_buf to the debugfs file? > > Good idea, I will look into modifying it this way. > > > We do something very similar with memory_stat_format(). We use the > > void mem_cgroup_print_oom_meminfo(struct mem_cgroup *memcg) > { > /* Use static buffer, for the caller is holding oom_lock. */ > static char buf[PAGE_SIZE]; > .... > seq_buf_init(&s, buf, sizeof(buf)); > memory_stat_format(memcg, &s); > seq_buf_do_printk(&s, KERN_INFO); > } > > This is a callosal stack allocation, given that our fleet only has 8K > stacks. :-) That's a static allocation though :)
> > /* Use static buffer, for the caller is holding oom_lock. */ > > static char buf[PAGE_SIZE]; > > .... > > seq_buf_init(&s, buf, sizeof(buf)); > > memory_stat_format(memcg, &s); > > seq_buf_do_printk(&s, KERN_INFO); > > } > > > > This is a callosal stack allocation, given that our fleet only has 8K > > stacks. :-) > > That's a static allocation though :) Ah right, did not notice it was static (and ignored the comment) Pasha
> - Quickly identify all user processes mapping a given page. Can be done with /proc/*/pagemap today. Maybe it's not "quick" because it won't use the rmap chains, but is that a serious issue? > - Determine if and where the kernel maps the page, which is also > important given the opportunity to remove guest memory from the kernel > direct map (as discussed at LPC'24). At least x86 already has a kernel page table dumper in debugfs that can be used for this. The value of a second redundant one seems low. > We also plan to extend this functionality to include KVM and IOMMU > page tables in the future. Yes dumpers for those would likely be useful. (at least for the case when one hand is tied behind your back by security policies forbidding /proc/kcore access) > <pagemap> provides an interface to traversing through user page > tables, but the other information cannot be extracted using the > existing interfaces. Like what? You mean the reference counts? /proc/k* doesn't have any reference counts, and no space for full counts, but I suspect usually all you need to know is a few states like (>1, 1, 0, maybe negative) which could be mapped to a few spare kpageflags bits. That said I thought Willy wanted to move a lot of these elsewhere anyways with the folio revolution, so it might be a short lived interface anyways. -Andi