Message ID: 20190807171559.182301-1-joel@joelfernandes.org (mailing list archive)
State: New, archived
Series: [v5,1/6] mm/page_idle: Add per-pid idle page tracking using virtual index
On Wed, 7 Aug 2019 13:15:54 -0400 "Joel Fernandes (Google)" <joel@joelfernandes.org> wrote:

> In Android, we are using this for the heap profiler (heapprofd) which
> profiles and pin points code paths which allocates and leaves memory
> idle for long periods of time. This method solves the security issue
> with userspace learning the PFN, and while at it is also shown to yield
> better results than the pagemap lookup, the theory being that the window
> where the address space can change is reduced by eliminating the
> intermediate pagemap look up stage. In virtual address indexing, the
> process's mmap_sem is held for the duration of the access.

So is heapprofd a developer-only thing? Is heapprofd included in
end-user android loads? If not then, again, wouldn't it be better to
make the feature Kconfigurable so that Android developers can enable it
during development then disable it for production kernels?
On Wed, Aug 07, 2019 at 01:04:02PM -0700, Andrew Morton wrote:
> On Wed, 7 Aug 2019 13:15:54 -0400 "Joel Fernandes (Google)" <joel@joelfernandes.org> wrote:
> > [...]
>
> So is heapprofd a developer-only thing? Is heapprofd included in
> end-user android loads? If not then, again, wouldn't it be better to
> make the feature Kconfigurable so that Android developers can enable it
> during development then disable it for production kernels?

Almost all of this code is already configurable with
CONFIG_IDLE_PAGE_TRACKING. If you disable it, then all of this code gets
disabled.

Or are you referring to something else that needs to be made configurable?

thanks,

- Joel
On Wed, 7 Aug 2019 16:45:30 -0400 Joel Fernandes <joel@joelfernandes.org> wrote:
> [...]
>
> Almost all of this code is already configurable with
> CONFIG_IDLE_PAGE_TRACKING. If you disable it, then all of this code gets
> disabled.
>
> Or are you referring to something else that needs to be made configurable?

Yes - the 300+ lines of code which this patchset adds!

The impacted people will be those who use the existing
idle-page-tracking feature but who will not use the new feature. I
guess we can assume this set is small...
On Wed, Aug 07, 2019 at 01:58:40PM -0700, Andrew Morton wrote:
> [...]
>
> Yes - the 300+ lines of code which this patchset adds!
>
> The impacted people will be those who use the existing
> idle-page-tracking feature but who will not use the new feature. I
> guess we can assume this set is small...

Yes, I think this set should be small. The code size increase of page_idle.o
is from ~1KB to ~2KB. Most of the extra space is consumed by the
page_idle_proc_generic() function which this patch adds.

I don't think adding another CONFIG option to disable this while keeping
the existing CONFIG_IDLE_PAGE_TRACKING enabled is worthwhile, but I am open
to the addition of such an option if anyone feels strongly about it. I
believe that once this patch is merged, most like this new interface being
added is what will be used more than the old interface (for some of the
usecases), so it makes sense to keep it alive with CONFIG_IDLE_PAGE_TRACKING.

thanks,

- Joel
On Wed, Aug 07, 2019 at 05:31:05PM -0400, Joel Fernandes wrote:
> [...]
> addition of such an option if anyone feels strongly about it. I believe that
> once this patch is merged, most like this new interface being added is what

s/most like/most likely/

> will be used more than the old interface (for some of the usecases) so it
> makes sense to keep it alive with CONFIG_IDLE_PAGE_TRACKING.
On Wed 07-08-19 17:31:05, Joel Fernandes wrote:
> [...]
> I don't think adding another CONFIG option to disable this while keeping
> existing CONFIG_IDLE_PAGE_TRACKING enabled, is worthwhile but I am open to
> the addition of such an option if anyone feels strongly about it. I believe
> that once this patch is merged, most like this new interface being added is
> what will be used more than the old interface (for some of the usecases) so
> it makes sense to keep it alive with CONFIG_IDLE_PAGE_TRACKING.

I would tend to agree with Joel here. The functionality falls into the
existing IDLE_PAGE_TRACKING config option quite nicely. If there really
are users who want to save some space and this is standing in the way
then they can easily add a new config option with some justification so
the savings are clear. Without that, an additional config simply adds to
the already existing configurability complexity and balkanization.
On Thu, Aug 08, 2019 at 10:00:44AM +0200, Michal Hocko wrote:
> [...]
> I would tend to agree with Joel here. The functionality falls into the
> existing IDLE_PAGE_TRACKING config option quite nicely. If there really
> are users who want to save some space and this is standing in the way
> then they can easily add a new config option with some justification so
> the savings are clear. Without that an additional config simply adds to
> the already existing configurability complexity and balkanization.

Michal, Andrew, Minchan,

Would you have any other review comments on the v5 series? This is just a new
interface that does not disrupt existing users of the older page-idle
tracking, so as such it is a safe change (as in, it doesn't change existing
functionality except for the draining bug fix).

thanks,

- Joel
On Wed, Aug 7, 2019 at 7:16 PM Joel Fernandes (Google) <joel@joelfernandes.org> wrote:
> The page_idle tracking feature currently requires looking up the pagemap
> for a process followed by interacting with /sys/kernel/mm/page_idle.
> Looking up PFN from pagemap in Android devices is not supported by
> unprivileged process and requires SYS_ADMIN and gives 0 for the PFN.
>
> This patch adds support to directly interact with page_idle tracking at
> the PID level by introducing a /proc/<pid>/page_idle file. It follows
> the exact same semantics as the global /sys/kernel/mm/page_idle, but now
> looking up PFN through pagemap is not needed since the interface uses
> virtual frame numbers, and at the same time also does not require
> SYS_ADMIN.
>
> In Android, we are using this for the heap profiler (heapprofd) which
> profiles and pin points code paths which allocates and leaves memory
> idle for long periods of time. [...]

What happens when you use this interface on shared pages, like memory
inherited from the zygote, library file mappings and so on? If two
profilers ran concurrently for two different processes that both map
the same libraries, would they end up messing up each other's data?

Can this be used to observe which library pages other processes are
accessing, even if you don't have access to those processes, as long
as you can map the same libraries? I realize that there are already a
bunch of ways to do that with side channels and such; but if you're
adding an interface that allows this by design, it seems to me like
something that should be gated behind some sort of privilege check.
If the heap profiler is only interested in anonymous, process-private
memory, that might be an easy way out? Limit (unprivileged) use of
this interface to pages that aren't shared with any other processes?

> +/* Helper to get the start and end frame given a pos and count */
> +static int page_idle_get_frames(loff_t pos, size_t count, struct mm_struct *mm,
> +				unsigned long *start, unsigned long *end)
> +{
> +	unsigned long max_frame;
> +
> +	/* If an mm is not given, assume we want physical frames */
> +	max_frame = mm ? (mm->task_size >> PAGE_SHIFT) : max_pfn;
> +
> +	if (pos % BITMAP_CHUNK_SIZE || count % BITMAP_CHUNK_SIZE)
> +		return -EINVAL;
> +
> +	*start = pos * BITS_PER_BYTE;
> +	if (*start >= max_frame)
> +		return -ENXIO;
> +
> +	*end = *start + count * BITS_PER_BYTE;
> +	if (*end > max_frame)
> +		*end = max_frame;
> +	return 0;
> +}

You could add some overflow checks for the multiplications. I haven't
seen any place where it actually matters, but it seems unclean; and in
particular, on a 32-bit architecture where the maximum user address is
very high (like with a 4G:4G split), it looks like this function might
theoretically return with `*start > *end`, which could be confusing to
callers.

[...]

> 	for (; pfn < end_pfn; pfn++) {
> 		bit = pfn % BITMAP_CHUNK_BITS;
> 		if (!bit)
> 			*out = 0ULL;
> -		page = page_idle_get_page(pfn);
> -		if (page) {
> -			if (page_is_idle(page)) {
> -				/*
> -				 * The page might have been referenced via a
> -				 * pte, in which case it is not idle. Clear
> -				 * refs and recheck.
> -				 */
> -				page_idle_clear_pte_refs(page);
> -				if (page_is_idle(page))
> -					*out |= 1ULL << bit;
> -			}
> +		page = page_idle_get_page_pfn(pfn);
> +		if (page && page_idle_pte_check(page)) {
> +			*out |= 1ULL << bit;
> 			put_page(page);
> 		}

The `page && !page_idle_pte_check(page)` case looks like it's missing a
put_page(); you probably intended to write something like this?

	page = page_idle_get_page_pfn(pfn);
	if (page) {
		if (page_idle_pte_check(page))
			*out |= 1ULL << bit;
		put_page(page);
	}

[...]

> +/* page_idle tracking for /proc/<pid>/page_idle */
> +
> +struct page_node {
> +	struct page *page;
> +	unsigned long addr;
> +	struct list_head list;
> +};
> +
> +struct page_idle_proc_priv {
> +	unsigned long start_addr;
> +	char *buffer;
> +	int write;
> +
> +	/* Pre-allocate and provide nodes to pte_page_idle_proc_add() */
> +	struct page_node *page_nodes;
> +	int cur_page_node;
> +	struct list_head *idle_page_list;
> +};

A linked list is a weird data structure to use if the list elements are
just consecutive array elements.

> +/*
> + * Add page to list to be set as idle later.
> + */
> +static void pte_page_idle_proc_add(struct page *page,
> +				   unsigned long addr, struct mm_walk *walk)
> +{
> +	struct page *page_get = NULL;
> +	struct page_node *pn;
> +	int bit;
> +	unsigned long frames;
> +	struct page_idle_proc_priv *priv = walk->private;
> +	u64 *chunk = (u64 *)priv->buffer;
> +
> +	if (priv->write) {
> +		VM_BUG_ON(!page);
> +
> +		/* Find whether this page was asked to be marked */
> +		frames = (addr - priv->start_addr) >> PAGE_SHIFT;
> +		bit = frames % BITMAP_CHUNK_BITS;
> +		chunk = &chunk[frames / BITMAP_CHUNK_BITS];
> +		if (((*chunk >> bit) & 1) == 0)
> +			return;

This means that BITMAP_CHUNK_SIZE is UAPI on big-endian systems, right?
My opinion is that it would be slightly nicer to design the UAPI such
that incrementing virtual addresses are mapped to incrementing offsets
in the buffer (iow, either use bytewise access or use little-endian),
but I'm not going to ask you to redesign the UAPI this late.

[...]

> +ssize_t page_idle_proc_generic(struct file *file, char __user *ubuff,
> +			       size_t count, loff_t *pos, int write)
> +{
[...]
> +	down_read(&mm->mmap_sem);
[...]
> +
> +	if (!write && !walk_error)
> +		ret = copy_to_user(ubuff, buffer, count);
> +
> +	up_read(&mm->mmap_sem);

I'd move the up_read() above the copy_to_user(); copy_to_user() can
block, and there's no reason to hold the mmap_sem across
copy_to_user().

Sorry about only chiming in at v5 with all this.
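[Editorial note: the two arithmetic concerns above — the unchecked `pos`/`count` multiplications and the possible `*start > *end` result — can be sketched in a small userspace model of the helper. The names and constants below mirror the patch, but `model_get_frames()` itself, the task-size value, and the clamp-on-overflow strategy are illustrative assumptions, not the patch's actual fix.]

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/*
 * Userspace model of page_idle_get_frames() with the overflow checks
 * Jann suggests. BITMAP_CHUNK_SIZE/BITS and BITS_PER_BYTE mirror the
 * patch; MODEL_PAGE_SHIFT and the task-size argument stand in for
 * PAGE_SHIFT and mm->task_size.
 */
#define MODEL_PAGE_SHIFT   12
#define BITS_PER_BYTE      8
#define BITMAP_CHUNK_SIZE  sizeof(uint64_t)                  /* 8 bytes */
#define BITMAP_CHUNK_BITS  (BITMAP_CHUNK_SIZE * BITS_PER_BYTE)

static int model_get_frames(uint64_t pos, uint64_t count, uint64_t task_size,
			    uint64_t *start, uint64_t *end)
{
	uint64_t max_frame = task_size >> MODEL_PAGE_SHIFT;

	/* Same alignment rule as the patch. */
	if (pos % BITMAP_CHUNK_SIZE || count % BITMAP_CHUNK_SIZE)
		return -EINVAL;

	/* Overflow check: pos * BITS_PER_BYTE must not wrap. */
	if (pos > UINT64_MAX / BITS_PER_BYTE)
		return -EINVAL;
	*start = pos * BITS_PER_BYTE;
	if (*start >= max_frame)
		return -ENXIO;

	/* Clamp instead of wrapping, so *start <= *end always holds. */
	if (count > (UINT64_MAX - *start) / BITS_PER_BYTE)
		*end = max_frame;
	else
		*end = *start + count * BITS_PER_BYTE;
	if (*end > max_frame)
		*end = max_frame;
	return 0;
}

/* Map a virtual address to its u64 chunk index and bit, as the patch does. */
static void addr_to_bit(uint64_t addr, uint64_t start_addr,
			uint64_t *chunk, unsigned int *bit)
{
	uint64_t frames = (addr - start_addr) >> MODEL_PAGE_SHIFT;

	*chunk = frames / BITMAP_CHUNK_BITS;
	*bit = frames % BITMAP_CHUNK_BITS;
}
```

With this shape, a caller can rely on `*start <= *end` on success even for pathological `pos`/`count` pairs, which is the property the review asks for.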
On Mon 12-08-19 10:56:20, Joel Fernandes wrote:
> [...]
> Would you have any other review comments on the v5 series? This is just a new
> interface that does not disrupt existing users of the older page-idle
> tracking, so as such it is a safe change (as in, doesn't change existing
> functionality except for the draining bug fix).

I hope to find some more time to finish the review but let me point out
that "it's new it is regression safe" is not really a great argument for
a new user visible API. If the API is flawed then this is likely going
to kick us later and will be hard to fix. I am still not convinced about
the swap part of the thing TBH.
On Mon 12-08-19 20:14:38, Jann Horn wrote:
> On Wed, Aug 7, 2019 at 7:16 PM Joel Fernandes (Google)
> <joel@joelfernandes.org> wrote:
> > [...]
>
> What happens when you use this interface on shared pages, like memory
> inherited from the zygote, library file mappings and so on? If two
> profilers ran concurrently for two different processes that both map
> the same libraries, would they end up messing up each other's data?

Yup, the PageIdle state is shared. That is the page_idle semantic even
now, IIRC.

> Can this be used to observe which library pages other processes are
> accessing, even if you don't have access to those processes, as long
> as you can map the same libraries? I realize that there are already a
> bunch of ways to do that with side channels and such; but if you're
> adding an interface that allows this by design, it seems to me like
> something that should be gated behind some sort of privilege check.

Hmm, you need to be privileged to get the pfn now, and without that you
cannot get to any page, so the new interface is weakening the rules.
Maybe we should limit setting the idle state to processes with write
access. Or do you think that even observing idle status is useful for
practical side channel attacks? If yes, is that a problem of the
profiler which does potentially dangerous things?
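[Editorial note: the shared-state hazard under discussion can be reduced to a toy model. The idle bit lives on the shared physical page, not on either process's mapping, so two profilers watching different processes that map the same library page observe and clobber one bit. All names below are invented for illustration; none of this is kernel API.]

```c
#include <stdbool.h>

/* Models PG_idle on the single struct page backing a shared mapping. */
struct toy_page {
	bool idle;
};

/* Profiler marks the page idle through its own process's mapping. */
static void profiler_mark_idle(struct toy_page *p)
{
	p->idle = true;
}

/* An access through ANY mapping, by any process, clears the bit. */
static void process_touch(struct toy_page *p)
{
	p->idle = false;
}

/* The profiler later asks: did my process leave this page idle? */
static bool profiler_page_stayed_idle(struct toy_page *p)
{
	return p->idle;
}
```

In this model, if profiler A marks a shared library page idle and an unrelated process B then faults it in, A sees the page as "accessed" — which is both the cross-profiler interference and the observation channel Jann raises.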
On Tue, Aug 13, 2019 at 11:14:30AM +0200, Michal Hocko wrote:
> [...]
> I hope to find some more time to finish the review but let me point out
> that "it's new it is regression safe" is not really a great argument for
> a new user visible API.

Actually, I think you misunderstood me and took it out of context. I never
intended to say "it is regression safe". I meant to say it is "low risk",
as in that in all likelihood it should not hurt *existing users* of the
*old interface*. Also, as you know, it has been tested.

> If the API is flawed then this is likely going
> to kick us later and will be hard to fix. I am still not convinced about
> the swap part of the thing TBH.

Ok, then let us discuss it. As I mentioned before, without this we lose the
access information due to MADVISE or swapping. Minchan and Konstantin both
suggested it, which is why I also added it (other than my also realizing
that it is needed). For x86, it uses existing bits in the pte so it is not
adding any more bits. For arm64, it uses unused bits that the hardware
cannot use. So I don't see why this is an issue to you.

thanks,

- Joel
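[Editorial note: as a rough illustration of the scheme Joel describes — keeping the idle/access information in otherwise-unused bits of a swapped-out pte so it survives swap-out — here is a toy model. The bit layout (type:5, offset:50, idle:1) is invented for illustration and does not match the real x86 or arm64 swap entry encodings.]

```c
#include <stdint.h>

/*
 * Toy swap entry: low bits hold the swap type, middle bits the swap
 * offset, and one spare high bit carries a software "idle" flag that
 * would otherwise be lost when the page is swapped out.
 */
#define TOY_SWP_TYPE_BITS   5
#define TOY_SWP_OFFSET_BITS 50
#define TOY_SWP_IDLE_SHIFT  (TOY_SWP_TYPE_BITS + TOY_SWP_OFFSET_BITS)

typedef uint64_t toy_swp_entry_t;

static toy_swp_entry_t toy_swp_entry(unsigned int type, uint64_t offset)
{
	return (toy_swp_entry_t)type | (offset << TOY_SWP_TYPE_BITS);
}

static unsigned int toy_swp_type(toy_swp_entry_t e)
{
	return e & ((1u << TOY_SWP_TYPE_BITS) - 1);
}

static uint64_t toy_swp_offset(toy_swp_entry_t e)
{
	return (e >> TOY_SWP_TYPE_BITS) &
	       ((UINT64_C(1) << TOY_SWP_OFFSET_BITS) - 1);
}

/* Stash the idle flag in the spare bit; type and offset are untouched. */
static toy_swp_entry_t toy_swp_set_idle(toy_swp_entry_t e)
{
	return e | (UINT64_C(1) << TOY_SWP_IDLE_SHIFT);
}

static int toy_swp_is_idle(toy_swp_entry_t e)
{
	return (e >> TOY_SWP_IDLE_SHIFT) & 1;
}
```

The point of the exercise: because the flag rides in a bit the swap entry does not otherwise use, recording it costs no extra storage, which is the "not adding any more bits" argument above.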
On Tue 13-08-19 09:51:52, Joel Fernandes wrote: > On Tue, Aug 13, 2019 at 11:14:30AM +0200, Michal Hocko wrote: > > On Mon 12-08-19 10:56:20, Joel Fernandes wrote: > > > On Thu, Aug 08, 2019 at 10:00:44AM +0200, Michal Hocko wrote: > > > > On Wed 07-08-19 17:31:05, Joel Fernandes wrote: > > > > > On Wed, Aug 07, 2019 at 01:58:40PM -0700, Andrew Morton wrote: > > > > > > On Wed, 7 Aug 2019 16:45:30 -0400 Joel Fernandes <joel@joelfernandes.org> wrote: > > > > > > > > > > > > > On Wed, Aug 07, 2019 at 01:04:02PM -0700, Andrew Morton wrote: > > > > > > > > On Wed, 7 Aug 2019 13:15:54 -0400 "Joel Fernandes (Google)" <joel@joelfernandes.org> wrote: > > > > > > > > > > > > > > > > > In Android, we are using this for the heap profiler (heapprofd) which > > > > > > > > > profiles and pin points code paths which allocates and leaves memory > > > > > > > > > idle for long periods of time. This method solves the security issue > > > > > > > > > with userspace learning the PFN, and while at it is also shown to yield > > > > > > > > > better results than the pagemap lookup, the theory being that the window > > > > > > > > > where the address space can change is reduced by eliminating the > > > > > > > > > intermediate pagemap look up stage. In virtual address indexing, the > > > > > > > > > process's mmap_sem is held for the duration of the access. > > > > > > > > > > > > > > > > So is heapprofd a developer-only thing? Is heapprofd included in > > > > > > > > end-user android loads? If not then, again, wouldn't it be better to > > > > > > > > make the feature Kconfigurable so that Android developers can enable it > > > > > > > > during development then disable it for production kernels? > > > > > > > > > > > > > > Almost all of this code is already configurable with > > > > > > > CONFIG_IDLE_PAGE_TRACKING. If you disable it, then all of this code gets > > > > > > > disabled. > > > > > > > > > > > > > > Or are you referring to something else that needs to be made configurable? 
> > > > > > > > > > > > Yes - the 300+ lines of code which this patchset adds! > > > > > > > > > > > > The impacted people will be those who use the existing > > > > > > idle-page-tracking feature but who will not use the new feature. I > > > > > > guess we can assume this set is small... > > > > > > > > > > Yes, I think this set should be small. The code size increase of page_idle.o > > > > > is from ~1KB to ~2KB. Most of the extra space is consumed by > > > > > page_idle_proc_generic() function which this patch adds. I don't think adding > > > > > another CONFIG option to disable this while keeping existing > > > > > CONFIG_IDLE_PAGE_TRACKING enabled, is worthwhile but I am open to the > > > > > addition of such an option if anyone feels strongly about it. I believe that > > > > > once this patch is merged, most like this new interface being added is what > > > > > will be used more than the old interface (for some of the usecases) so it > > > > > makes sense to keep it alive with CONFIG_IDLE_PAGE_TRACKING. > > > > > > > > I would tend to agree with Joel here. The functionality falls into an > > > > existing IDLE_PAGE_TRACKING config option quite nicely. If there really > > > > are users who want to save some space and this is standing in the way > > > > then they can easily add a new config option with some justification so > > > > the savings are clear. Without that an additional config simply adds to > > > > the already existing configurability complexity and balkanization. > > > > > > Michal, Andrew, Minchan, > > > > > > Would you have any other review comments on the v5 series? This is just a new > > > interface that does not disrupt existing users of the older page-idle > > > tracking, so as such it is a safe change (as in, doesn't change existing > > > functionality except for the draining bug fix). 
> > > > I hope to find some more time to finish the review but let me point out > > that "it's new it is regression safe" is not really a great argument for > > a new user visible API. > > Actually, I think you misunderstood me and took it out of context. I never > intended to say "it is regression safe". I meant to say it is "low risk", as > in that in all likelihood should not be hurting *existing users* of the *old > interface*. Also as you know, it has been tested. Yeah, misreading on my end. > > If the API is flawed then this is likely going > > to kick us later and will be hard to fix. I am still not convinced about > > the swap part of the thing TBH. > > Ok, then let us discuss it. As I mentioned before, without this we lose the > access information due to MADVISE or swapping. Minchan and Konstantin both > suggested it that's why I also added it (other than me also realizing that it > is neeed). I have described my concerns about the general idle bit behavior after unmapping, pointing to a discrepancy with !anon pages, and I believe those haven't been addressed yet. Besides that, I am still not seeing, in the changelogs, any description of a usecase that would suffer from the lack of this functionality.
On Tue, Aug 13, 2019 at 12:08:56PM +0200, Michal Hocko wrote: > On Mon 12-08-19 20:14:38, Jann Horn wrote: > > On Wed, Aug 7, 2019 at 7:16 PM Joel Fernandes (Google) > > <joel@joelfernandes.org> wrote: > > > The page_idle tracking feature currently requires looking up the pagemap > > > for a process followed by interacting with /sys/kernel/mm/page_idle. > > > Looking up PFN from pagemap in Android devices is not supported by > > > unprivileged process and requires SYS_ADMIN and gives 0 for the PFN. > > > > > > This patch adds support to directly interact with page_idle tracking at > > > the PID level by introducing a /proc/<pid>/page_idle file. It follows > > > the exact same semantics as the global /sys/kernel/mm/page_idle, but now > > > looking up PFN through pagemap is not needed since the interface uses > > > virtual frame numbers, and at the same time also does not require > > > SYS_ADMIN. > > > > > > In Android, we are using this for the heap profiler (heapprofd) which > > > profiles and pin points code paths which allocates and leaves memory > > > idle for long periods of time. This method solves the security issue > > > with userspace learning the PFN, and while at it is also shown to yield > > > better results than the pagemap lookup, the theory being that the window > > > where the address space can change is reduced by eliminating the > > > intermediate pagemap look up stage. In virtual address indexing, the > > > process's mmap_sem is held for the duration of the access. > > > > What happens when you use this interface on shared pages, like memory > > inherited from the zygote, library file mappings and so on? If two > > profilers ran concurrently for two different processes that both map > > the same libraries, would they end up messing up each other's data? > > Yup PageIdle state is shared. That is the page_idle semantic even now > IIRC. Yes, that's right. This patch doesn't change that semantic. 
Idle page tracking at the core is a global procedure which is based on pages that can be shared. One of the usecases of the heap profiler is to enable profiling of pages that are shared between the zygote and any processes that are forked from it. In this case, I am told by our team working on the heap profiler that monitoring shared pages will help. > > Can this be used to observe which library pages other processes are > > accessing, even if you don't have access to those processes, as long > > as you can map the same libraries? I realize that there are already a > > bunch of ways to do that with side channels and such; but if you're > > adding an interface that allows this by design, it seems to me like > > something that should be gated behind some sort of privilege check. > > Hmm, you need to be priviledged to get the pfn now and without that you > cannot get to any page so the new interface is weakening the rules. > Maybe we should limit setting the idle state to processes with the write > status. Or do you think that even observing idle status is useful for > practical side channel attacks? If yes, is that a problem of the > profiler which does potentially dangerous things? The heap profiler is currently unprivileged. Would it address the concern Jann raised if the new interface were limited to anonymous private/shared pages, excluding file pages? Or is this even a real concern? thanks, - Joel
On Tue, Aug 13, 2019 at 4:25 PM Joel Fernandes <joel@joelfernandes.org> wrote: > On Tue, Aug 13, 2019 at 12:08:56PM +0200, Michal Hocko wrote: > > On Mon 12-08-19 20:14:38, Jann Horn wrote: > > > On Wed, Aug 7, 2019 at 7:16 PM Joel Fernandes (Google) > > > <joel@joelfernandes.org> wrote: > > > > The page_idle tracking feature currently requires looking up the pagemap > > > > for a process followed by interacting with /sys/kernel/mm/page_idle. > > > > Looking up PFN from pagemap in Android devices is not supported by > > > > unprivileged process and requires SYS_ADMIN and gives 0 for the PFN. > > > > > > > > This patch adds support to directly interact with page_idle tracking at > > > > the PID level by introducing a /proc/<pid>/page_idle file. It follows > > > > the exact same semantics as the global /sys/kernel/mm/page_idle, but now > > > > looking up PFN through pagemap is not needed since the interface uses > > > > virtual frame numbers, and at the same time also does not require > > > > SYS_ADMIN. > > > > > > > > In Android, we are using this for the heap profiler (heapprofd) which > > > > profiles and pin points code paths which allocates and leaves memory > > > > idle for long periods of time. This method solves the security issue > > > > with userspace learning the PFN, and while at it is also shown to yield > > > > better results than the pagemap lookup, the theory being that the window > > > > where the address space can change is reduced by eliminating the > > > > intermediate pagemap look up stage. In virtual address indexing, the > > > > process's mmap_sem is held for the duration of the access. > > > > > > What happens when you use this interface on shared pages, like memory > > > inherited from the zygote, library file mappings and so on? If two > > > profilers ran concurrently for two different processes that both map > > > the same libraries, would they end up messing up each other's data? > > > > Yup PageIdle state is shared. 
That is the page_idle semantic even now > > IIRC. > > Yes, that's right. This patch doesn't change that semantic. Idle page > tracking at the core is a global procedure which is based on pages that can > be shared. > > One of the usecases of the heap profiler is to enable profiling of pages that > are shared between zygote and any processes that are forked. In this case, > I am told by our team working on the heap profiler, that the monitoring of > shared pages will help. > > > > Can this be used to observe which library pages other processes are > > > accessing, even if you don't have access to those processes, as long > > > as you can map the same libraries? I realize that there are already a > > > bunch of ways to do that with side channels and such; but if you're > > > adding an interface that allows this by design, it seems to me like > > > something that should be gated behind some sort of privilege check. > > > > Hmm, you need to be priviledged to get the pfn now and without that you > > cannot get to any page so the new interface is weakening the rules. > > Maybe we should limit setting the idle state to processes with the write > > status. Or do you think that even observing idle status is useful for > > practical side channel attacks? If yes, is that a problem of the > > profiler which does potentially dangerous things? > > The heap profiler is currently unprivileged. Would it help the concern Jann > raised, if the new interface was limited to only anonymous private/shared > pages and not to file pages? Or, is this even a real concern? +Daniel Gruss in case he wants to provide some more detail; he has been involved in a lot of the public research around this topic. It is a bit of a concern when code that wasn't hardened as rigorously as cryptographic library code operates on secret values. 
A paper was published this year that abused mincore() in combination with tricks for flushing the page cache to obtain information about when shared read-only memory was accessed: <https://arxiv.org/pdf/1901.01161.pdf>. In response to that, the semantics of mincore() were changed to prevent that information from leaking (see commit 134fca9063ad4851de767d1768180e5dede9a881). On the other hand, an attacker could also use things like cache timing attacks instead of abusing operating system features; that is more hardware-specific, but it has a higher spatial granularity (typically 64 bytes instead of 4096 bytes). Timing-granularity-wise, I'm not sure whether the proposed interface would be more or less bad than existing cache side-channels on common architectures. There are papers that demonstrate things like being able to distinguish some classes of keyboard keys from others on an Android phone: <https://www.usenix.org/system/files/conference/usenixsecurity16/sec16_paper_lipp.pdf> I don't think limiting it to anonymous pages is necessarily enough to completely solve this; in a normal Linux environment, it might be good enough, but on Android, I'm worried about the CoW private memory from the zygote.
On Tue, Aug 13, 2019 at 12:09 PM Michal Hocko <mhocko@kernel.org> wrote: > On Mon 12-08-19 20:14:38, Jann Horn wrote: > > On Wed, Aug 7, 2019 at 7:16 PM Joel Fernandes (Google) > > <joel@joelfernandes.org> wrote: > > > The page_idle tracking feature currently requires looking up the pagemap > > > for a process followed by interacting with /sys/kernel/mm/page_idle. > > > Looking up PFN from pagemap in Android devices is not supported by > > > unprivileged process and requires SYS_ADMIN and gives 0 for the PFN. > > > > > > This patch adds support to directly interact with page_idle tracking at > > > the PID level by introducing a /proc/<pid>/page_idle file. It follows > > > the exact same semantics as the global /sys/kernel/mm/page_idle, but now > > > looking up PFN through pagemap is not needed since the interface uses > > > virtual frame numbers, and at the same time also does not require > > > SYS_ADMIN. > > > > > > In Android, we are using this for the heap profiler (heapprofd) which > > > profiles and pin points code paths which allocates and leaves memory > > > idle for long periods of time. This method solves the security issue > > > with userspace learning the PFN, and while at it is also shown to yield > > > better results than the pagemap lookup, the theory being that the window > > > where the address space can change is reduced by eliminating the > > > intermediate pagemap look up stage. In virtual address indexing, the > > > process's mmap_sem is held for the duration of the access. > > > > What happens when you use this interface on shared pages, like memory > > inherited from the zygote, library file mappings and so on? If two > > profilers ran concurrently for two different processes that both map > > the same libraries, would they end up messing up each other's data? > > Yup PageIdle state is shared. That is the page_idle semantic even now > IIRC. 
> > > Can this be used to observe which library pages other processes are > > accessing, even if you don't have access to those processes, as long > > as you can map the same libraries? I realize that there are already a > > bunch of ways to do that with side channels and such; but if you're > > adding an interface that allows this by design, it seems to me like > > something that should be gated behind some sort of privilege check. > > Hmm, you need to be priviledged to get the pfn now and without that you > cannot get to any page so the new interface is weakening the rules. > Maybe we should limit setting the idle state to processes with the write > status. Or do you think that even observing idle status is useful for > practical side channel attacks? If yes, is that a problem of the > profiler which does potentially dangerous things? I suppose read-only access isn't a real problem as long as the profiler isn't writing the idle state in a very tight loop... but I don't see a usecase where you'd actually want that? As far as I can tell, if you can't write the idle state, being able to read it is pretty much useless. If the profiler only wants to profile process-private memory, then that should be implementable in a safe way in principle, I think, but since Joel said that they want to profile CoW memory as well, I think that's inherently somewhat dangerous.
On Mon, Aug 12, 2019 at 08:14:38PM +0200, Jann Horn wrote:
[snip]
> > +/* Helper to get the start and end frame given a pos and count */
> > +static int page_idle_get_frames(loff_t pos, size_t count, struct mm_struct *mm,
> > +				unsigned long *start, unsigned long *end)
> > +{
> > +	unsigned long max_frame;
> > +
> > +	/* If an mm is not given, assume we want physical frames */
> > +	max_frame = mm ? (mm->task_size >> PAGE_SHIFT) : max_pfn;
> > +
> > +	if (pos % BITMAP_CHUNK_SIZE || count % BITMAP_CHUNK_SIZE)
> > +		return -EINVAL;
> > +
> > +	*start = pos * BITS_PER_BYTE;
> > +	if (*start >= max_frame)
> > +		return -ENXIO;
> > +
> > +	*end = *start + count * BITS_PER_BYTE;
> > +	if (*end > max_frame)
> > +		*end = max_frame;
> > +	return 0;
> > +}
>
> You could add some overflow checks for the multiplications. I haven't
> seen any place where it actually matters, but it seems unclean; and in
> particular, on a 32-bit architecture where the maximum user address is
> very high (like with a 4G:4G split), it looks like this function might
> theoretically return with `*start > *end`, which could be confusing to
> callers.

I could store the multiplication result in unsigned long long (since we are
bounds checking with max_frame, start > end would not occur). Something like
the following (with extraneous casts). But I'll think some more about the
point you are raising.

diff --git a/mm/page_idle.c b/mm/page_idle.c
index 1ea21bbb36cb..8fd7a5559986 100644
--- a/mm/page_idle.c
+++ b/mm/page_idle.c
@@ -135,6 +135,7 @@ static int page_idle_get_frames(loff_t pos, size_t count, struct mm_struct *mm,
 				unsigned long *start, unsigned long *end)
 {
 	unsigned long max_frame;
+	unsigned long long tmp;
 
 	/* If an mm is not given, assume we want physical frames */
 	max_frame = mm ? (mm->task_size >> PAGE_SHIFT) : max_pfn;
@@ -142,13 +143,16 @@ static int page_idle_get_frames(loff_t pos, size_t count, struct mm_struct *mm,
 	if (pos % BITMAP_CHUNK_SIZE || count % BITMAP_CHUNK_SIZE)
 		return -EINVAL;
 
-	*start = pos * BITS_PER_BYTE;
-	if (*start >= max_frame)
+	tmp = pos * BITS_PER_BYTE;
+	if (tmp >= (unsigned long long)max_frame)
 		return -ENXIO;
+	*start = (unsigned long)tmp;
 
-	*end = *start + count * BITS_PER_BYTE;
-	if (*end > max_frame)
+	tmp = *start + count * BITS_PER_BYTE;
+	if (tmp > (unsigned long long)max_frame)
 		*end = max_frame;
+	else
+		*end = (unsigned long)tmp;
 	return 0;
 }

> [...]
> >  	for (; pfn < end_pfn; pfn++) {
> >  		bit = pfn % BITMAP_CHUNK_BITS;
> >  		if (!bit)
> >  			*out = 0ULL;
> > -		page = page_idle_get_page(pfn);
> > -		if (page) {
> > -			if (page_is_idle(page)) {
> > -				/*
> > -				 * The page might have been referenced via a
> > -				 * pte, in which case it is not idle. Clear
> > -				 * refs and recheck.
> > -				 */
> > -				page_idle_clear_pte_refs(page);
> > -				if (page_is_idle(page))
> > -					*out |= 1ULL << bit;
> > -			}
> > +		page = page_idle_get_page_pfn(pfn);
> > +		if (page && page_idle_pte_check(page)) {
> > +			*out |= 1ULL << bit;
> >  			put_page(page);
> >  		}
>
> The `page && !page_idle_pte_check(page)` case looks like it's missing
> a put_page(); you probably intended to write something like this?
>
> page = page_idle_get_page_pfn(pfn);
> if (page) {
>     if (page_idle_pte_check(page))
>         *out |= 1ULL << bit;
>     put_page(page);
> }

Ah, you're right. Will fix, good eyes and thank you!

> [...]
> > +/* page_idle tracking for /proc/<pid>/page_idle */
> > +
> > +struct page_node {
> > +	struct page *page;
> > +	unsigned long addr;
> > +	struct list_head list;
> > +};
> > +
> > +struct page_idle_proc_priv {
> > +	unsigned long start_addr;
> > +	char *buffer;
> > +	int write;
> > +
> > +	/* Pre-allocate and provide nodes to pte_page_idle_proc_add() */
> > +	struct page_node *page_nodes;
> > +	int cur_page_node;
> > +	struct list_head *idle_page_list;
> > +};
>
> A linked list is a weird data structure to use if the list elements
> are just consecutive array elements.

Not all of the pages will be considered in the later parts of the code; some
pages are skipped. However, in v4 I added an array to allocate all the
page_node structures, since Andrew did not want GFP_ATOMIC allocations of
individual list nodes. So I think I could get rid of the linked list and
leave the unused page_node(s) as NULL, but I don't know; I have to check if
that is possible. I could be missing a corner case; regardless, I'll make a
note of this and try!

> > +/*
> > + * Add page to list to be set as idle later.
> > + */
> > +static void pte_page_idle_proc_add(struct page *page,
> > +			unsigned long addr, struct mm_walk *walk)
> > +{
> > +	struct page *page_get = NULL;
> > +	struct page_node *pn;
> > +	int bit;
> > +	unsigned long frames;
> > +	struct page_idle_proc_priv *priv = walk->private;
> > +	u64 *chunk = (u64 *)priv->buffer;
> > +
> > +	if (priv->write) {
> > +		VM_BUG_ON(!page);
> > +
> > +		/* Find whether this page was asked to be marked */
> > +		frames = (addr - priv->start_addr) >> PAGE_SHIFT;
> > +		bit = frames % BITMAP_CHUNK_BITS;
> > +		chunk = &chunk[frames / BITMAP_CHUNK_BITS];
> > +		if (((*chunk >> bit) & 1) == 0)
> > +			return;
>
> This means that BITMAP_CHUNK_SIZE is UAPI on big-endian systems,
> right? My opinion is that it would be slightly nicer to design the
> UAPI such that incrementing virtual addresses are mapped to
> incrementing offsets in the buffer (iow, either use bytewise access or
> use little-endian), but I'm not going to ask you to redesign the UAPI
> this late.

That would also be slow and consume more memory in userspace buffers.
Currently, a 64-bit (8-byte) chunk accounts for 64 pages, or 256KB worth of
address space. Also, I wanted to keep the interface consistent with the
global /sys/kernel/mm/page_idle interface.

> [...]
> > +ssize_t page_idle_proc_generic(struct file *file, char __user *ubuff,
> > +			size_t count, loff_t *pos, int write)
> > +{
> [...]
> > +	down_read(&mm->mmap_sem);
> [...]
> > +
> > +	if (!write && !walk_error)
> > +		ret = copy_to_user(ubuff, buffer, count);
> > +
> > +	up_read(&mm->mmap_sem);
>
> I'd move the up_read() above the copy_to_user(); copy_to_user() can
> block, and there's no reason to hold the mmap_sem across
> copy_to_user().

Will do.

> Sorry about only chiming in at v5 with all this.

No problem, I'm glad you did!

thanks,

- Joel
On 8/13/19 5:29 PM, Jann Horn wrote: > On Tue, Aug 13, 2019 at 12:09 PM Michal Hocko <mhocko@kernel.org> wrote: >> On Mon 12-08-19 20:14:38, Jann Horn wrote: >>> On Wed, Aug 7, 2019 at 7:16 PM Joel Fernandes (Google) >>> <joel@joelfernandes.org> wrote: >>>> The page_idle tracking feature currently requires looking up the pagemap >>>> for a process followed by interacting with /sys/kernel/mm/page_idle. >>>> Looking up PFN from pagemap in Android devices is not supported by >>>> unprivileged process and requires SYS_ADMIN and gives 0 for the PFN. >>>> >>>> This patch adds support to directly interact with page_idle tracking at >>>> the PID level by introducing a /proc/<pid>/page_idle file. It follows >>>> the exact same semantics as the global /sys/kernel/mm/page_idle, but now >>>> looking up PFN through pagemap is not needed since the interface uses >>>> virtual frame numbers, and at the same time also does not require >>>> SYS_ADMIN. >>>> >>>> In Android, we are using this for the heap profiler (heapprofd) which >>>> profiles and pin points code paths which allocates and leaves memory >>>> idle for long periods of time. This method solves the security issue >>>> with userspace learning the PFN, and while at it is also shown to yield >>>> better results than the pagemap lookup, the theory being that the window >>>> where the address space can change is reduced by eliminating the >>>> intermediate pagemap look up stage. In virtual address indexing, the >>>> process's mmap_sem is held for the duration of the access. >>> >>> What happens when you use this interface on shared pages, like memory >>> inherited from the zygote, library file mappings and so on? If two >>> profilers ran concurrently for two different processes that both map >>> the same libraries, would they end up messing up each other's data? >> >> Yup PageIdle state is shared. That is the page_idle semantic even now >> IIRC. 
>> >>> Can this be used to observe which library pages other processes are >>> accessing, even if you don't have access to those processes, as long >>> as you can map the same libraries? I realize that there are already a >>> bunch of ways to do that with side channels and such; but if you're >>> adding an interface that allows this by design, it seems to me like >>> something that should be gated behind some sort of privilege check. >> >> Hmm, you need to be priviledged to get the pfn now and without that you >> cannot get to any page so the new interface is weakening the rules. >> Maybe we should limit setting the idle state to processes with the write >> status. Or do you think that even observing idle status is useful for >> practical side channel attacks? If yes, is that a problem of the >> profiler which does potentially dangerous things? > > I suppose read-only access isn't a real problem as long as the > profiler isn't writing the idle state in a very tight loop... but I > don't see a usecase where you'd actually want that? As far as I can > tell, if you can't write the idle state, being able to read it is > pretty much useless. > > If the profiler only wants to profile process-private memory, then > that should be implementable in a safe way in principle, I think, but > since Joel said that they want to profile CoW memory as well, I think > that's inherently somewhat dangerous. I agree that allowing profiling of shared pages would leak information. To me the use case is not entirely clear. This is not a feature that would normally be run in everyday computer usage, right?
On Tue, Aug 13, 2019 at 5:30 PM Joel Fernandes <joel@joelfernandes.org> wrote:
> On Mon, Aug 12, 2019 at 08:14:38PM +0200, Jann Horn wrote:
> [snip]
> > > +/* Helper to get the start and end frame given a pos and count */
> > > +static int page_idle_get_frames(loff_t pos, size_t count, struct mm_struct *mm,
> > > +				unsigned long *start, unsigned long *end)
> > > +{
> > > +	unsigned long max_frame;
> > > +
> > > +	/* If an mm is not given, assume we want physical frames */
> > > +	max_frame = mm ? (mm->task_size >> PAGE_SHIFT) : max_pfn;
> > > +
> > > +	if (pos % BITMAP_CHUNK_SIZE || count % BITMAP_CHUNK_SIZE)
> > > +		return -EINVAL;
> > > +
> > > +	*start = pos * BITS_PER_BYTE;
> > > +	if (*start >= max_frame)
> > > +		return -ENXIO;
> > > +
> > > +	*end = *start + count * BITS_PER_BYTE;
> > > +	if (*end > max_frame)
> > > +		*end = max_frame;
> > > +	return 0;
> > > +}
> >
> > You could add some overflow checks for the multiplications. I haven't
> > seen any place where it actually matters, but it seems unclean; and in
> > particular, on a 32-bit architecture where the maximum user address is
> > very high (like with a 4G:4G split), it looks like this function might
> > theoretically return with `*start > *end`, which could be confusing to
> > callers.
>
> I could store the multiplication result in unsigned long long (since we are
> bounds checking with max_frame, start > end would not occur). Something like
> the following (with extraneous casts). But I'll think some more about the
> point you are raising.

check_mul_overflow() exists and could make that a bit cleaner.

> > This means that BITMAP_CHUNK_SIZE is UAPI on big-endian systems,
> > right? My opinion is that it would be slightly nicer to design the
> > UAPI such that incrementing virtual addresses are mapped to
> > incrementing offsets in the buffer (iow, either use bytewise access or
> > use little-endian), but I'm not going to ask you to redesign the UAPI
> > this late.
> That would also be slow and consume more memory in userspace buffers.
> Currently, a 64-bit (8 byte) chunk accounts for 64 pages worth or 256KB.

I still wanted to use one bit per page; I just wanted to rearrange the bits.
So the first byte would always contain 8 bits corresponding to the first 8
pages, instead of corresponding to pages 56-63 on some systems depending on
endianness. Anyway, this is a moot point, since as you said...

> Also I wanted to keep the interface consistent with the global
> /sys/kernel/mm/page_idle interface.

Sorry, I missed that this is already UAPI in the global interface. I agree,
using a different API for the per-process interface would be a bad idea.
On Tue, Aug 13, 2019 at 05:34:16PM +0200, Daniel Gruss wrote: > On 8/13/19 5:29 PM, Jann Horn wrote: > > On Tue, Aug 13, 2019 at 12:09 PM Michal Hocko <mhocko@kernel.org> wrote: > >> On Mon 12-08-19 20:14:38, Jann Horn wrote: > >>> On Wed, Aug 7, 2019 at 7:16 PM Joel Fernandes (Google) > >>> <joel@joelfernandes.org> wrote: > >>>> The page_idle tracking feature currently requires looking up the pagemap > >>>> for a process followed by interacting with /sys/kernel/mm/page_idle. > >>>> Looking up PFN from pagemap in Android devices is not supported by > >>>> unprivileged process and requires SYS_ADMIN and gives 0 for the PFN. > >>>> > >>>> This patch adds support to directly interact with page_idle tracking at > >>>> the PID level by introducing a /proc/<pid>/page_idle file. It follows > >>>> the exact same semantics as the global /sys/kernel/mm/page_idle, but now > >>>> looking up PFN through pagemap is not needed since the interface uses > >>>> virtual frame numbers, and at the same time also does not require > >>>> SYS_ADMIN. > >>>> > >>>> In Android, we are using this for the heap profiler (heapprofd) which > >>>> profiles and pin points code paths which allocates and leaves memory > >>>> idle for long periods of time. This method solves the security issue > >>>> with userspace learning the PFN, and while at it is also shown to yield > >>>> better results than the pagemap lookup, the theory being that the window > >>>> where the address space can change is reduced by eliminating the > >>>> intermediate pagemap look up stage. In virtual address indexing, the > >>>> process's mmap_sem is held for the duration of the access. > >>> > >>> What happens when you use this interface on shared pages, like memory > >>> inherited from the zygote, library file mappings and so on? If two > >>> profilers ran concurrently for two different processes that both map > >>> the same libraries, would they end up messing up each other's data? > >> > >> Yup PageIdle state is shared. 
That is the page_idle semantic even now > >> IIRC. > >> > >>> Can this be used to observe which library pages other processes are > >>> accessing, even if you don't have access to those processes, as long > >>> as you can map the same libraries? I realize that there are already a > >>> bunch of ways to do that with side channels and such; but if you're > >>> adding an interface that allows this by design, it seems to me like > >>> something that should be gated behind some sort of privilege check. > >> > >> Hmm, you need to be priviledged to get the pfn now and without that you > >> cannot get to any page so the new interface is weakening the rules. > >> Maybe we should limit setting the idle state to processes with the write > >> status. Or do you think that even observing idle status is useful for > >> practical side channel attacks? If yes, is that a problem of the > >> profiler which does potentially dangerous things? > > > > I suppose read-only access isn't a real problem as long as the > > profiler isn't writing the idle state in a very tight loop... but I > > don't see a usecase where you'd actually want that? As far as I can > > tell, if you can't write the idle state, being able to read it is > > pretty much useless. > > > > If the profiler only wants to profile process-private memory, then > > that should be implementable in a safe way in principle, I think, but > > since Joel said that they want to profile CoW memory as well, I think > > that's inherently somewhat dangerous. > > I agree that allowing profiling of shared pages would leak information. Will think more about it. If we limit it to private pages, then it could become useless. Consider a scenario: a process allocates some memory, then forks a bunch of worker processes that read that memory and perform some work with it. Per-PID page idle tracking is now run on the parent process. The pages should now appear to be actively accessed (not idle). 
If we don't track shared pages, then we cannot detect if those pages are really due to memory leaking, or if they are there for a purpose and are actively used. > To me the use case is not entirely clear. This is not a feature that > would normally be run in everyday computer usage, right? Generally, this to be used as a debugging feature that helps developers detect memory leaks in their programs. thanks, - Joel
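To make the workflow discussed above concrete, here is a minimal user-space sketch of how a profiler would address the per-process bitmap. It assumes 4K pages; the helper names are illustrative, not part of the patch. The offset/bit math follows the stated semantics (same layout as the global /sys/kernel/mm/page_idle bitmap: one bit per frame, packed into u64 chunks, indexed here by virtual frame number).

```c
#include <stdint.h>

#define PAGE_SHIFT_ASSUMED 12   /* assume 4K pages */
#define CHUNK_BITS 64           /* one u64 bitmap chunk covers 64 frames */

/* Byte offset into /proc/<pid>/page_idle of the u64 chunk covering vaddr.
 * Accesses must be 8-byte sized and aligned, per the interface semantics. */
static long page_idle_chunk_offset(uint64_t vaddr)
{
        uint64_t vfn = vaddr >> PAGE_SHIFT_ASSUMED;

        return (long)((vfn / CHUNK_BITS) * sizeof(uint64_t));
}

/* Bit within that chunk corresponding to vaddr's virtual frame. */
static int page_idle_chunk_bit(uint64_t vaddr)
{
        uint64_t vfn = vaddr >> PAGE_SHIFT_ASSUMED;

        return (int)(vfn % CHUNK_BITS);
}

/*
 * A profiling round would then look roughly like:
 *
 *   fd = open("/proc/<pid>/page_idle", O_RDWR);
 *   ones = ~0ULL;
 *   pwrite(fd, &ones, 8, page_idle_chunk_offset(addr));  // mark idle
 *   ... let the workload run ...
 *   pread(fd, &val, 8, page_idle_chunk_offset(addr));
 *   still_idle = (val >> page_idle_chunk_bit(addr)) & 1;
 */
```

The pure offset/bit helpers can be exercised without the kernel interface; the commented pread/pwrite round is the part that needs a kernel with the patch applied.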
On Tue 13-08-19 17:29:09, Jann Horn wrote:
> On Tue, Aug 13, 2019 at 12:09 PM Michal Hocko <mhocko@kernel.org> wrote:
> > On Mon 12-08-19 20:14:38, Jann Horn wrote:
> > > On Wed, Aug 7, 2019 at 7:16 PM Joel Fernandes (Google) <joel@joelfernandes.org> wrote:
> > > > The page_idle tracking feature currently requires looking up the pagemap for a process followed by interacting with /sys/kernel/mm/page_idle. Looking up PFN from pagemap in Android devices is not supported by unprivileged process and requires SYS_ADMIN and gives 0 for the PFN.
> > > >
> > > > This patch adds support to directly interact with page_idle tracking at the PID level by introducing a /proc/<pid>/page_idle file. It follows the exact same semantics as the global /sys/kernel/mm/page_idle, but now looking up PFN through pagemap is not needed since the interface uses virtual frame numbers, and at the same time also does not require SYS_ADMIN.
> > > >
> > > > In Android, we are using this for the heap profiler (heapprofd) which profiles and pin points code paths which allocates and leaves memory idle for long periods of time. This method solves the security issue with userspace learning the PFN, and while at it is also shown to yield better results than the pagemap lookup, the theory being that the window where the address space can change is reduced by eliminating the intermediate pagemap look up stage. In virtual address indexing, the process's mmap_sem is held for the duration of the access.
> > >
> > > What happens when you use this interface on shared pages, like memory inherited from the zygote, library file mappings and so on? If two profilers ran concurrently for two different processes that both map the same libraries, would they end up messing up each other's data?
> >
> > Yup PageIdle state is shared. That is the page_idle semantic even now IIRC.
> >
> > > Can this be used to observe which library pages other processes are accessing, even if you don't have access to those processes, as long as you can map the same libraries? I realize that there are already a bunch of ways to do that with side channels and such; but if you're adding an interface that allows this by design, it seems to me like something that should be gated behind some sort of privilege check.
> >
> > Hmm, you need to be priviledged to get the pfn now and without that you cannot get to any page so the new interface is weakening the rules. Maybe we should limit setting the idle state to processes with the write status. Or do you think that even observing idle status is useful for practical side channel attacks? If yes, is that a problem of the profiler which does potentially dangerous things?
>
> I suppose read-only access isn't a real problem as long as the profiler isn't writing the idle state in a very tight loop... but I don't see a usecase where you'd actually want that? As far as I can tell, if you can't write the idle state, being able to read it is pretty much useless.
>
> If the profiler only wants to profile process-private memory, then that should be implementable in a safe way in principle, I think, but since Joel said that they want to profile CoW memory as well, I think that's inherently somewhat dangerous.

I cannot really say how useful that would be but I can see that implementing ownership checks would be really non-trivial for shared pages. Reducing the interface to exclusive pages would make it easier as you noted but less helpful.

Besides that the attack vector shouldn't be really much different from the page cache access, right? So essentially the can_do_mincore model.

I guess we want to document that page idle tracking should be used with care because it potentially opens a side channel opportunity if used on sensitive data.
On Wed, Aug 14, 2019 at 09:56:01AM +0200, Michal Hocko wrote:
[snip]
> > > > Can this be used to observe which library pages other processes are accessing, even if you don't have access to those processes, as long as you can map the same libraries? I realize that there are already a bunch of ways to do that with side channels and such; but if you're adding an interface that allows this by design, it seems to me like something that should be gated behind some sort of privilege check.
> > >
> > > Hmm, you need to be priviledged to get the pfn now and without that you cannot get to any page so the new interface is weakening the rules. Maybe we should limit setting the idle state to processes with the write status. Or do you think that even observing idle status is useful for practical side channel attacks? If yes, is that a problem of the profiler which does potentially dangerous things?
> >
> > I suppose read-only access isn't a real problem as long as the profiler isn't writing the idle state in a very tight loop... but I don't see a usecase where you'd actually want that? As far as I can tell, if you can't write the idle state, being able to read it is pretty much useless.
> >
> > If the profiler only wants to profile process-private memory, then that should be implementable in a safe way in principle, I think, but since Joel said that they want to profile CoW memory as well, I think that's inherently somewhat dangerous.
>
> I cannot really say how useful that would be but I can see that implementing ownership checks would be really non-trivial for shared pages. Reducing the interface to exclusive pages would make it easier as you noted but less helpful.
>
> Besides that the attack vector shouldn't be really much different from the page cache access, right? So essentially can_do_mincore model.
>
> I guess we want to document that page idle tracking should be used with care because it potentially opens a side channel opportunity if used on sensitive data.

I have been thinking about this and discussing it with our heap profiler folks. Not being able to track shared pages would be a limitation, but I don't see any way forward considering this security concern, so maybe we have to limit what we can do. I will look into implementing this without doing the rmap walk, but still making it work on shared pages from the point of view of the process being tracked. It just would no longer go through the PTEs of *other* processes sharing the page.

My current thought is to rely on just the PTE accessed bit, and not use the PageIdle flag at all. But we'd still set the PageYoung flag so that the reclaim code still sees the page as accessed.

The reasons I feel like avoiding the PageIdle flag are:
1. It looks like mark_page_accessed() can be called from other paths, which can also result in some kind of side-channel issue if a page was shared.
2. I don't think I need the PageIdle flag, since the accessed bit alone should let me know, although it could be a bit slower: previously I did not need to check every PTE, and if the PageIdle flag was already cleared, the page was declared as not idle.

At least this series resulted in a bug fix and a tonne of learning, so thank you everyone! Any other thoughts?

thanks,

- Joel
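The accessed-bit-only idea above can be modeled outside the kernel. This is a toy user-space sketch, not kernel code: the struct and function names are invented for illustration. A page is reported idle only if none of the PTEs mapping it in the tracked process have the accessed bit set, and the bits are cleared as they are scanned so the next scan observes only subsequent accesses (in the real kernel the cleared bit would additionally be recorded via PageYoung so reclaim still sees the reference).

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a hardware PTE: the accessed bit is set by the MMU
 * on access in the real thing; here a test sets it by hand. */
struct toy_pte {
        bool present;
        bool accessed;
};

/* Scan the PTEs mapping one page in the tracked process only (no rmap
 * walk into other processes). The page is idle iff no accessed bit is
 * set; test-and-clear so the next scan starts fresh. */
static bool page_idle_from_ptes(struct toy_pte *ptes, size_t n)
{
        bool idle = true;
        size_t i;

        for (i = 0; i < n; i++) {
                if (!ptes[i].present)
                        continue;
                if (ptes[i].accessed) {
                        ptes[i].accessed = false; /* + set PageYoung in a real kernel */
                        idle = false;
                }
        }
        return idle;
}
```

A first scan over a just-touched page reports it non-idle and clears the bits; a second scan with no intervening access reports it idle, which is exactly the two-phase mark/observe cycle the interface is built around.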
diff --git a/fs/proc/base.c b/fs/proc/base.c
index ebea9501afb8..fd2f74bd4e35 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -3039,6 +3039,9 @@ static const struct pid_entry tgid_base_stuff[] = {
 	REG("smaps", S_IRUGO, proc_pid_smaps_operations),
 	REG("smaps_rollup", S_IRUGO, proc_pid_smaps_rollup_operations),
 	REG("pagemap", S_IRUSR, proc_pagemap_operations),
+#ifdef CONFIG_IDLE_PAGE_TRACKING
+	REG("page_idle", S_IRUSR|S_IWUSR, proc_page_idle_operations),
+#endif
 #endif
 #ifdef CONFIG_SECURITY
 	DIR("attr", S_IRUGO|S_IXUGO, proc_attr_dir_inode_operations, proc_attr_dir_operations),
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index cd0c8d5ce9a1..bc9371880c63 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -293,6 +293,7 @@ extern const struct file_operations proc_pid_smaps_operations;
 extern const struct file_operations proc_pid_smaps_rollup_operations;
 extern const struct file_operations proc_clear_refs_operations;
 extern const struct file_operations proc_pagemap_operations;
+extern const struct file_operations proc_page_idle_operations;
 
 extern unsigned long task_vsize(struct mm_struct *);
 extern unsigned long task_statm(struct mm_struct *,
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 582c5e680176..192ffc4e24d7 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1650,6 +1650,48 @@ const struct file_operations proc_pagemap_operations = {
 	.open		= pagemap_open,
 	.release	= pagemap_release,
 };
+
+#ifdef CONFIG_IDLE_PAGE_TRACKING
+static ssize_t proc_page_idle_read(struct file *file, char __user *buf,
+				   size_t count, loff_t *ppos)
+{
+	return page_idle_proc_read(file, buf, count, ppos);
+}
+
+static ssize_t proc_page_idle_write(struct file *file, const char __user *buf,
+				    size_t count, loff_t *ppos)
+{
+	return page_idle_proc_write(file, (char __user *)buf, count, ppos);
+}
+
+static int proc_page_idle_open(struct inode *inode, struct file *file)
+{
+	struct mm_struct *mm;
+
+	mm = proc_mem_open(inode, PTRACE_MODE_READ);
+	if (IS_ERR(mm))
+		return PTR_ERR(mm);
+	file->private_data = mm;
+	return 0;
+}
+
+static int proc_page_idle_release(struct inode *inode, struct file *file)
+{
+	struct mm_struct *mm = file->private_data;
+
+	mmdrop(mm);
+	return 0;
+}
+
+const struct file_operations proc_page_idle_operations = {
+	.llseek		= mem_lseek, /* borrow this */
+	.read		= proc_page_idle_read,
+	.write		= proc_page_idle_write,
+	.open		= proc_page_idle_open,
+	.release	= proc_page_idle_release,
+};
+#endif /* CONFIG_IDLE_PAGE_TRACKING */
+
 #endif /* CONFIG_PROC_PAGE_MONITOR */
 
 #ifdef CONFIG_NUMA
diff --git a/include/linux/page_idle.h b/include/linux/page_idle.h
index 1e894d34bdce..a765a6d14e1a 100644
--- a/include/linux/page_idle.h
+++ b/include/linux/page_idle.h
@@ -106,6 +106,10 @@ static inline void clear_page_idle(struct page *page)
 }
 #endif /* CONFIG_64BIT */
 
+ssize_t page_idle_proc_write(struct file *file,
+			     char __user *buf, size_t count, loff_t *ppos);
+ssize_t page_idle_proc_read(struct file *file,
+			    char __user *buf, size_t count, loff_t *ppos);
 #else /* !CONFIG_IDLE_PAGE_TRACKING */
 
 static inline bool page_is_young(struct page *page)
diff --git a/mm/page_idle.c b/mm/page_idle.c
index 295512465065..9de4f4c67a8c 100644
--- a/mm/page_idle.c
+++ b/mm/page_idle.c
@@ -5,17 +5,22 @@
 #include <linux/sysfs.h>
 #include <linux/kobject.h>
 #include <linux/mm.h>
-#include <linux/mmzone.h>
-#include <linux/pagemap.h>
-#include <linux/rmap.h>
 #include <linux/mmu_notifier.h>
+#include <linux/mmzone.h>
 #include <linux/page_ext.h>
 #include <linux/page_idle.h>
+#include <linux/pagemap.h>
+#include <linux/rmap.h>
+#include <linux/sched/mm.h>
+#include <linux/swap.h>
+#include <linux/swapops.h>
 
 #define BITMAP_CHUNK_SIZE	sizeof(u64)
 #define BITMAP_CHUNK_BITS	(BITMAP_CHUNK_SIZE * BITS_PER_BYTE)
 
 /*
+ * Get a reference to a page for idle tracking purposes, with additional checks.
+ *
  * Idle page tracking only considers user memory pages, for other types of
  * pages the idle flag is always unset and an attempt to set it is silently
  * ignored.
@@ -25,18 +30,13 @@
  * page tracking. With such an indicator of user pages we can skip isolated
  * pages, but since there are not usually many of them, it will hardly affect
  * the overall result.
- *
- * This function tries to get a user memory page by pfn as described above.
  */
-static struct page *page_idle_get_page(unsigned long pfn)
+static struct page *page_idle_get_page(struct page *page_in)
 {
 	struct page *page;
 	pg_data_t *pgdat;
 
-	if (!pfn_valid(pfn))
-		return NULL;
-
-	page = pfn_to_page(pfn);
+	page = page_in;
 	if (!page || !PageLRU(page) ||
 	    !get_page_unless_zero(page))
 		return NULL;
@@ -51,6 +51,18 @@ static struct page *page_idle_get_page(unsigned long pfn)
 	return page;
 }
 
+/*
+ * This function tries to get a user memory page by pfn as described above.
+ */
+static struct page *page_idle_get_page_pfn(unsigned long pfn)
+{
+
+	if (!pfn_valid(pfn))
+		return NULL;
+
+	return page_idle_get_page(pfn_to_page(pfn));
+}
+
 static bool page_idle_clear_pte_refs_one(struct page *page,
 					struct vm_area_struct *vma,
 					unsigned long addr, void *arg)
@@ -118,6 +130,47 @@ static void page_idle_clear_pte_refs(struct page *page)
 	unlock_page(page);
 }
 
+/* Helper to get the start and end frame given a pos and count */
+static int page_idle_get_frames(loff_t pos, size_t count, struct mm_struct *mm,
+				unsigned long *start, unsigned long *end)
+{
+	unsigned long max_frame;
+
+	/* If an mm is not given, assume we want physical frames */
+	max_frame = mm ? (mm->task_size >> PAGE_SHIFT) : max_pfn;
+
+	if (pos % BITMAP_CHUNK_SIZE || count % BITMAP_CHUNK_SIZE)
+		return -EINVAL;
+
+	*start = pos * BITS_PER_BYTE;
+	if (*start >= max_frame)
+		return -ENXIO;
+
+	*end = *start + count * BITS_PER_BYTE;
+	if (*end > max_frame)
+		*end = max_frame;
+	return 0;
+}
+
+static bool page_idle_pte_check(struct page *page)
+{
+	if (!page)
+		return false;
+
+	if (page_is_idle(page)) {
+		/*
+		 * The page might have been referenced via a
+		 * pte, in which case it is not idle. Clear
+		 * refs and recheck.
+		 */
+		page_idle_clear_pte_refs(page);
+		if (page_is_idle(page))
+			return true;
+	}
+
+	return false;
+}
+
 static ssize_t page_idle_bitmap_read(struct file *file, struct kobject *kobj,
 				     struct bin_attribute *attr, char *buf,
 				     loff_t pos, size_t count)
@@ -125,35 +178,21 @@ static ssize_t page_idle_bitmap_read(struct file *file, struct kobject *kobj,
 	u64 *out = (u64 *)buf;
 	struct page *page;
 	unsigned long pfn, end_pfn;
-	int bit;
+	int bit, ret;
 
-	if (pos % BITMAP_CHUNK_SIZE || count % BITMAP_CHUNK_SIZE)
-		return -EINVAL;
-
-	pfn = pos * BITS_PER_BYTE;
-	if (pfn >= max_pfn)
-		return 0;
-
-	end_pfn = pfn + count * BITS_PER_BYTE;
-	if (end_pfn > max_pfn)
-		end_pfn = max_pfn;
+	ret = page_idle_get_frames(pos, count, NULL, &pfn, &end_pfn);
+	if (ret == -ENXIO)
+		return 0; /* Reads beyond max_pfn do nothing */
+	else if (ret)
+		return ret;
 
 	for (; pfn < end_pfn; pfn++) {
 		bit = pfn % BITMAP_CHUNK_BITS;
 		if (!bit)
 			*out = 0ULL;
-		page = page_idle_get_page(pfn);
-		if (page) {
-			if (page_is_idle(page)) {
-				/*
-				 * The page might have been referenced via a
-				 * pte, in which case it is not idle. Clear
-				 * refs and recheck.
-				 */
-				page_idle_clear_pte_refs(page);
-				if (page_is_idle(page))
-					*out |= 1ULL << bit;
-			}
+		page = page_idle_get_page_pfn(pfn);
+		if (page && page_idle_pte_check(page)) {
+			*out |= 1ULL << bit;
 			put_page(page);
 		}
 		if (bit == BITMAP_CHUNK_BITS - 1)
@@ -170,23 +209,16 @@ static ssize_t page_idle_bitmap_write(struct file *file, struct kobject *kobj,
 	const u64 *in = (u64 *)buf;
 	struct page *page;
 	unsigned long pfn, end_pfn;
-	int bit;
+	int bit, ret;
 
-	if (pos % BITMAP_CHUNK_SIZE || count % BITMAP_CHUNK_SIZE)
-		return -EINVAL;
-
-	pfn = pos * BITS_PER_BYTE;
-	if (pfn >= max_pfn)
-		return -ENXIO;
-
-	end_pfn = pfn + count * BITS_PER_BYTE;
-	if (end_pfn > max_pfn)
-		end_pfn = max_pfn;
+	ret = page_idle_get_frames(pos, count, NULL, &pfn, &end_pfn);
+	if (ret)
+		return ret;
 
 	for (; pfn < end_pfn; pfn++) {
 		bit = pfn % BITMAP_CHUNK_BITS;
 		if ((*in >> bit) & 1) {
-			page = page_idle_get_page(pfn);
+			page = page_idle_get_page_pfn(pfn);
 			if (page) {
 				page_idle_clear_pte_refs(page);
 				set_page_idle(page);
@@ -224,6 +256,221 @@ struct page_ext_operations page_idle_ops = {
 };
 #endif
 
+/* page_idle tracking for /proc/<pid>/page_idle */
+
+struct page_node {
+	struct page *page;
+	unsigned long addr;
+	struct list_head list;
+};
+
+struct page_idle_proc_priv {
+	unsigned long start_addr;
+	char *buffer;
+	int write;
+
+	/* Pre-allocate and provide nodes to pte_page_idle_proc_add() */
+	struct page_node *page_nodes;
+	int cur_page_node;
+	struct list_head *idle_page_list;
+};
+
+/*
+ * Add page to list to be set as idle later.
+ */
+static void pte_page_idle_proc_add(struct page *page,
+				   unsigned long addr, struct mm_walk *walk)
+{
+	struct page *page_get = NULL;
+	struct page_node *pn;
+	int bit;
+	unsigned long frames;
+	struct page_idle_proc_priv *priv = walk->private;
+	u64 *chunk = (u64 *)priv->buffer;
+
+	if (priv->write) {
+		VM_BUG_ON(!page);
+
+		/* Find whether this page was asked to be marked */
+		frames = (addr - priv->start_addr) >> PAGE_SHIFT;
+		bit = frames % BITMAP_CHUNK_BITS;
+		chunk = &chunk[frames / BITMAP_CHUNK_BITS];
+		if (((*chunk >> bit) & 1) == 0)
+			return;
+	}
+
+	if (page) {
+		page_get = page_idle_get_page(page);
+		if (!page_get)
+			return;
+	}
+
+	/*
+	 * For all other pages, add it to a list since we have to walk rmap,
+	 * which acquires ptlock, and we cannot walk rmap right now.
+	 */
+	pn = &(priv->page_nodes[priv->cur_page_node++]);
+	pn->page = page_get;
+	pn->addr = addr;
+	list_add(&pn->list, priv->idle_page_list);
+}
+
+static int pte_page_idle_proc_range(pmd_t *pmd, unsigned long addr,
+				    unsigned long end,
+				    struct mm_walk *walk)
+{
+	pte_t *pte;
+	spinlock_t *ptl;
+	struct page *page;
+	struct vm_area_struct *vma = walk->vma;
+
+	ptl = pmd_trans_huge_lock(pmd, vma);
+	if (ptl) {
+		if (pmd_present(*pmd)) {
+			page = follow_trans_huge_pmd(vma, addr, pmd,
+						     FOLL_DUMP|FOLL_WRITE);
+			if (!IS_ERR_OR_NULL(page))
+				pte_page_idle_proc_add(page, addr, walk);
+		}
+		spin_unlock(ptl);
+		return 0;
+	}
+
+	if (pmd_trans_unstable(pmd))
+		return 0;
+
+	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+	for (; addr != end; pte++, addr += PAGE_SIZE) {
+		if (!pte_present(*pte))
+			continue;
+
+		page = vm_normal_page(vma, addr, *pte);
+		if (page)
+			pte_page_idle_proc_add(page, addr, walk);
+	}
+
+	pte_unmap_unlock(pte - 1, ptl);
+	return 0;
+}
+
+ssize_t page_idle_proc_generic(struct file *file, char __user *ubuff,
+			       size_t count, loff_t *pos, int write)
+{
+	int ret;
+	char *buffer;
+	u64 *out;
+	unsigned long start_addr, end_addr, start_frame, end_frame;
+	struct mm_struct *mm = file->private_data;
+	struct mm_walk walk = { .pmd_entry = pte_page_idle_proc_range, };
+	struct page_node *cur;
+	struct page_idle_proc_priv priv;
+	bool walk_error = false;
+	LIST_HEAD(idle_page_list);
+
+	if (!mm || !mmget_not_zero(mm))
+		return -EINVAL;
+
+	if (count > PAGE_SIZE)
+		count = PAGE_SIZE;
+
+	buffer = kzalloc(PAGE_SIZE, GFP_KERNEL);
+	if (!buffer) {
+		ret = -ENOMEM;
+		goto out_mmput;
+	}
+	out = (u64 *)buffer;
+
+	if (write && copy_from_user(buffer, ubuff, count)) {
+		ret = -EFAULT;
+		goto out;
+	}
+
+	ret = page_idle_get_frames(*pos, count, mm, &start_frame, &end_frame);
+	if (ret)
+		goto out;
+
+	start_addr = (start_frame << PAGE_SHIFT);
+	end_addr = (end_frame << PAGE_SHIFT);
+	priv.buffer = buffer;
+	priv.start_addr = start_addr;
+	priv.write = write;
+
+	priv.idle_page_list = &idle_page_list;
+	priv.cur_page_node = 0;
+	priv.page_nodes = kzalloc(sizeof(struct page_node) *
+				  (end_frame - start_frame), GFP_KERNEL);
+	if (!priv.page_nodes) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	walk.private = &priv;
+	walk.mm = mm;
+
+	down_read(&mm->mmap_sem);
+
+	/*
+	 * idle_page_list is needed because walk_page_vma() holds ptlock which
+	 * deadlocks with page_idle_clear_pte_refs(). So we have to collect all
+	 * pages first, and then call page_idle_clear_pte_refs().
+	 */
+	ret = walk_page_range(start_addr, end_addr, &walk);
+	if (ret)
+		walk_error = true;
+
+	list_for_each_entry(cur, &idle_page_list, list) {
+		int bit, index;
+		unsigned long off;
+		struct page *page = cur->page;
+
+		if (unlikely(walk_error))
+			goto remove_page;
+
+		if (write) {
+			if (page) {
+				page_idle_clear_pte_refs(page);
+				set_page_idle(page);
+			}
+		} else {
+			if (page_idle_pte_check(page)) {
+				off = ((cur->addr) >> PAGE_SHIFT) - start_frame;
+				bit = off % BITMAP_CHUNK_BITS;
+				index = off / BITMAP_CHUNK_BITS;
+				out[index] |= 1ULL << bit;
+			}
+		}
+remove_page:
+		if (page)
+			put_page(page);
+	}
+
+	if (!write && !walk_error)
+		ret = copy_to_user(ubuff, buffer, count);
+
+	up_read(&mm->mmap_sem);
+	kfree(priv.page_nodes);
+out:
+	kfree(buffer);
+out_mmput:
+	mmput(mm);
+	if (!ret)
+		ret = count;
+	return ret;
+
+}
+
+ssize_t page_idle_proc_read(struct file *file, char __user *ubuff,
+			    size_t count, loff_t *pos)
+{
+	return page_idle_proc_generic(file, ubuff, count, pos, 0);
+}
+
+ssize_t page_idle_proc_write(struct file *file, char __user *ubuff,
+			     size_t count, loff_t *pos)
+{
+	return page_idle_proc_generic(file, ubuff, count, pos, 1);
+}
+
 static int __init page_idle_init(void)
 {
 	int err;
The page_idle tracking feature currently requires looking up the pagemap for a process followed by interacting with /sys/kernel/mm/page_idle. Looking up the PFN from pagemap in Android devices is not supported by an unprivileged process; it requires SYS_ADMIN and otherwise gives 0 for the PFN.

This patch adds support to interact directly with page_idle tracking at the PID level by introducing a /proc/<pid>/page_idle file. It follows the exact same semantics as the global /sys/kernel/mm/page_idle, but looking up the PFN through pagemap is no longer needed since the interface uses virtual frame numbers, and at the same time it also does not require SYS_ADMIN.

In Android, we are using this for the heap profiler (heapprofd), which profiles and pinpoints code paths that allocate memory and leave it idle for long periods of time. This method solves the security issue of userspace learning the PFN, and is also shown to yield better results than the pagemap lookup, the theory being that the window in which the address space can change is reduced by eliminating the intermediate pagemap lookup stage. In virtual address indexing, the process's mmap_sem is held for the duration of the access.

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
v3->v4: Minor fixups (Minchan)
        Add swap pte handling (Konstantin, Minchan)

v2->v3: Fixed a bug where I was doing a kfree that is not needed due to
        not needing to do GFP_ATOMIC allocations.

v1->v2: Mark swap ptes as idle (Minchan)
        Avoid need for GFP_ATOMIC (Andrew)
        Get rid of idle_page_list lock by moving list to stack

Internal review -> v1: Fixes from Suren.
                       Corrections to change log, docs (Florian, Sandeep)

 fs/proc/base.c            |   3 +
 fs/proc/internal.h        |   1 +
 fs/proc/task_mmu.c        |  42 +++++
 include/linux/page_idle.h |   4 +
 mm/page_idle.c            | 337 +++++++++++++++++++++++++++++++++-----
 5 files changed, 342 insertions(+), 45 deletions(-)
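The pos/count validation shared by the sysfs and proc paths (page_idle_get_frames() in the diff) can be modeled in user space. This is a sketch: the function name and the int error-return style follow the patch, but the kernel types (loff_t, struct mm_struct) are replaced with plain C equivalents, and max_frame stands in for either task_size >> PAGE_SHIFT (proc file) or max_pfn (sysfs file).

```c
#include <errno.h>
#include <stddef.h>

#define CHUNK_SIZE 8       /* BITMAP_CHUNK_SIZE: sizeof(u64) */
#define BITS_PER_BYTE_ 8

/* Map a byte range [pos, pos+count) in the bitmap file to a frame
 * range [*start, *end), clamped to max_frame. Both pos and count must
 * be whole u64 chunks, mirroring the kernel's checks. */
static int get_frames(long pos, size_t count, unsigned long max_frame,
                      unsigned long *start, unsigned long *end)
{
        if (pos % CHUNK_SIZE || count % CHUNK_SIZE)
                return -EINVAL;

        *start = (unsigned long)pos * BITS_PER_BYTE_;
        if (*start >= max_frame)
                return -ENXIO;

        *end = *start + count * BITS_PER_BYTE_;
        if (*end > max_frame)
                *end = max_frame;
        return 0;
}
```

Note the asymmetry in the patch: the read path translates -ENXIO into a 0-byte read (reads beyond the end do nothing), while the write path returns the error to the caller.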