Message ID | alpine.DEB.2.00.1105091246550.32299@artax.karlin.mff.cuni.cz (mailing list archive)
---|---
State | Awaiting Upstream
On 9.5.2011 13:01, Mikulas Patocka wrote:
>
> On Sun, 8 May 2011, Linus Torvalds wrote:
>
>> On Sun, May 8, 2011 at 11:55 AM, Mikulas Patocka
>> <mikulas@artax.karlin.mff.cuni.cz> wrote:
>>>
>>> This patch fixes lvm2 on PA-RISC (and possibly other architectures with
>>> an up-growing stack). lvm2 calculates the number of used pages when locking
>>> and when unlocking and reports an internal error if the numbers mismatch.
>>
>> This patch won't apply on current kernels (including stable) because
>> of commit a1fde08c74e9, which changed the test of "pages" to instead
>> just test "flags & FOLL_MLOCK".
>>
>> That should be trivial to fix up.
>>
>> However, I really don't much like this complex test:
>>
>>>  static inline int stack_guard_page(struct vm_area_struct *vma, unsigned long addr)
>>>  {
>>> -        return (vma->vm_flags & VM_GROWSDOWN) &&
>>> +        return ((vma->vm_flags & VM_GROWSDOWN) &&
>>>                  (vma->vm_start == addr) &&
>>> -                !vma_stack_continue(vma->vm_prev, addr);
>>> +                !vma_stack_continue(vma->vm_prev, addr)) ||
>>> +               ((vma->vm_flags & VM_GROWSUP) &&
>>> +                (vma->vm_end == addr + PAGE_SIZE) &&
>>> +                !vma_stack_growsup_continue(vma->vm_next, addr + PAGE_SIZE));
>>>  }
>>
>> in that format. It gets really hard to read, and I think you'd be
>> better off writing it as two helper functions (or macros) for the two
>> cases, and then have
>>
>>   static inline int stack_guard_page(struct vm_area_struct *vma,
>>                                      unsigned long addr)
>>   {
>>     return stack_guard_page_growsdown(vma, addr) ||
>>            stack_guard_page_growsup(vma, addr);
>>   }
>>
>> I'd also like to verify that it doesn't actually generate any extra
>> code for the common case (iirc VM_GROWSUP is 0 for the architectures
>> that don't need it, and so the compiler shouldn't generate any extra
>> code, but I'd like that mentioned and verified explicitly).
>>
>> Hmm?
>>
>> Other than that it looks ok to me.
>>
>> That said, could we please fix LVM to not do that crazy sh*t in the
>> first place? The STACK_GROWSUP case is never going to have a lot of
>> testing, this is just sad.
>
> LVM reads the process maps from /proc/self/maps and locks them with mlock.
>
> Why doesn't it use mlockall()? Because glibc maps all locales into the
> process. Glibc packs all locales into a 100MB file and maps that file into
> every process. Even if the process uses just one locale, glibc maps them all.
>
> So, when LVM used mlockall, it consumed >100MB of memory and that caused
> out-of-memory problems in system installers.
>
> So, an alternate way of locking was added to LVM --- read all maps and lock
> them, except for the glibc locale file.
>
> The real fix would be to fix glibc not to map 100MB into every process.

I should probably add a few words here.

Glibc knows a few other ways around this - it can work with one locale file
per language, or even without mmap, allocating the locales in memory. It
usually depends on the distribution:

- Fedora decided to combine all locales into one huge file (>100MB)
- Ubuntu/Debian mmap each locale individually (usually ~MB)

LVM supports both ways - the user may either select in lvm.conf to always use
mlockall, or switch to mlock-based locking of individual memory areas, where
memory regions that cannot be executed during the suspend state and cannot
cause a memory deadlock are left unlocked. As a 'bonus', this is used
internally for tracking algorithmic bugs.
Zdenek
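A minimal sketch of the selective locking described above (not the actual
lvm2 code; the "locale-archive" substring check is only an illustrative
stand-in for however the glibc locale mapping would be recognised):

  #include <stdio.h>
  #include <string.h>
  #include <sys/mman.h>

  /* Walk /proc/self/maps and mlock() every mapping except the glibc
   * locale archive (and anything else deemed safe to leave unlocked). */
  static int lock_own_maps(void)
  {
          FILE *f = fopen("/proc/self/maps", "r");
          char line[512];

          if (!f)
                  return -1;

          while (fgets(line, sizeof(line), f)) {
                  unsigned long start, end;

                  if (sscanf(line, "%lx-%lx", &start, &end) != 2)
                          continue;
                  if (strstr(line, "locale-archive"))
                          continue;       /* skip the big locale mapping */
                  if (mlock((void *)start, end - start))
                          perror("mlock");        /* e.g. a guard page */
          }

          fclose(f);
          return 0;
  }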
On Mon, May 09, 2011 at 01:43:59PM +0200, Zdenek Kabelac wrote:
> On 9.5.2011 13:01, Mikulas Patocka wrote:
> > So, when LVM used mlockall, it consumed >100MB of memory and that caused
> > out-of-memory problems in system installers.

The way glibc and some distros have implemented locales makes mlockall()
quite useless for many purposes now in practice, IMHO.

We discussed the options at the time and wrote the extra code to implement
the only solution we were offered:

(1) there was no possibility of some distros not using a single locale
database (reasons: optimisation and packaging convenience);

(2) there was no possibility of glibc providing a mechanism to unmap the
locale database (if you reset your locale to the default, glibc will not
unmap the file, and indeed doesn't rule out doing something similar with
other files in future - these are glibc internals and it's none of our
business).

Alasdair
On Mon, May 09, 2011 at 01:43:59PM +0200, Zdenek Kabelac wrote:
> > Why doesn't it use mlockall()? Because glibc maps all locales into the
> > process. Glibc packs all locales into a 100MB file and maps that file into
> > every process. Even if the process uses just one locale, glibc maps them all.
> >
> > So, when LVM used mlockall, it consumed >100MB of memory and that caused
> > out-of-memory problems in system installers.
> >
> > So, an alternate way of locking was added to LVM --- read all maps and lock
> > them, except for the glibc locale file.
> >
> > The real fix would be to fix glibc not to map 100MB into every process.
>
> I should probably add a few words here.
>
> Glibc knows a few other ways around this - it can work with one locale file
> per language, or even without mmap, allocating the locales in memory. It
> usually depends on the distribution:
>
> - Fedora decided to combine all locales into one huge file (>100MB)
> - Ubuntu/Debian mmap each locale individually (usually ~MB)

Sounds to me like glibc should introduce an mlockmost() call that does all
the work for you ...
On Mon, May 9, 2011 at 3:45 PM, Matthew Wilcox <matthew@wil.cx> wrote:
>
> Sounds to me like glibc should introduce an mlockmost() call that does all
> the work for you ...

That sounds like the worst of all worlds. Nobody will use it, now there's
two places to break, and how do you describe what you don't want mlocked?

I dunno. There are so few applications that use mlock at *all* that I wonder
if it's even worth worrying about. And this one case we've hopefully now
fixed anyway.

But if people really want to fix mlockall(), I'd suggest some way of just
marking certain mappings as "sparse mappings", and then we can just teach
mlockall to not lock them. And then glibc could just mark its locale archive
that way - and maybe others would mark other mappings.

We could still make such a "sparse mapping" act as locked for the actual
pages that are mapped into it - so it would have some kind of real semantics.
You could do a "mlockall(..)" on the process, and then touch the sparse pages
you know you want, and they'd be guaranteed to be mapped after that.

We might even have a "mlockall(MCL_SPARSE)" flag that does that for
everything - basically guarantee that "whatever is mapped in now will remain
mapped in, but I won't bother paging it in just because you ask me to do a
mlockall".

So there are sane semantics for the concept, and it would be easy to do in
the kernel. Whether it's worth doing, I dunno.

Linus
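To make the proposed semantics concrete, a caller might look like the sketch
below. MCL_SPARSE is only the flag suggested in this mail - it does not exist
in any kernel, and the value defined here is made up purely for illustration:

  #include <string.h>
  #include <sys/mman.h>

  #ifndef MCL_SPARSE
  #define MCL_SPARSE 4    /* made-up value; the flag is hypothetical */
  #endif

  /* Hypothetical use of the proposed flag: lock whatever is already
   * resident, then explicitly touch the few pages we know we will need
   * so they are guaranteed to stay mapped in. */
  static int lock_resident_and_touch(unsigned char *key, size_t key_len)
  {
          if (mlockall(MCL_CURRENT | MCL_SPARSE))
                  return -1;

          memset(key, 0xff, key_len);     /* fault in and pin the key pages */
          return 0;
  }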
On Mon, May 09, 2011 at 03:56:11PM -0700, Linus Torvalds wrote:
> So there are sane semantics for the concept, and it would be easy to
> do in the kernel. Whether it's worth doing, I dunno.

At this point we have a workaround that seems to work for us. I find it ugly
and it has needed some tweaking already, but we can cope.

If others find similar problems to ours and start replicating the logic we're
using, then that would be the time to give serious thought to a clean
'sparse' extension. (Marking individual mappings as MAP_SPARSE and adding an
MCL_NO_SPARSE option to ignore them sounds most promising to me.)

(What other software packages make use of mlockall() and under what
circumstances?)

Alasdair
On 05/11/2011 12:57 AM, Alasdair G Kergon wrote:
> (What other software packages make use of mlockall() and under what
> circumstances?)

Another one is cryptsetup, for commands which manipulate keys in memory.

(Users of the libcryptsetup library are not forced to lock memory, it is an
optional call. But cryptsetup itself, as a libcryptsetup user, always locks
memory.)

And I am not happy with mlockall() either, but the lvm2 workaround is quite
complicated. Basically cryptsetup wants to lock memory holding explicitly
allocated keys (this could be rewritten to use a specific locked page), but
it also needs to lock various libdevmapper buffers when issuing the dmcrypt
configuration ioctl (the mapping table and messages contain the key).

So that's why mlockall(MCL_CURRENT|MCL_FUTURE) was the simplest way (and no
problems have been reported yet).

(Not that it is perfect, but it is better than nothing... Of course, more
important is to wipe the memory holding keys after use.)

Milan
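A rough sketch of the pattern described here, using the standard
<sys/mman.h> interface; the key handling is illustrative only and not the
actual cryptsetup code:

  #include <stdlib.h>
  #include <string.h>
  #include <sys/mman.h>

  /* Lock everything, use the key, wipe it, then unlock again. */
  static int with_locked_key(void (*use_key)(unsigned char *, size_t),
                             size_t key_len)
  {
          unsigned char *key;
          int ret = -1;

          /* Current and future mappings stay resident, so the key and
           * any libdevmapper ioctl buffers never hit swap. */
          if (mlockall(MCL_CURRENT | MCL_FUTURE))
                  return -1;

          key = malloc(key_len);
          if (key) {
                  use_key(key, key_len);
                  memset(key, 0, key_len);        /* wipe before freeing */
                  free(key);
                  ret = 0;
          }

          munlockall();
          return ret;
  }

(In real code the wipe would need something the compiler cannot optimise
away, e.g. explicit_bzero() where available.)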
On Wed, May 11, 2011 at 1:42 AM, Milan Broz <mbroz@redhat.com> wrote:
>
> Another one is cryptsetup [..]

Quite frankly, all security-related uses should always be happy with a
"MCL_SPARSE" model, since there is no point in ever bringing in pages that
haven't been used. The whole (and only) point of mlock[all]() for them is the
"avoid being pushed to disk" issue.

I do wonder if we really should ever do the page-in at all. We might simply
be better off always just saying "we'll lock pages you've touched, that's
it".

Linus
On 12.5.2011 04:12, Linus Torvalds wrote:
> On Wed, May 11, 2011 at 1:42 AM, Milan Broz <mbroz@redhat.com> wrote:
>>
>> Another one is cryptsetup [..]
>
> Quite frankly, all security-related uses should always be happy with a
> "MCL_SPARSE" model, since there is no point in ever bringing in pages that
> haven't been used. The whole (and only) point of mlock[all]() for them is
> the "avoid being pushed to disk" issue.
>
> I do wonder if we really should ever do the page-in at all. We might simply
> be better off always just saying "we'll lock pages you've touched, that's
> it".
>

For LVM we need to ensure that any code which might be executed during the
disk suspend state is paged in and locked - so we would need MCL_SPARSE only
for a few selected 'unneeded' libraries, as we are obviously not really able
to tell which parts of glibc might be needed across all code paths (though I
guess we may find some limits).

But if we are sure that some libraries and locale files will never be used
during the suspend state, we do not care about those pages at all.

So it's not that we would always want only MCL_SPARSE all the time - we would
probably need some control to switch e.g. glibc into MCL_ALL.

Zdenek
Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h	2011-05-09 12:33:50.000000000 +0200
+++ linux-2.6/include/linux/mm.h	2011-05-09 12:42:05.000000000 +0200
@@ -1011,11 +1011,19 @@ int set_page_dirty_lock(struct page *pag
 int clear_page_dirty_for_io(struct page *page);
 
 /* Is the vma a continuation of the stack vma above it? */
-static inline int vma_stack_continue(struct vm_area_struct *vma, unsigned long addr)
+static inline int vma_stack_growsdown_continue(struct vm_area_struct *vma,
+						unsigned long addr)
 {
 	return vma && (vma->vm_end == addr) && (vma->vm_flags & VM_GROWSDOWN);
 }
 
+/* Is the vma a continuation of the stack vma below it? */
+static inline int vma_stack_growsup_continue(struct vm_area_struct *vma,
+					     unsigned long addr)
+{
+	return vma && (vma->vm_start == addr) && (vma->vm_flags & VM_GROWSUP);
+}
+
 extern unsigned long move_page_tables(struct vm_area_struct *vma,
 		unsigned long old_addr, struct vm_area_struct *new_vma,
 		unsigned long new_addr, unsigned long len);
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c	2011-05-09 12:33:51.000000000 +0200
+++ linux-2.6/mm/memory.c	2011-05-09 12:41:38.000000000 +0200
@@ -1410,11 +1410,21 @@ no_page_table:
 	return page;
 }
 
-static inline int stack_guard_page(struct vm_area_struct *vma, unsigned long addr)
+static inline int stack_guard_page_growsdown(struct vm_area_struct *vma,
+					     unsigned long addr)
 {
 	return (vma->vm_flags & VM_GROWSDOWN) &&
 		(vma->vm_start == addr) &&
-		!vma_stack_continue(vma->vm_prev, addr);
+		!vma_stack_growsdown_continue(vma->vm_prev, addr);
+}
+
+
+static inline int stack_guard_page_growsup(struct vm_area_struct *vma,
+					   unsigned long addr)
+{
+	return (vma->vm_flags & VM_GROWSUP) &&
+		(vma->vm_end == addr + PAGE_SIZE) &&
+		!vma_stack_growsup_continue(vma->vm_next, addr + PAGE_SIZE);
 }
 
 /**
@@ -1554,13 +1564,18 @@ int __get_user_pages(struct task_struct
 		/*
 		 * For mlock, just skip the stack guard page.
 		 */
-		if ((gup_flags & FOLL_MLOCK) && stack_guard_page(vma, start))
+		if ((gup_flags & FOLL_MLOCK) &&
+		    stack_guard_page_growsdown(vma, start))
 			goto next_page;
 
 		do {
 			struct page *page;
 			unsigned int foll_flags = gup_flags;
 
+			if ((gup_flags & FOLL_MLOCK) &&
+			    stack_guard_page_growsup(vma, start))
+				goto next_page;
+
 			/*
 			 * If we have a pending SIGKILL, don't keep faulting
 			 * pages and potentially allocating memory.
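For reference, the combined wrapper Linus suggested earlier in the thread
would simply OR the two new helpers together; it is not part of the patch
above, which calls the helpers directly from __get_user_pages():

  static inline int stack_guard_page(struct vm_area_struct *vma,
                                     unsigned long addr)
  {
          return stack_guard_page_growsdown(vma, addr) ||
                 stack_guard_page_growsup(vma, addr);
  }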