| Message ID | 20200306132537.783769-3-imbrenda@linux.ibm.com (mailing list archive) |
|---|---|
| State | New, archived |
| Series | add callbacks for inaccessible pages |
On 3/6/20 5:25 AM, Claudio Imbrenda wrote: > On s390x the function is not supposed to fail, so it is ok to use a > WARN_ON on failure. If we ever need some more finegrained handling > we can tackle this when we know the details. Could you explain a bit why the function can't fail? If the guest has secret data in the page, then it *can* and does fail. It won't fail, though, if the host and guest agree on whether the page is protected. Right? > @@ -2807,6 +2807,13 @@ int __test_set_page_writeback(struct page *page, bool keep_write) > inc_zone_page_state(page, NR_ZONE_WRITE_PENDING); > } > unlock_page_memcg(page); > + access_ret = arch_make_page_accessible(page); > + /* > + * If writeback has been triggered on a page that cannot be made > + * accessible, it is too late to recover here. > + */ > + VM_BUG_ON_PAGE(access_ret != 0, page); > + > return ret; > > } This seems like a really odd place to do this. Writeback is specific to block I/O. I would have thought there were other kinds of devices that matter, not just block devices. Also, it seems odd that this patch only does the arch_make_page_accessible() half. Where's the other half where the page is made inaccessible? I assume it's OK to "leak" things like this, it's just not clear to me _why_ it's OK.
On Mon, 13 Apr 2020 13:22:24 -0700 Dave Hansen <dave.hansen@intel.com> wrote: > On 3/6/20 5:25 AM, Claudio Imbrenda wrote: > > On s390x the function is not supposed to fail, so it is ok to use a > > WARN_ON on failure. If we ever need some more finegrained handling > > we can tackle this when we know the details. > > Could you explain a bit why the function can't fail? the concept of "making accessible" is only to make sure that accessing the page will not trigger faults or I/O or DMA errors. in general it does not mean freely accessing the content of the page in cleartext. on s390x, protected guest pages can be shared. the guest has to actively share its pages, and in that case those pages are both part of the protected VM and freely accessible by the host. pages that are not shared cannot be accessed by the host. in our case "making the page accessible" means: - if the page was shared, make sure it stays shared - if the page was not shared, first encrypt it and then make it accessible to the host (both operations performed securely and atomically by the hardware) then the page can be swapped out, or used for direct I/O (obviously if you do I/O on a page that was not shared, you cannot expect good things to happen, since you basically corrupt the memory of the guest). on s390x performing I/O directly on protected pages results in (in practice) unrecoverable I/O errors, so we want to avoid it at all costs. accessing protected pages from the CPU triggers an exception that can be handled (and we do handle it, in fact) now imagine a buggy or malicious qemu process crashing the whole machine just because it did I/O to/from a protected page. we clearly don't want that. > If the guest has secret data in the page, then it *can* and does fail. no, that's the whole point of this mechanism. in fact, most of the guest pages will be "secret data", only the few pages used for guest I/O bounce buffers will be shared with the host > It won't fail, though, if the host and guest agree on whether the page > is protected. > > Right? > > > @@ -2807,6 +2807,13 @@ int __test_set_page_writeback(struct page > > *page, bool keep_write) inc_zone_page_state(page, > > NR_ZONE_WRITE_PENDING); } > > unlock_page_memcg(page); > > + access_ret = arch_make_page_accessible(page); > > + /* > > + * If writeback has been triggered on a page that cannot > > be made > > + * accessible, it is too late to recover here. > > + */ > > + VM_BUG_ON_PAGE(access_ret != 0, page); > > + > > return ret; > > > > } > > This seems like a really odd place to do this. Writeback is specific > to block I/O. I would have thought there were other kinds of devices > that matter, not just block devices. well, yes and no. for writeback (block I/O and swap) this is the right place. at this point we know that the page is present and nobody else has started doing I/O yet, and I/O will happen soon-ish. so we make the page accessible. there is no turning back here, unlike pinning. we are not allowed to fail, we can't regarding the other kinds of devices: yes, they will use pinning, which is covered by the rest of the patch. the semantics of get page and pin page (if the documentation has not changed meanwhile) is that the traditional get_page is used for when the page is needed but not its content, and pin_page is used when the content of the page is accessed. since the only issue here is accessing the content of the page, we don't need to make it accessible for get_page, but only for pin_page. 
get_page and pin_page are allowed to fail, so in this case we return an error code, so other architectures can potentially abort the pinning if needed. on s390x we will never fail, for the same reasons written above. > Also, this patch seems odd that it only does the > arch_make_page_accessible() half. Where's the other half where the > page is made inaccessible? that is very arch-specific. for s390x, you can look at this patch and the ones immediately before/after: 214d9bbcd3a67230b932f6ce > I assume it's OK to "leak" things like this, it's just not clear to me > _why_ it's OK. nothing is being leaked :) I hope I clarified a little how this works on s390x :) feel free to poke me again if some things are still unclear best regards, Claudio Imbrenda
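To make the caller-side contract concrete — pinning may fail, and with this series a failing arch_make_page_accessible() surfaces as an error from the pin call — here is a minimal, hypothetical driver-side sketch (not from the series; example_pin_for_io and the I/O placeholder are invented):

/*
 * Hypothetical sketch: a driver pins a user page before doing I/O on its
 * contents.  With this series applied, the FOLL_PIN paths in gup call
 * arch_make_page_accessible(), so a failure there comes back to the
 * caller as a negative return value that must be handled.
 */
#include <linux/mm.h>

static int example_pin_for_io(unsigned long uaddr, struct page **page)
{
	int pinned;

	pinned = pin_user_pages_fast(uaddr, 1, FOLL_WRITE, page);
	if (pinned < 0)
		return pinned;	/* includes arch_make_page_accessible() errors */
	if (pinned != 1)
		return -EFAULT;

	/* ... submit and wait for I/O against *page here ... */

	unpin_user_page(*page);	/* release the pin once the I/O is done */
	return 0;
}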
On 4/14/20 9:03 AM, Claudio Imbrenda wrote: > On Mon, 13 Apr 2020 13:22:24 -0700 > Dave Hansen <dave.hansen@intel.com> wrote: > >> On 3/6/20 5:25 AM, Claudio Imbrenda wrote: >>> On s390x the function is not supposed to fail, so it is ok to use a >>> WARN_ON on failure. If we ever need some more finegrained handling >>> we can tackle this when we know the details. >> >> Could you explain a bit why the function can't fail? > > the concept of "making accessible" is only to make sure that accessing > the page will not trigger faults or I/O or DMA errors. in general it > does not mean freely accessing the content of the page in cleartext. > > on s390x, protected guest pages can be shared. the guest has to > actively share its pages, and in that case those pages are both part of > the protected VM and freely accessible by the host. Oh, that's interesting. It sounds like there are three separate concepts: 1. Protection 2. Sharing 3. Accessibility Protected pages may be shared at the request of the guest. Shared pages' plaintext can be accessed by the host. For unshared pages, the host can only see ciphertext. I wonder if Documentation/virt/kvm/s390-pv.rst can be beefed up with some of this information. It seems a bit sparse on this topic. As it stands, if I were modifying generic code, I don't think I'd have even a chance of getting an arch_make_page_accessible() in the right spot. > in our case "making the page accessible" means: ... > - if the page was not shared, first encrypt it and then make it > accessible to the host (both operations performed securely and > atomically by the hardware) What happens to the guest's view of the page when this happens? Does it keep seeing plaintext? > then the page can be swapped out, or used for direct I/O (obviously if > you do I/O on a page that was not shared, you cannot expect good > things to happen, since you basically corrupt the memory of the guest). So why even allow access to the encrypted contents if the host can't do anything useful with it? Is there some reason for going to the trouble of encrypting it and exposing it to the host? > on s390x performing I/O directly on protected pages results in (in > practice) unrecoverable I/O errors, so we want to avoid it at all costs. This is understandable, but we usually steer I/O operations in places like the DMA API, not in the core VM. We *have* the concept of pages to which I/O can't be done. There are plenty of crippled devices where we have to bounce data into a low buffer before it can go where we really want it to. I think the AMD SEV patches do this, for instance. > accessing protected pages from the CPU triggers an exception that can > be handled (and we do handle it, in fact) > > now imagine a buggy or malicious qemu process crashing the whole machine > just because it did I/O to/from a protected page. we clearly don't want > that. Is DMA disallowed to *all* protected pages? Even pages which the guest has explicitly shared with the host? >>> @@ -2807,6 +2807,13 @@ int __test_set_page_writeback(struct page >>> *page, bool keep_write) inc_zone_page_state(page, >>> NR_ZONE_WRITE_PENDING); } >>> unlock_page_memcg(page); >>> + access_ret = arch_make_page_accessible(page); >>> + /* >>> + * If writeback has been triggered on a page that cannot >>> be made >>> + * accessible, it is too late to recover here. >>> + */ >>> + VM_BUG_ON_PAGE(access_ret != 0, page); >>> + >>> return ret; >>> >>> } >> >> This seems like a really odd place to do this. Writeback is specific >> to block I/O.
I would have thought there were other kinds of devices >> that matter, not just block devices. > > well, yes and no. for writeback (block I/O and swap) this is the right > place. at this point we know that the page is present and nobody else > has started doing I/O yet, and I/O will happen soon-ish. so we make the > page accessible. there is no turning back here, unlike pinning. we > are not allowed to fail, we can't This description sounds really incomplete to me. Not all swap involves device I/O. For instance, zswap doesn't involve any devices. Would zswap need this hook?
On Tue, 14 Apr 2020 11:50:16 -0700 Dave Hansen <dave.hansen@intel.com> wrote: > On 4/14/20 9:03 AM, Claudio Imbrenda wrote: > > On Mon, 13 Apr 2020 13:22:24 -0700 > > Dave Hansen <dave.hansen@intel.com> wrote: > > > >> On 3/6/20 5:25 AM, Claudio Imbrenda wrote: > >>> On s390x the function is not supposed to fail, so it is ok to use > >>> a WARN_ON on failure. If we ever need some more finegrained > >>> handling we can tackle this when we know the details. > >> > >> Could you explain a bit why the function can't fail? > > > > the concept of "making accessible" is only to make sure that > > accessing the page will not trigger faults or I/O or DMA errors. in > > general it does not mean freely accessing the content of the page > > in cleartext. > > > > on s390x, protected guest pages can be shared. the guest has to > > actively share its pages, and in that case those pages are both > > part of the protected VM and freely accessible by the host. > > Oh, that's interesting. > > It sounds like there are three separate concepts: > 1. Protection > 2. Sharing > 3. Accessibility > > Protected pages may be shared and the request of the guest. > Shared pages' plaintext can be accessed by the host. For unshared > pages, the host can only see ciphertext. > > I wonder if Documentation/virt/kvm/s390-pv.rst can be beefed up with > some of this information. It seems a bit sparse on this topic. that is definitely something that can be fixed. I will improve the documentation and make sure it properly explains all the details of how protected VMs work on s390x. > As it stands, if I were modifying generic code, I don't think I'd have > even a chance of getting an arch_make_page_accessible() in the right > spot. > > > in our case "making the page accessible" means: > ... > > - if the page was not shared, first encrypt it and then make it > > accessible to the host (both operations performed securely and > > atomically by the hardware) > > What happens to the guest's view of the page when this happens? Does > it keep seeing plaintext? > > > then the page can be swapped out, or used for direct I/O (obviously > > if you do I/O on a page that was not shared, you cannot expect good > > things to happen, since you basically corrupt the memory of the > > guest). > > So why even allow access to the encrypted contents if the host can't > do anything useful with it? Is there some reason for going to the > trouble of encrypting it and exposing it to the host? you should not overwrite it, but you can/should write it out verbatim, e.g. for swap > > on s390x performing I/O directly on protected pages results in (in > > practice) unrecoverable I/O errors, so we want to avoid it at all > > costs. > > This is understandable, but we usually steer I/O operations in places > like the DMA API, not in the core VM. > > We *have* the concept of pages to which I/O can't be done. There are > plenty of crippled devices where we have to bounce data into a low > buffer before it can go where we really want it to. I think the AMD > SEV patches do this, for instance. > > > accessing protected pages from the CPU triggers an exception that > > can be handled (and we do handle it, in fact) > > > > now imagine a buggy or malicious qemu process crashing the whole > > machine just because it did I/O to/from a protected page. we > > clearly don't want that. > > Is DMA disallowed to *all* protected pages? Even pages which the > guest has explicitly shared with the host? 
> > > >>> @@ -2807,6 +2807,13 @@ int __test_set_page_writeback(struct page > >>> *page, bool keep_write) inc_zone_page_state(page, > >>> NR_ZONE_WRITE_PENDING); } > >>> unlock_page_memcg(page); > >>> + access_ret = arch_make_page_accessible(page); > >>> + /* > >>> + * If writeback has been triggered on a page that cannot > >>> be made > >>> + * accessible, it is too late to recover here. > >>> + */ > >>> + VM_BUG_ON_PAGE(access_ret != 0, page); > >>> + > >>> return ret; > >>> > >>> } > >> > >> This seems like a really odd place to do this. Writeback is > >> specific to block I/O. I would have thought there were other > >> kinds of devices that matter, not just block devices. > > > > well, yes and no. for writeback (block I/O and swap) this is the > > right place. at this point we know that the page is present and > > nobody else has started doing I/O yet, and I/O will happen > > soon-ish. so we make the page accessible. there is no turning back > > here, unlike pinning. we are not allowed to fail, we can't > > This description sounds really incomplete to me. > > Not all swap involved device I/O. For instance, zswap doesn't involve > any devices. Would zswap need this hook? please feel free to write to me privately if you have any further questions or doubts :) best regards, Claudio Imbrenda
On 4/15/20 11:26 AM, Claudio Imbrenda wrote: > On Tue, 14 Apr 2020 11:50:16 -0700 > Dave Hansen <dave.hansen@intel.com> wrote: > >> On 4/14/20 9:03 AM, Claudio Imbrenda wrote: >>> On Mon, 13 Apr 2020 13:22:24 -0700 >>> Dave Hansen <dave.hansen@intel.com> wrote: >>> >>>> On 3/6/20 5:25 AM, Claudio Imbrenda wrote: >>>>> On s390x the function is not supposed to fail, so it is ok to use >>>>> a WARN_ON on failure. If we ever need some more finegrained >>>>> handling we can tackle this when we know the details. >>>> >>>> Could you explain a bit why the function can't fail? >>> >>> the concept of "making accessible" is only to make sure that >>> accessing the page will not trigger faults or I/O or DMA errors. in >>> general it does not mean freely accessing the content of the page >>> in cleartext. >>> >>> on s390x, protected guest pages can be shared. the guest has to >>> actively share its pages, and in that case those pages are both >>> part of the protected VM and freely accessible by the host. >> >> Oh, that's interesting. >> >> It sounds like there are three separate concepts: >> 1. Protection >> 2. Sharing >> 3. Accessibility >> >> Protected pages may be shared and the request of the guest. >> Shared pages' plaintext can be accessed by the host. For unshared >> pages, the host can only see ciphertext. >> >> I wonder if Documentation/virt/kvm/s390-pv.rst can be beefed up with >> some of this information. It seems a bit sparse on this topic. > > that is definitely something that can be fixed. > > I will improve the documentation and make sure it properly explains > all the details of how protected VMs work on s390x. I'd also definitely appreciate more people looking over that document and adding things I forgot ;-) > >> As it stands, if I were modifying generic code, I don't think I'd have >> even a chance of getting an arch_make_page_accessible() in the right >> spot. >> >>> in our case "making the page accessible" means: >> ... >>> - if the page was not shared, first encrypt it and then make it >>> accessible to the host (both operations performed securely and >>> atomically by the hardware) >> >> What happens to the guest's view of the page when this happens? Does >> it keep seeing plaintext? >> >>> then the page can be swapped out, or used for direct I/O (obviously >>> if you do I/O on a page that was not shared, you cannot expect good >>> things to happen, since you basically corrupt the memory of the >>> guest). >> >> So why even allow access to the encrypted contents if the host can't >> do anything useful with it? Is there some reason for going to the >> trouble of encrypting it and exposing it to the host? > > you should not overwrite it, but you can/should write it out verbatim, > e.g. for swap > >>> on s390x performing I/O directly on protected pages results in (in >>> practice) unrecoverable I/O errors, so we want to avoid it at all >>> costs. >> >> This is understandable, but we usually steer I/O operations in places >> like the DMA API, not in the core VM. >> >> We *have* the concept of pages to which I/O can't be done. There are >> plenty of crippled devices where we have to bounce data into a low >> buffer before it can go where we really want it to. I think the AMD >> SEV patches do this, for instance. >> >>> accessing protected pages from the CPU triggers an exception that >>> can be handled (and we do handle it, in fact) >>> >>> now imagine a buggy or malicious qemu process crashing the whole >>> machine just because it did I/O to/from a protected page. 
we >>> clearly don't want that. >> >> Is DMA disallowed to *all* protected pages? Even pages which the >> guest has explicitly shared with the host? >> >> >>>>> @@ -2807,6 +2807,13 @@ int __test_set_page_writeback(struct page >>>>> *page, bool keep_write) inc_zone_page_state(page, >>>>> NR_ZONE_WRITE_PENDING); } >>>>> unlock_page_memcg(page); >>>>> + access_ret = arch_make_page_accessible(page); >>>>> + /* >>>>> + * If writeback has been triggered on a page that cannot >>>>> be made >>>>> + * accessible, it is too late to recover here. >>>>> + */ >>>>> + VM_BUG_ON_PAGE(access_ret != 0, page); >>>>> + >>>>> return ret; >>>>> >>>>> } >>>> >>>> This seems like a really odd place to do this. Writeback is >>>> specific to block I/O. I would have thought there were other >>>> kinds of devices that matter, not just block devices. >>> >>> well, yes and no. for writeback (block I/O and swap) this is the >>> right place. at this point we know that the page is present and >>> nobody else has started doing I/O yet, and I/O will happen >>> soon-ish. so we make the page accessible. there is no turning back >>> here, unlike pinning. we are not allowed to fail, we can't >> >> This description sounds really incomplete to me. >> >> Not all swap involved device I/O. For instance, zswap doesn't involve >> any devices. Would zswap need this hook? > > please feel free to write to me privately if you have any further > questions or doubts :) > > > best regards, > > Claudio Imbrenda >
On 3/6/20 5:25 AM, Claudio Imbrenda wrote: > + /* > + * We need to make the page accessible if and only if we are going > + * to access its content (the FOLL_PIN case). Please see > + * Documentation/core-api/pin_user_pages.rst for details. > + */ > + if (flags & FOLL_PIN) { > + ret = arch_make_page_accessible(page); > + if (ret) { > + unpin_user_page(page); > + page = ERR_PTR(ret); > + goto out; > + } > + } Thanks, Claudio, for a really thorough refresher on this in private mail. But, I think this mechanism probably hooks into the wrong place. I don't doubt that it *functions* on s390, but I think these calls are misplaced. I think the end result is that no other architecture will have a chance to use the same hooks. They're far too s390-specific even for a concept that's not limited to s390. get_user_pages(FOLL_PIN) does *not* mean "the kernel will access this page's contents". The kmap() family is really what we use for that. kmap()s are often *preceded* by get_user_pages(), which is probably why this works for you, though. Yes, the docs do say that FOLL_PIN is for accessing the pages. But, there's a crucial thing that it leaves out: *WHO* will be accessing the pages. For Direct IO, for instance, the CPU isn't touching the page at all. It's always a device. Also, crucially, the page contents are *not* accessible from the CPU's perspective after a gup. They're not accessible until a kmap(). They're also not even accessible for *devices* after a gup. There's a _separate_ mapping process that's requires to make them accessible to the CPU. > --- a/mm/page-writeback.c > +++ b/mm/page-writeback.c > @@ -2764,7 +2764,7 @@ int test_clear_page_writeback(struct page *page) > int __test_set_page_writeback(struct page *page, bool keep_write) > { > struct address_space *mapping = page_mapping(page); > - int ret; > + int ret, access_ret; > > lock_page_memcg(page); > if (mapping && mapping_use_writeback_tags(mapping)) { > @@ -2807,6 +2807,13 @@ int __test_set_page_writeback(struct page *page, bool keep_write) > inc_zone_page_state(page, NR_ZONE_WRITE_PENDING); > } > unlock_page_memcg(page); > + access_ret = arch_make_page_accessible(page); > + /* > + * If writeback has been triggered on a page that cannot be made > + * accessible, it is too late to recover here. > + */ > + VM_BUG_ON_PAGE(access_ret != 0, page); > + > return ret; > > } I think this one really shows the cracks in the approach. Pages being swapped *don't* have get_user_pages() done on them since we've already got the physical page at the time writeback and aren't looking at PTEs. They're read by I/O devices sending them out to storage, but also by the CPU if you're doing something like zswap. But, again, critically, accessing page contents won't be done until kmap(). I suspect you saw crashes underneath __swap_writepage()->submit_bio() and looked a few lines up to the set_page_writeback() and decided to hook in there. I think a better spot, again, is to hook into kmap() which is called in the block layer. Why do I care? I was looking at AMD's SEV (Secure Encrypted Virtualization) code which is in the kernel which shares some implementation details with the not-in-the-tree Intel MKTME. SEV currently has a concept of guest pages being encrypted and being gibberish to the host, plus a handshake to share guest-selected pages. Some of the side-effects of exposing the gibberish to the host aren't great (I think it can break cache coherency if a stray write occurs) and it would be nice to get better behavior. 
But, to get better behavior, the host kernel might need to remove pages from its direct map, making them inaccessible. I was hoping to reuse arch_make_page_accessible() for obvious reasons. But, get_user_pages() is not the right spot to map pages because they might not *ever* be accessed by the CPU, only devices. Anyway, I know it's late feedback, but I'd hate to have core code like this that has no hope of ever getting reused.
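For comparison, a rough sketch of the kmap()-time hook being argued for here could look like the following; the wrapper name is invented and this is not something the series implements — it mostly illustrates the constraint that kmap() itself has no way to report failure:

/*
 * Hypothetical alternative (NOT part of this series): make the page
 * accessible right before the CPU maps and touches its contents.
 * Since kmap() cannot return an error, a failure here could only be
 * warned about, not propagated to the caller.
 */
#include <linux/highmem.h>
#include <linux/gfp.h>

static inline void *kmap_make_accessible(struct page *page)
{
	WARN_ON_ONCE(arch_make_page_accessible(page));
	return kmap(page);
}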
On Wed, Apr 15, 2020 at 02:52:31PM -0700, Dave Hansen wrote: > On 3/6/20 5:25 AM, Claudio Imbrenda wrote: > > + /* > > + * We need to make the page accessible if and only if we are going > > + * to access its content (the FOLL_PIN case). Please see > > + * Documentation/core-api/pin_user_pages.rst for details. > > + */ > > + if (flags & FOLL_PIN) { > > + ret = arch_make_page_accessible(page); > > + if (ret) { > > + unpin_user_page(page); > > + page = ERR_PTR(ret); > > + goto out; > > + } > > + } > > Thanks, Claudio, for a really thorough refresher on this in private mail. > > But, I think this mechanism probably hooks into the wrong place. I > don't doubt that it *functions* on s390, but I think these calls are > misplaced. I think the end result is that no other architecture will > have a chance to use the same hooks. They're far too s390-specific even > for a concept that's not limited to s390. > > get_user_pages(FOLL_PIN) does *not* mean "the kernel will access this > page's contents". The kmap() family is really what we use for that. > kmap()s are often *preceded* by get_user_pages(), which is probably why > this works for you, though. > > Yes, the docs do say that FOLL_PIN is for accessing the pages. But, > there's a crucial thing that it leaves out: *WHO* will be accessing the > pages. For Direct IO, for instance, the CPU isn't touching the page at > all. It's always a device. Also, crucially, the page contents are > *not* accessible from the CPU's perspective after a gup. They're not > accessible until a kmap(). They're also not even accessible for > *devices* after a gup. There's a _separate_ mapping process that's > requires to make them accessible to the CPU. I think the crucial detail is that we can fail gup(), while we cannot ever fail kmap() or whatever else a device needs to do. > > --- a/mm/page-writeback.c > > +++ b/mm/page-writeback.c > > @@ -2764,7 +2764,7 @@ int test_clear_page_writeback(struct page *page) > > int __test_set_page_writeback(struct page *page, bool keep_write) > > { > > struct address_space *mapping = page_mapping(page); > > - int ret; > > + int ret, access_ret; > > > > lock_page_memcg(page); > > if (mapping && mapping_use_writeback_tags(mapping)) { > > @@ -2807,6 +2807,13 @@ int __test_set_page_writeback(struct page *page, bool keep_write) > > inc_zone_page_state(page, NR_ZONE_WRITE_PENDING); > > } > > unlock_page_memcg(page); > > + access_ret = arch_make_page_accessible(page); > > + /* > > + * If writeback has been triggered on a page that cannot be made > > + * accessible, it is too late to recover here. > > + */ > > + VM_BUG_ON_PAGE(access_ret != 0, page); > > + > > return ret; > > > > } > > I think this one really shows the cracks in the approach. Pages being > swapped *don't* have get_user_pages() done on them since we've already > got the physical page at the time writeback and aren't looking at PTEs. I suspect this happens because FOLL_TOUCH or something later does set_page_dirty() on the page, which then eventually gets it in writeback. Failing gup() ealier, should ensure the above VM_BUG never happens, unless someone is doing dodgy things. > Why do I care? > > I was looking at AMD's SEV (Secure Encrypted Virtualization) code which > is in the kernel which shares some implementation details with the > not-in-the-tree Intel MKTME. SEV currently has a concept of guest pages > being encrypted and being gibberish to the host, plus a handshake to > share guest-selected pages. 
Some of the side-effects of exposing the > gibberish to the host aren't great (I think it can break cache coherency > if a stray write occurs) and it would be nice to get better behavior. > > But, to get better behavior, the host kernel might need to remove pages > from its direct map, making them inaccessible. But for SEV we would actually need to fail this arch_make_page_acesssible() thing, right? The encrypted guest pages cannot be sanely accessed by the host IIRC, ever. Isn't their encryption key linked to the phys addr of the page? > I was hoping to reuse > arch_make_page_accessible() for obvious reasons. But, get_user_pages() > is not the right spot to map pages because they might not *ever* be > accessed by the CPU, only devices. I'm confused, why does it matter who accesses it? The point is that they want to access it through this vaddr/mapping.
On 4/15/20 3:17 PM, Peter Zijlstra wrote: > On Wed, Apr 15, 2020 at 02:52:31PM -0700, Dave Hansen wrote: >> Yes, the docs do say that FOLL_PIN is for accessing the pages. But, >> there's a crucial thing that it leaves out: *WHO* will be accessing the >> pages. For Direct IO, for instance, the CPU isn't touching the page at >> all. It's always a device. Also, crucially, the page contents are >> *not* accessible from the CPU's perspective after a gup. They're not >> accessible until a kmap(). They're also not even accessible for >> *devices* after a gup. There's a _separate_ mapping process that's >> requires to make them accessible to the CPU. > > I think the crucial detail is that we can fail gup(), while we cannot > ever fail kmap() or whatever else a device needs to do. Yep, good point. The patch in question says that you now need to do something to the page before it can be accessed by the kernel. Because this is presumably anonymous-only, and the main way to get to anonymous pages is the page tables, and the gup code is our de facto user page table walker, this works OK-ish. But, the gup code isn't our only walker. Grepping for pte_page() finds a bunch of sites that this approach misses. They'll theoretically each have to be patched if we want to extend this gup-time approach for anything other than anonymous, small pages. (Most of the additional pte_page() sites are for huge pages, which can't be protected on s390.) >>> + access_ret = arch_make_page_accessible(page); >>> + /* >>> + * If writeback has been triggered on a page that cannot be made >>> + * accessible, it is too late to recover here. >>> + */ >>> + VM_BUG_ON_PAGE(access_ret != 0, page); >>> + >>> return ret; >>> >>> } >> >> I think this one really shows the cracks in the approach. Pages being >> swapped *don't* have get_user_pages() done on them since we've already >> got the physical page at the time writeback and aren't looking at PTEs. > > I suspect this happens because FOLL_TOUCH or something later does > set_page_dirty() on the page, which then eventually gets it in > writeback. I assumed that this was all anonymous-only so it's always dirty before writeback starts. >> Why do I care? >> >> I was looking at AMD's SEV (Secure Encrypted Virtualization) code which >> is in the kernel which shares some implementation details with the >> not-in-the-tree Intel MKTME. SEV currently has a concept of guest pages >> being encrypted and being gibberish to the host, plus a handshake to >> share guest-selected pages. Some of the side-effects of exposing the >> gibberish to the host aren't great (I think it can break cache coherency >> if a stray write occurs) and it would be nice to get better behavior. >> >> But, to get better behavior, the host kernel might need to remove pages >> from its direct map, making them inaccessible. > > But for SEV we would actually need to fail this > arch_make_page_acesssible() thing, right? Yeah, we would ideally fail it, but not at the current arch_make_page_accessible() site. If the PTE isn't usable, we shouldn't be creating it in the first place, or shouldn't leave it Present=1 and GUP'able in the page tables once the underlying memory is no longer accessible. I _think_ vm_normal_page() is the right place to do it, when we're dealing with a PTE but don't yet have a 'struct page'. > The encrypted guest pages cannot be sanely accessed by the host IIRC, > ever. Isn't their encryption key linked to the phys addr of the > page?
Yes, and the keys can't even be used unless you are inside the VM. But this begs the question: why did we create PTEs which can't be used? Just to have something to gup? >> I was hoping to reuse >> arch_make_page_accessible() for obvious reasons. But, get_user_pages() >> is not the right spot to map pages because they might not *ever* be >> accessed by the CPU, only devices. > > I'm confused, why does it matter who accesses it? The point is that they > want to access it through this vaddr/mapping. To me, that's the entire *point* of get_user_pages(). It's someone saying: "I want to find out what this mapping does, but I actually can't use *that* mapping." I'm either: 1. A device that does I/O to the paddr space or through an IOMMU, or 2. The kernel but I want access to that page via *another* mapping (if we could use the gup'd mapping, we would, but we know we can't) and I need the physical address space to stay consistent for a bit so I can do those things via other address spaces.
On Wed, 15 Apr 2020 14:52:31 -0700 Dave Hansen <dave.hansen@intel.com> wrote: > On 3/6/20 5:25 AM, Claudio Imbrenda wrote: > > + /* > > + * We need to make the page accessible if and only if we > > are going > > + * to access its content (the FOLL_PIN case). Please see > > + * Documentation/core-api/pin_user_pages.rst for details. > > + */ > > + if (flags & FOLL_PIN) { > > + ret = arch_make_page_accessible(page); > > + if (ret) { > > + unpin_user_page(page); > > + page = ERR_PTR(ret); > > + goto out; > > + } > > + } > > Thanks, Claudio, for a really thorough refresher on this in private > mail. > > But, I think this mechanism probably hooks into the wrong place. I > don't doubt that it *functions* on s390, but I think these calls are > misplaced. I think the end result is that no other architecture will > have a chance to use the same hooks. They're far too s390-specific > even for a concept that's not limited to s390. > > get_user_pages(FOLL_PIN) does *not* mean "the kernel will access this > page's contents". The kmap() family is really what we use for that. it means that _something_ _might_ access the content of the physical page. be it kernel or device, and the device can access the page through DMA or through other means (and yes on s390 many devices read and write directly from/to memory without using DMA... it's complicated) also, not all architectures use kmap (e.g. s390 doesn't) > kmap()s are often *preceded* by get_user_pages(), which is probably > why this works for you, though. > > Yes, the docs do say that FOLL_PIN is for accessing the pages. But, > there's a crucial thing that it leaves out: *WHO* will be accessing exactly > the pages. For Direct IO, for instance, the CPU isn't touching the > page at all. It's always a device. Also, crucially, the page exactly. and that is the one case we need to protect ourselves from. letting a device touch directly a protected page causes an unrecoverable error state in the device, potentially bringing down the whole system. and this would be triggerable by userspace. > contents are *not* accessible from the CPU's perspective after a gup. depends on the architecture, I think > They're not accessible until a kmap(). They're also not even > accessible for *devices* after a gup. There's a _separate_ mapping also depends on the architecture > process that's requires to make them accessible to the CPU. > > > --- a/mm/page-writeback.c > > +++ b/mm/page-writeback.c > > @@ -2764,7 +2764,7 @@ int test_clear_page_writeback(struct page > > *page) int __test_set_page_writeback(struct page *page, bool > > keep_write) { > > struct address_space *mapping = page_mapping(page); > > - int ret; > > + int ret, access_ret; > > > > lock_page_memcg(page); > > if (mapping && mapping_use_writeback_tags(mapping)) { > > @@ -2807,6 +2807,13 @@ int __test_set_page_writeback(struct page > > *page, bool keep_write) inc_zone_page_state(page, > > NR_ZONE_WRITE_PENDING); } > > unlock_page_memcg(page); > > + access_ret = arch_make_page_accessible(page); > > + /* > > + * If writeback has been triggered on a page that cannot > > be made > > + * accessible, it is too late to recover here. > > + */ > > + VM_BUG_ON_PAGE(access_ret != 0, page); > > + > > return ret; > > > > } > > I think this one really shows the cracks in the approach. Pages being > swapped *don't* have get_user_pages() done on them since we've already > got the physical page at the time writeback and aren't looking at > PTEs. correct. that's why we are doing it when setting the writeback bit. 
> They're read by I/O devices sending them out to storage, but also by > the CPU if you're doing something like zswap. But, again, critically, > accessing page contents won't be done until kmap(). is kmap called for direct I/O too? > I suspect you saw crashes underneath __swap_writepage()->submit_bio() > and looked a few lines up to the set_page_writeback() and decided to > hook in there. I think a better spot, again, is to hook into kmap() > which is called in the block layer. making a page accessible is a potentially long operation (e.g. on s390 requires the hardware to encrypt the page and do some other expensive operations), while kmap is a nop. > Why do I care? > > I was looking at AMD's SEV (Secure Encrypted Virtualization) code > which is in the kernel which shares some implementation details with > the not-in-the-tree Intel MKTME. SEV currently has a concept of > guest pages being encrypted and being gibberish to the host, plus a > handshake to share guest-selected pages. Some of the side-effects of > exposing the gibberish to the host aren't great (I think it can break > cache coherency if a stray write occurs) and it would be nice to get > better behavior. > > But, to get better behavior, the host kernel might need to remove > pages from its direct map, making them inaccessible. I was hoping to > reuse arch_make_page_accessible() for obvious reasons. But, we are talking about physical pages being inaccessible, not mappings. you can have the page correctly mapped and still inaccessible. > get_user_pages() is not the right spot to map pages because they > might not *ever* be accessed by the CPU, only devices. > > Anyway, I know it's late feedback, but I'd hate to have core code like > this that has no hope of ever getting reused.
On Wed, 15 Apr 2020 16:34:14 -0700 Dave Hansen <dave.hansen@intel.com> wrote: > On 4/15/20 3:17 PM, Peter Zijlstra wrote: > > On Wed, Apr 15, 2020 at 02:52:31PM -0700, Dave Hansen wrote: > >> Yes, the docs do say that FOLL_PIN is for accessing the pages. > >> But, there's a crucial thing that it leaves out: *WHO* will be > >> accessing the pages. For Direct IO, for instance, the CPU isn't > >> touching the page at all. It's always a device. Also, crucially, > >> the page contents are *not* accessible from the CPU's perspective > >> after a gup. They're not accessible until a kmap(). They're also > >> not even accessible for *devices* after a gup. There's a > >> _separate_ mapping process that's requires to make them accessible > >> to the CPU. > > > > I think the crucial detail is that we can fail gup(), while we > > cannot ever fail kmap() or whatever else a device needs to do. > > Yep, good point. > > The patch in question puts says that you now need to do something to > the page before it can be accessed by the kernel. Because this is actually, before it can be accessed by anything. In fact, if the kernel touches a protected page on s390 it gets a recoverable fault, and our fault handler basically calls arch_make_page_accessible. the problem is that I/O devices can touch the memory without the CPU touching it first. so they would try to read or write on protected memory, causing all kinds of issues. > presumably anonymous-only, and the only main way to get to anonymous > pages is the page tables, and the gup code is our de facto user page > table walker, this works OK-ish. actually this also works for mmapped memory, you can always write to memory and then call sync, the effect is similar to swapping (both use writeback IIRC) > But, the gup code isn't out only walker. Grepping for pte_page() > finds a bunch of sites that this approach misses. They'll > theoretically each have to be patched if we want to extend this > gup-time approach for anything other than anonymous, small pages. well, no. walking the page tables gives you a physical address, but you have no guarantees that the page will still be there later. that's the whole point of gup - you make sure the page stays pinned in memory e.g. stuff like follow_page without FOLL_PIN or FOLL_GET offers no guarantee that the page will still be there one nanosecond later > (Most of the additional pte_page() sites are for huge pages, which > can't be protected on s390.) > > >>> + access_ret = arch_make_page_accessible(page); > >>> + /* > >>> + * If writeback has been triggered on a page that cannot > >>> be made > >>> + * accessible, it is too late to recover here. > >>> + */ > >>> + VM_BUG_ON_PAGE(access_ret != 0, page); > >>> + > >>> return ret; > >>> > >>> } > >> > >> I think this one really shows the cracks in the approach. Pages > >> being swapped *don't* have get_user_pages() done on them since > >> we've already got the physical page at the time writeback and > >> aren't looking at PTEs. > > > > I suspect this happens because FOLL_TOUCH or something later does > > set_page_dirty() on the page, which then eventually gets it in > > writeback. > > I assumed that this was all anonymous-only so it's always dirty before > writeback starts. it could also be mmapped > >> Why do I care? > >> > >> I was looking at AMD's SEV (Secure Encrypted Virtualization) code > >> which is in the kernel which shares some implementation details > >> with the not-in-the-tree Intel MKTME. 
SEV currently has a concept > >> of guest pages being encrypted and being gibberish to the host, > >> plus a handshake to share guest-selected pages. Some of the > >> side-effects of exposing the gibberish to the host aren't great (I > >> think it can break cache coherency if a stray write occurs) and it > >> would be nice to get better behavior. > >> > >> But, to get better behavior, the host kernel might need to remove > >> pages from its direct map, making them inaccessible. > > > > But for SEV we would actually need to fail this > > arch_make_page_acesssible() thing, right? > > Yeah, we would ideally fail it, but not at the current > arch_make_page_acesssible() site. If the PTE isn't usable, we > shouldn't be creating it in the first place, or shouldn't leave it > Present=1 and GUP'able in the page tables once the underlying memory > is no longer accessible. the problem is that the page needs to be present, otherwise it cannot be used in the VM. the protected VM can access the content of otherwise inaccessible pages -- that's the whole point. we have userspace pages that need to be mapped, but whose content should not be touched by I/O without being "made accessible" first. and the I/O devices don't necessarily use the DMA API > I _think_ vm_normal_page() is the right place to do it, when we're > dealing with a PTE but don't yet have a 'struct page'. > > > The encrypted guest pages cannot be sanely accessed by the host > > IIRC, ever. Isn't their encryption key linked to the phys addr of > > the page? > Yes, and the keys can't even be used unless you are inside the VM. > > But this begs the question: why did we create PTEs which can't be > used? Just to have something to gup? the userspace PTEs (which are used for the protected guest) are needed by the protected guest to access its memory. memory that is "inaccessible" for the kernel is accessible by the hardware/firmware and by the protected guest it belongs to. if the PTE is not valid, the guest cannot run. > >> I was hoping to reuse > >> arch_make_page_accessible() for obvious reasons. But, > >> get_user_pages() is not the right spot to map pages because they > >> might not *ever* be accessed by the CPU, only devices. > > > > I'm confused, why does it matter who accesses it? The point is that > > they want to access it through this vaddr/mapping. > > To me, that's the entire *point* of get_user_pages(). It's someone > saying: "I want to find out what this mapping does, but I actually > can't use *that* mapping." I'm either: > > 1. A device that does I/O to the paddr space or through an IOMMU, or this is exactly the thing we want to protect ourselves from > 2. The kernel but I want access to that page via *another* mapping (if > we could use the gup'd mapping, we would, but we know we can't) on s390 this is also true sometimes, because the kernel has a 1-to-1 mapping of all physical memory, and that's the mapping used to access all pages when we have only the physical address. copy to user actually uses the userspace mapping s390 has separate address spaces for kernel and userspace also, notice that in this case which mapping is used is irrelevant: you could work with paging disabled, you still need to make the physical pages accessible before the kernel or anyone else (except hardware/firmware and the protected VM it belongs to) can touch them. > and I need the physical address space to stay consistent for a bit so > I can do those things via other address spaces.
On 4/16/20 5:15 AM, Claudio Imbrenda wrote: >> I assumed that this was all anonymous-only so it's always dirty before >> writeback starts. > it could also be mmapped Let's say you have a mmap()'d ramfs file. Another process which doesn't have it mapped calls sys_write() and writes to the file. This means that the host has to write to the physical page and must do arch_make_page_accessible() in the sys_write() path somewhere. There is a get_user_pages() in that path, but it's on the _source_ buffer, not the ramfs page because the ramfs page is not mapped. There's also no __test_set_page_writeback() because you can't write back ramfs. Where is the arch_make_page_accessible() in this case on the ramfs page?
On Thu, 16 Apr 2020 07:20:48 -0700 Dave Hansen <dave.hansen@intel.com> wrote: > On 4/16/20 5:15 AM, Claudio Imbrenda wrote: > >> I assumed that this was all anonymous-only so it's always dirty > >> before writeback starts. > > it could also be mmapped > > Let's say you have a mmap()'d ramfs file. Another process calls which > doesn't have it mapped calls sys_write() and writes to the file. > > This means that host host has to write to the physical page and must > do arch_make_page_accessible() in the sys_write() path somewhere. > > There is a get_user_pages() in that path, but it's on the _source_ > buffer, not the ramfs page because the ramfs page is not mapped. > There's also no __test_set_page_writeback() because you can't write > back ramfs. > > Where is the arch_make_page_accessible() in this case on the ramfs > page? it's in the fault handler for the exception the CPU will get when attempting to write the data to the protected page
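A heavily simplified sketch of that fault-driven path (illustrative only — the real s390 exception handler lives in arch code and the function name below is invented) is:

/*
 * Illustrative only: rough shape of the host fault path described above.
 * The architecture's exception handler has already resolved the faulting
 * address to a struct page; it then makes the page accessible and lets
 * the interrupted access be retried.
 */
#include <linux/mm.h>
#include <linux/gfp.h>

static int example_handle_secure_storage_fault(struct page *page)
{
	int rc;

	rc = arch_make_page_accessible(page);
	if (rc)
		return rc;	/* caller decides whether to retry or fail hard */

	return 0;		/* 0: retry the faulting access, it will now succeed */
}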
On 4/16/20 7:59 AM, Claudio Imbrenda wrote: > On Thu, 16 Apr 2020 07:20:48 -0700 > Dave Hansen <dave.hansen@intel.com> wrote: >> On 4/16/20 5:15 AM, Claudio Imbrenda wrote: >>>> I assumed that this was all anonymous-only so it's always dirty >>>> before writeback starts. >>> it could also be mmapped >> >> Let's say you have a mmap()'d ramfs file. Another process calls which >> doesn't have it mapped calls sys_write() and writes to the file. ... >> Where is the arch_make_page_accessible() in this case on the ramfs >> page? > > it's in the fault handler for the exception the CPU will get when > attempting to write the data to the protected page Ahh, so this is *just* intended to precede I/O done on the page, when a non-host entity is touching the memory? That seems inconsistent with the process_vm_readv/writev() paths which set FOLL_PIN on their pin_remote_user_pages() requests, but don't do I/O to the memory.
On Thu, 16 Apr 2020 08:36:50 -0700 Dave Hansen <dave.hansen@intel.com> wrote: > On 4/16/20 7:59 AM, Claudio Imbrenda wrote: > > On Thu, 16 Apr 2020 07:20:48 -0700 > > Dave Hansen <dave.hansen@intel.com> wrote: > >> On 4/16/20 5:15 AM, Claudio Imbrenda wrote: > >>>> I assumed that this was all anonymous-only so it's always dirty > >>>> before writeback starts. > >>> it could also be mmapped > >> > >> Let's say you have a mmap()'d ramfs file. Another process calls > >> which doesn't have it mapped calls sys_write() and writes to the > >> file. > ... > >> Where is the arch_make_page_accessible() in this case on the ramfs > >> page? > > > > it's in the fault handler for the exception the CPU will get when > > attempting to write the data to the protected page > > Ahh, so this is *just* intended to precede I/O done on the page, when > a non-host entity is touching the memory? yep > That seems inconsistent with the process_vm_readv/writev() paths which > set FOLL_PIN on their pin_remote_user_pages() requests, but don't do > I/O to the memory. FOLL_PIN simply indicates potential access to the content of the page, not just for I/O. so yes, we are overdoing arch_make_page_accessible() in some cases, because we can't tell when a page will be used for I/O and when not. In most cases this will boil down to checking a flag and doing nothing, for example in case the page was already accessible. Also note that making the page accessible because of a FOLL_PIN in absence of I/O will probably later on spare us from triggering and handling the exception that would have caused us to make the page accessible anyway.
On 4/16/20 9:34 AM, Claudio Imbrenda wrote: >> Ahh, so this is *just* intended to precede I/O done on the page, when >> a non-host entity is touching the memory? > yep OK, so we've got to do an action that precedes *all* I/O to a page. That's not too bad. I still don't understand how this could work generally, though. There are lots of places where I/O is done to a page without either going through __test_set_page_writeback() or gup() with FOLL_PIN set. sendfile() is probably the best example of this: fd = open("/normal/ext4/file", O_RDONLY); sendfile(socket_fd, fd, &off, count); There's no gup in sight since the file doesn't have an address and it's not being written to, so there's no writeback. How does sendfile work?
On 4/16/20 12:02 PM, Dave Hansen wrote: > On 4/16/20 9:34 AM, Claudio Imbrenda wrote: >>> Ahh, so this is *just* intended to precede I/O done on the page, when >>> a non-host entity is touching the memory? >> yep > OK, so we've got to do an action that precedes *all* I/O to a page. > That's not too bad. > > I still don't understand how this could work generally, though There > are lots of places where I/O is done to a page without either going > through __test_set_page_writeback() or gup() with FOLL_PIN set. > > sendfile() is probably the best example of this: > > fd = open("/normal/ext4/file", O_RDONLY); > sendfile(socket_fd, fd, &off, count); > > There's no gup in sight since the file doesn't have an address and it's > not being written to so there's no writeback. > > How does sendfile work? Did you manage to see if sendfile works (or any other operation that DMAs file-backed data without being preceded by a gup)? I suspect it's actually not that hard to fix. As long as you have a dma_ops for the devices in question either via dev->dma_ops or you add an s390 get_arch_vm_ops(), you can fix *all* the DMA sites, sendfile() included. BTW, device drivers do need to know how to use the DMA mapping API. If s390 has drivers that need to be updated, I think that's vastly preferable to incomplete hooks in core mm code.
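As a concrete illustration of the DMA-map-time alternative suggested here, a hypothetical dma_map_ops callback could refuse to map a page that cannot be made accessible. This is a sketch, not code from the series; the trivial 1:1 mapping and the ops name are invented:

/*
 * Hypothetical sketch: enforce accessibility when a page is mapped for
 * DMA instead of in core mm.  Not part of this series; the 1:1 mapping
 * below stands in for whatever the architecture really does.
 */
#include <linux/dma-mapping.h>
#include <linux/gfp.h>

static dma_addr_t example_map_page(struct device *dev, struct page *page,
				   unsigned long offset, size_t size,
				   enum dma_data_direction dir,
				   unsigned long attrs)
{
	/* refuse to hand an inaccessible page to a device */
	if (arch_make_page_accessible(page))
		return DMA_MAPPING_ERROR;

	return page_to_phys(page) + offset;
}

static const struct dma_map_ops example_dma_ops = {
	.map_page	= example_map_page,
	/* .unmap_page, .map_sg, ... omitted from this sketch */
};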
On 4/21/20 2:31 PM, Dave Hansen wrote: > On 4/16/20 12:02 PM, Dave Hansen wrote: >> On 4/16/20 9:34 AM, Claudio Imbrenda wrote: >>>> Ahh, so this is *just* intended to precede I/O done on the page, when >>>> a non-host entity is touching the memory? >>> yep >> OK, so we've got to do an action that precedes *all* I/O to a page. >> That's not too bad. >> >> I still don't understand how this could work generally, though There >> are lots of places where I/O is done to a page without either going >> through __test_set_page_writeback() or gup() with FOLL_PIN set. >> >> sendfile() is probably the best example of this: >> >> fd = open("/normal/ext4/file", O_RDONLY); >> sendfile(socket_fd, fd, &off, count); >> >> There's no gup in sight since the file doesn't have an address and it's >> not being written to so there's no writeback. >> >> How does sendfile work? > > Did you manage to see if sendfile works (or any other operation that > DMAs file-backed data without being preceded by a gup)? It's been a couple of weeks with no response on this. From where I'm standing, we have a hook in the core VM that can't possibly work with some existing kernel functionality and has virtually no chance of getting used on a second architecture. It sounds like there may need to be some additional work here, but should these hooks stay in for 5.7? Or, should we revert this patch and try again for 5.8?
On 28.04.20 21:43, Dave Hansen wrote: > On 4/21/20 2:31 PM, Dave Hansen wrote: >> On 4/16/20 12:02 PM, Dave Hansen wrote: >>> On 4/16/20 9:34 AM, Claudio Imbrenda wrote: >>>>> Ahh, so this is *just* intended to precede I/O done on the page, when >>>>> a non-host entity is touching the memory? >>>> yep >>> OK, so we've got to do an action that precedes *all* I/O to a page. >>> That's not too bad. >>> >>> I still don't understand how this could work generally, though There >>> are lots of places where I/O is done to a page without either going >>> through __test_set_page_writeback() or gup() with FOLL_PIN set. >>> >>> sendfile() is probably the best example of this: >>> >>> fd = open("/normal/ext4/file", O_RDONLY); >>> sendfile(socket_fd, fd, &off, count); >>> >>> There's no gup in sight since the file doesn't have an address and it's >>> not being written to so there's no writeback. >>> >>> How does sendfile work? >> >> Did you manage to see if sendfile works (or any other operation that >> DMAs file-backed data without being preceded by a gup)? > > It's been a couple of weeks with no response on this. > > From where I'm standing, we have a hook in the core VM that can't > possibly work with some existing kernel functionality and has virtually > no chance of getting used on a second architecture. > > It sounds like there may need to be some additional work here, but > should these hooks stay in for 5.7? Or, should we revert this patch and > try again for 5.8? We have something and Claudio is going to send out an addon patch soon. Reverting would be harmful, basically allowing for I/O errors on a valid path that can happen with normal use. The sendfile case is also valid, but it only triggers for handcrafted cases (you have to change QEMU to use MAP_SHARED and you have to use a file as memory backing and you then have to splice this file to another file that is using DIRECT_IO). Those cases then run into an I/O error case that is recovered by the driver. The same would not work for a swap backing. It is also not trivial (or even doable) to change all device drivers that exist on s390 to use the DMA API. In the long run we might be able to get the necessary changes into the DMA API. Claudio will send the details tomorrow.
On Tue, 28 Apr 2020 12:43:45 -0700 Dave Hansen <dave.hansen@intel.com> wrote: > On 4/21/20 2:31 PM, Dave Hansen wrote: > > On 4/16/20 12:02 PM, Dave Hansen wrote: > >> On 4/16/20 9:34 AM, Claudio Imbrenda wrote: > >>>> Ahh, so this is *just* intended to precede I/O done on the page, > >>>> when a non-host entity is touching the memory? > >>> yep > >> OK, so we've got to do an action that precedes *all* I/O to a page. > >> That's not too bad. > >> > >> I still don't understand how this could work generally, though > >> There are lots of places where I/O is done to a page without > >> either going through __test_set_page_writeback() or gup() with > >> FOLL_PIN set. > >> > >> sendfile() is probably the best example of this: > >> > >> fd = open("/normal/ext4/file", O_RDONLY); > >> sendfile(socket_fd, fd, &off, count); > >> > >> There's no gup in sight since the file doesn't have an address and > >> it's not being written to so there's no writeback. > >> > >> How does sendfile work? > > > > Did you manage to see if sendfile works (or any other operation that > > DMAs file-backed data without being preceded by a gup)? > > It's been a couple of weeks with no response on this. sorry, I've been busy with things > From where I'm standing, we have a hook in the core VM that can't > possibly work with some existing kernel functionality and has > virtually no chance of getting used on a second architecture. it seems to work at least for us, so it does possibly work :) regarding second architectures: when we started sending these patches around, there has been interest from some other architectures, so just because nobody else needs them now, it doesn't mean nobody will use them ever. Moreover this is the only way for us to reasonably implement this (see below). > It sounds like there may need to be some additional work here, but > should these hooks stay in for 5.7? Or, should we revert this patch > and try again for 5.8? I don't see why we should revert a patch that works as intended and poses no overhead for non-users, whereas reverting it would break functionality. Now let me elaborate a little on the DMA API. There are some issues with some of the bus types used on s390 when it comes to the DMA API. Most I/O instructions on s390 need to allocate some small control blocks for each operation, and those need to be under 2GB. Those control blocks will be accessed directly by the hardware. The rest of the actual I/O buffers have no restriction and can be anywhere (64 bits). Setting the DMA mask to 2GB means that all other buffers will be almost always bounced, which is unacceptable. Especially since there are no bounce buffers set up for s390x hosts anyway (they are set up only in protected guests (and not in normal guests), so this was also introduced quite recently). Also notice that, until now, there has been no actual need to use the DMA API on most s390 device drivers, hence why it's not being used there. I know that you are used to needing the DMA API for DMA operations, otherwise Bad Things™ happen, but this is not the case on s390 (for non-PCI devices at least). So until the DMA API is fixed, there is no way to convert all the drivers to the DMA API (which would be quite a lot of work in itself already, but that's not the point here). A fix for the DMA API was actually proposed by my colleague Halil several months ago now, but it did not go through.
His proposal was to allow architectures to override the GFP flags for DMA allocations, so that some buffers can be allocated from certain areas and other buffers from elsewhere. I hope this clarifies the matter a little :)
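To make the shape of that proposal concrete, a minimal sketch of a per-architecture override of the GFP flags used for DMA allocations might look like the following; the hook name and the size check mentioned in the comment are assumptions, since the actual patches are not shown here:

/*
 * Purely illustrative: let an architecture adjust the GFP flags used for
 * DMA allocations, so that e.g. small control blocks come from memory
 * below 2GB while ordinary I/O buffers are not bounced.  The hook name
 * arch_dma_alloc_gfp() is invented for this sketch.
 */
#include <linux/gfp.h>
#include <linux/device.h>

#ifndef arch_dma_alloc_gfp
static inline gfp_t arch_dma_alloc_gfp(struct device *dev, size_t size,
				       gfp_t gfp)
{
	return gfp;	/* default: leave the caller's flags untouched */
}
#endif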
On 4/28/20 4:39 PM, Claudio Imbrenda wrote: >> From where I'm standing, we have a hook in the core VM that can't >> possibly work with some existing kernel functionality and has >> virtually no chance of getting used on a second architecture. > it seems to work at least for us, so it does possibly work :) I think all you're saying is that it's been very lightly tested. :) > regarding second architectures: when we started sending these patches > around, there has been interest from some other architectures, so > just because nobody else needs them now, it doesn't mean nobody will > use them ever. I was really interested in using them... until I looked at them. Conceptually, they do something really useful, but the _implementation_ falls short of its promises. I can't imagine ever using these hooks on x86.
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index e5b817cb86e7..be2754841369 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -485,6 +485,12 @@ static inline void arch_free_page(struct page *page, int order) { }
 #ifndef HAVE_ARCH_ALLOC_PAGE
 static inline void arch_alloc_page(struct page *page, int order) { }
 #endif
+#ifndef HAVE_ARCH_MAKE_PAGE_ACCESSIBLE
+static inline int arch_make_page_accessible(struct page *page)
+{
+	return 0;
+}
+#endif
 
 struct page *
 __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
diff --git a/mm/gup.c b/mm/gup.c
index 81a95fbe9901..d0c4c6f336bb 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -413,6 +413,7 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
 	struct page *page;
 	spinlock_t *ptl;
 	pte_t *ptep, pte;
+	int ret;
 
 	/* FOLL_GET and FOLL_PIN are mutually exclusive. */
 	if (WARN_ON_ONCE((flags & (FOLL_PIN | FOLL_GET)) ==
@@ -471,8 +472,6 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
 		if (is_zero_pfn(pte_pfn(pte))) {
 			page = pte_page(pte);
 		} else {
-			int ret;
-
 			ret = follow_pfn_pte(vma, address, ptep, flags);
 			page = ERR_PTR(ret);
 			goto out;
@@ -480,7 +479,6 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
 	}
 
 	if (flags & FOLL_SPLIT && PageTransCompound(page)) {
-		int ret;
 		get_page(page);
 		pte_unmap_unlock(ptep, ptl);
 		lock_page(page);
@@ -497,6 +495,19 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
 		page = ERR_PTR(-ENOMEM);
 		goto out;
 	}
+	/*
+	 * We need to make the page accessible if and only if we are going
+	 * to access its content (the FOLL_PIN case). Please see
+	 * Documentation/core-api/pin_user_pages.rst for details.
+	 */
+	if (flags & FOLL_PIN) {
+		ret = arch_make_page_accessible(page);
+		if (ret) {
+			unpin_user_page(page);
+			page = ERR_PTR(ret);
+			goto out;
+		}
+	}
 	if (flags & FOLL_TOUCH) {
 		if ((flags & FOLL_WRITE) &&
 		    !pte_dirty(pte) && !PageDirty(page))
@@ -2162,6 +2173,19 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
 
 		VM_BUG_ON_PAGE(compound_head(page) != head, page);
 
+		/*
+		 * We need to make the page accessible if and only if we are
+		 * going to access its content (the FOLL_PIN case). Please
+		 * see Documentation/core-api/pin_user_pages.rst for
+		 * details.
+		 */
+		if (flags & FOLL_PIN) {
+			ret = arch_make_page_accessible(page);
+			if (ret) {
+				unpin_user_page(page);
+				goto pte_unmap;
+			}
+		}
 		SetPageReferenced(page);
 		pages[*nr] = page;
 		(*nr)++;
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index ab5a3cee8ad3..b7f3d0766a5f 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2764,7 +2764,7 @@ int test_clear_page_writeback(struct page *page)
 int __test_set_page_writeback(struct page *page, bool keep_write)
 {
 	struct address_space *mapping = page_mapping(page);
-	int ret;
+	int ret, access_ret;
 
 	lock_page_memcg(page);
 	if (mapping && mapping_use_writeback_tags(mapping)) {
@@ -2807,6 +2807,13 @@ int __test_set_page_writeback(struct page *page, bool keep_write)
 		inc_zone_page_state(page, NR_ZONE_WRITE_PENDING);
 	}
 	unlock_page_memcg(page);
+	access_ret = arch_make_page_accessible(page);
+	/*
+	 * If writeback has been triggered on a page that cannot be made
+	 * accessible, it is too late to recover here.
+	 */
+	VM_BUG_ON_PAGE(access_ret != 0, page);
+
 	return ret;
 
 }
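For reference, the generic stub in the gfp.h hunk above only applies when an architecture does not provide its own implementation. A sketch of the arch-side opt-in (following the HAVE_ARCH_* convention already used in that hunk; the body below is a placeholder, not the real s390 code) would be:

/* arch/<arch>/include/asm/page.h */
#define HAVE_ARCH_MAKE_PAGE_ACCESSIBLE
int arch_make_page_accessible(struct page *page);

/* somewhere in arch/<arch>/mm/ */
int arch_make_page_accessible(struct page *page)
{
	/*
	 * Ask the hardware/firmware to export (encrypt) the page if it is
	 * currently inaccessible to the host; return 0 on success or a
	 * negative error code, which FOLL_PIN callers will see.
	 */
	return 0;	/* placeholder body for this sketch */
}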