Message ID | 20200428225043.3091359-1-imbrenda@linux.ibm.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [v1,1/1] fs/splice: add missing callback for inaccessible pages | expand |
On 4/28/20 3:50 PM, Claudio Imbrenda wrote: > --- a/fs/splice.c > +++ b/fs/splice.c > @@ -106,6 +106,9 @@ static int page_cache_pipe_buf_confirm(struct pipe_inode_info *pipe, > struct page *page = buf->page; > int err; > > + if (arch_make_page_accessible(page)) > + return -EIO; > + > if (!PageUptodate(page)) { > lock_page(page); This is a cute fix, but doesn't it 100% depend on the internal implementation detail of page cache sendfile() being implemented with a pipe? Depending on that seems rather fragile. While I'm glad that you surgically plugged the one single, specific case that I pointed out, I can't help but suspect there are more of these. For instance, I tried a file-to-file sendfile, basically: fd1 = open("file1"); fd2 = open("file2"); sendfile(fd1, fd2, ...); ftrace showed page_cache_pipe_buf_confirm() getting called for the source pipe pages but not the receiver. There were no calls to arch_make_page_accessible() outside of page_cache_pipe_buf_confirm() (I put a stub in for it on x86 so I could trace it). That indicates to me that one side of this might be fixed (the sender), but the receiver is not. This also doesn't even have the maintainer of fs/splice.c on cc. The changelog about what this is trying to do probably also lacks enough context to bring Al up to speed about what this is trying to do.
On 4/28/20 3:50 PM, Claudio Imbrenda wrote: > If a page is inaccesible and it is used for things like sendfile, then > the content of the page is not always touched, and can be passed > directly to a driver, causing issues. > > This patch fixes the issue by adding a call to arch_make_page_accessible > in page_cache_pipe_buf_confirm; this fixes the issue. I spent about 5 minutes putting together a patch: https://sr71.net/~dave/intel/accessible.patch It adds a page flag ("daccess") which starts out set. It clears the flag it when the page is added to the page cache or mapped as anonymous. This are presumably the the two mostly likely kinds of pages to be problematic. It re-sets the flag when it hits the new hook for s390: arch_make_page_accessible(). It then patches the DMA mapping API. If a page gets to the DMA mapping API without being accessible, it hits a tracepoint. It goes boom shortly after hitting userspace underneath a sys_sendto(). That code uses lib/iov_iter.c which does get_user_pages_fast() and apparently does not set FOLL_PIN, so never hits the s390 arch hooks. I hacked out the FOLL_PIN check and just universally call the hook for all gup_pte_range() calls. I think you'll need to do that as well. I don't think the assumptions about FOLL_PIN always preceding I/O is true universally. Hacking out FOLL_PIN quiets down the warning spew quite a bit, but it still hits a few of them. Here's one example: 0) sd-reso-410 | | /* mm_accessible_error: ... sd-resolve-410 [000] .... 212.918838: <stack trace> => trace_event_raw_event_mm_accessible_error => check_page_accessible => e1000_xmit_frame => dev_hard_start_xmit => sch_direct_xmit => __qdisc_run => __dev_queue_xmit => ip_finish_output2 => ip_output => ip_send_skb => udp_send_skb.isra.59 => udp_sendmsg => ____sys_sendmsg => ___sys_sendmsg => __sys_sendmmsg => __x64_sys_sendmmsg => do_syscall_64 => entry_SYSCALL_64_after_hwframe This is just from booting and sitting on an idle Ubuntu 16.04.6 system. I think the process in question here is the systemd resolver.
On 29.04.20 18:07, Dave Hansen wrote: > On 4/28/20 3:50 PM, Claudio Imbrenda wrote: >> If a page is inaccesible and it is used for things like sendfile, then >> the content of the page is not always touched, and can be passed >> directly to a driver, causing issues. >> >> This patch fixes the issue by adding a call to arch_make_page_accessible >> in page_cache_pipe_buf_confirm; this fixes the issue. > > I spent about 5 minutes putting together a patch: > > https://sr71.net/~dave/intel/accessible.patch > > It adds a page flag ("daccess") which starts out set. It clears the > flag it when the page is added to the page cache or mapped as anonymous. And that of course does not work. Pages are not made unaccessible at a random point in time. We do check for several page flags and page count before doing so and we also do this while with paqe_ref_freeze to avoid several races. I guess you just hit one of those.
On 4/29/20 10:31 AM, Christian Borntraeger wrote: > On 29.04.20 18:07, Dave Hansen wrote: >> On 4/28/20 3:50 PM, Claudio Imbrenda wrote: >>> If a page is inaccesible and it is used for things like sendfile, then >>> the content of the page is not always touched, and can be passed >>> directly to a driver, causing issues. >>> >>> This patch fixes the issue by adding a call to arch_make_page_accessible >>> in page_cache_pipe_buf_confirm; this fixes the issue. >> I spent about 5 minutes putting together a patch: >> >> https://sr71.net/~dave/intel/accessible.patch >> >> It adds a page flag ("daccess") which starts out set. It clears the >> flag it when the page is added to the page cache or mapped as anonymous. > And that of course does not work. Pages are not made unaccessible at a random > point in time. We do check for several page flags and page count before doing > so and we also do this while with paqe_ref_freeze to avoid several races. > I guess you just hit one of those. Actually, that's the problem. You've gone through all these careful checks and made the page inaccessible. *After* that process, how do you keep the page from being hit by an I/O device before it's made accessible again? My patch just assumes that *all* pages have gone through that process and passed those checks. I'm pretty sure if I lifted all the checks in arch/s390/kernel/uv.c::make_secure_pte() and duplicated them at the sites where I'm doing ClearPageAccessible(), they'd happily pass. Freezing page refs is a transient thing you do *during* the conversion, but it doesn't stop future access to the page. That's what these incomplete hooks are trying to do. Anyway, I look forward to seeing the patch for the FOLL_PIN issue I pointed out, and I hope to see another copy of the fs/splice changes with a proper changelog and the maintainer on cc. It's starting to get late in the rc's.
On Wed, 29 Apr 2020 10:55:52 -0700 Dave Hansen <dave.hansen@intel.com> wrote: > On 4/29/20 10:31 AM, Christian Borntraeger wrote: > > On 29.04.20 18:07, Dave Hansen wrote: > >> On 4/28/20 3:50 PM, Claudio Imbrenda wrote: > >>> If a page is inaccesible and it is used for things like sendfile, > >>> then the content of the page is not always touched, and can be > >>> passed directly to a driver, causing issues. > >>> > >>> This patch fixes the issue by adding a call to > >>> arch_make_page_accessible in page_cache_pipe_buf_confirm; this > >>> fixes the issue. > >> I spent about 5 minutes putting together a patch: > >> > >> https://sr71.net/~dave/intel/accessible.patch > >> > >> It adds a page flag ("daccess") which starts out set. It clears > >> the flag it when the page is added to the page cache or mapped as > >> anonymous. > > And that of course does not work. Pages are not made unaccessible > > at a random point in time. We do check for several page flags and > > page count before doing so and we also do this while with > > paqe_ref_freeze to avoid several races. I guess you just hit one of > > those. > > Actually, that's the problem. You've gone through all these careful > checks and made the page inaccessible. *After* that process, how do > you keep the page from being hit by an I/O device before it's made > accessible again? My patch just assumes that *all* pages have gone > through that process and passed those checks. I don't understand what you are saying here. we start with all pages accessible, we mark pages as inaccessible when they are imported in the secure guest (we use the PG_arch_1 bit in struct page). we then try to catch all I/O on inaccessible pages and make them accessible so that I/O devices are happy. when we need to make page inaccessible again, we mark the page, make sure the counters and the flag in struct page look ok, then we actually make it inaccessible. this way pages that are undergoing I/O will never actually transition from from accessible to inaccessible, although they might briefly marked as inaccessible. this means that the flag we use to mark a page inaccessible is overindicating, but that is fine. pages need to be accessible in order to be used for I/O, and need to be inaccessible in order to be used by protected VMs. > I'm pretty sure if I lifted all the checks in > arch/s390/kernel/uv.c::make_secure_pte() and duplicated them at the > sites where I'm doing ClearPageAccessible(), they'd happily pass. > > Freezing page refs is a transient thing you do *during* the > conversion, but it doesn't stop future access to the page. That's > what these incomplete hooks are trying to do. we use the PG_arch_1 bit to actually stop access to the page, freezing is used just for a short moment to make sure nobody is touching the page and avoid races while we are checking. > Anyway, I look forward to seeing the patch for the FOLL_PIN issue I > pointed out, and I hope to see another copy of the fs/splice changes > with a proper changelog and the maintainer on cc. It's starting to > get late in the rc's. either your quick and dirty patch was too dirty (e.g. not accounting for possible races between make_accessible/make_inaccessible), or some of the functions in the trace you provided should do pin_user_page instead of get_user_page (or both) our first implementation (before FOLL_PIN was introduced) called make_page_accessible for each FOLL_GET. Once FOLL_PIN was introduced, we thought it would be the right place. If you think calling make_accessible for both FOLL_PIN and FOLL_GET is the right thing, then I can surely patch it that way. I'll send out an v2 for fs/splice later on today.
On 4/29/20 3:53 PM, Claudio Imbrenda wrote: >> Actually, that's the problem. You've gone through all these careful >> checks and made the page inaccessible. *After* that process, how do >> you keep the page from being hit by an I/O device before it's made >> accessible again? My patch just assumes that *all* pages have gone >> through that process and passed those checks. > I don't understand what you are saying here. > > we start with all pages accessible, we mark pages as inaccessible when > they are imported in the secure guest (we use the PG_arch_1 bit in > struct page). we then try to catch all I/O on inaccessible pages and > make them accessible so that I/O devices are happy. The catching mechanism is incomplete, that's all I'm saying. Without looking too hard, and not even having the hardware, I've found two paths where the "catching" was incomplete: 1. sendfile(), which you've patched 2. sendto(), which you haven't patched > either your quick and dirty patch was too dirty (e.g. not accounting > for possible races between make_accessible/make_inaccessible), or some > of the functions in the trace you provided should do pin_user_page > instead of get_user_page (or both) I looked in the traces for any races. For sendto(), at least, the make_accessible() never happened before the process exited. That's entirely consistent with the theory that it entirely missed being caught. I can't find any evidence that there were races. Go ahead and try it. You have the patch! I mean, I found a bug in about 10 minutes in one tiny little VM. And, yes, you need to get rid of the FOLL_PIN check unless you want to go change a big swath of the remaining get_user_pages*() sites.
On Wed, 29 Apr 2020 16:52:46 -0700 Dave Hansen <dave.hansen@intel.com> wrote: > On 4/29/20 3:53 PM, Claudio Imbrenda wrote: > >> Actually, that's the problem. You've gone through all these > >> careful checks and made the page inaccessible. *After* that > >> process, how do you keep the page from being hit by an I/O device > >> before it's made accessible again? My patch just assumes that > >> *all* pages have gone through that process and passed those > >> checks. > > I don't understand what you are saying here. > > > > we start with all pages accessible, we mark pages as inaccessible > > when they are imported in the secure guest (we use the PG_arch_1 > > bit in struct page). we then try to catch all I/O on inaccessible > > pages and make them accessible so that I/O devices are happy. > > The catching mechanism is incomplete, that's all I'm saying. well, sendto in the end does a copy_from_user or a get_user_pages_fast, both are covered (once we fix the make_accessible to work on FOLL_GET too). > Without looking too hard, and not even having the hardware, I've found > two paths where the "catching" was incomplete: > > 1. sendfile(), which you've patched > 2. sendto(), which you haven't patched > > > either your quick and dirty patch was too dirty (e.g. not accounting > > for possible races between make_accessible/make_inaccessible), or > > some of the functions in the trace you provided should do > > pin_user_page instead of get_user_page (or both) > > I looked in the traces for any races. For sendto(), at least, the > make_accessible() never happened before the process exited. That's > entirely consistent with the theory that it entirely missed being > caught. I can't find any evidence that there were races. > > Go ahead and try it. You have the patch! I mean, I found a bug in > about 10 minutes in one tiny little VM. I tried your patch, but I could not reproduce the problem. I have a Debian 10 x86_64 with the latest kernel from master and your patch on top. is there anything I'm missing? which virtual devices are you using? any particular .config options? I could easily get the mm_make_accessible tracepoints, but I never manage to trigger the mm_accessible_error ones. are you using transparent hugepages by any chance? the infrastructure for inaccessible pages is meant only for small pages, since on s390x only small pages can ever be used for secure guests and therefore become inaccessible. > And, yes, you need to get rid of the FOLL_PIN check unless you want to > go change a big swath of the remaining get_user_pages*() sites.
On 4/30/20 10:19 AM, Claudio Imbrenda wrote: > On Wed, 29 Apr 2020 16:52:46 -0700 > Dave Hansen <dave.hansen@intel.com> wrote: > >> On 4/29/20 3:53 PM, Claudio Imbrenda wrote: >>>> Actually, that's the problem. You've gone through all these >>>> careful checks and made the page inaccessible. *After* that >>>> process, how do you keep the page from being hit by an I/O device >>>> before it's made accessible again? My patch just assumes that >>>> *all* pages have gone through that process and passed those >>>> checks. >>> I don't understand what you are saying here. >>> >>> we start with all pages accessible, we mark pages as inaccessible >>> when they are imported in the secure guest (we use the PG_arch_1 >>> bit in struct page). we then try to catch all I/O on inaccessible >>> pages and make them accessible so that I/O devices are happy. >> >> The catching mechanism is incomplete, that's all I'm saying. > > well, sendto in the end does a copy_from_user or a get_user_pages_fast, > both are covered (once we fix the make_accessible to work on FOLL_GET > too). Agreed. This is inching forward, increasing the coverage of the hooks. My confidence in the hooks being complete at this point is pretty low. >> Without looking too hard, and not even having the hardware, I've found >> two paths where the "catching" was incomplete: >> >> 1. sendfile(), which you've patched >> 2. sendto(), which you haven't patched >> >>> either your quick and dirty patch was too dirty (e.g. not accounting >>> for possible races between make_accessible/make_inaccessible), or >>> some of the functions in the trace you provided should do >>> pin_user_page instead of get_user_page (or both) >> >> I looked in the traces for any races. For sendto(), at least, the >> make_accessible() never happened before the process exited. That's >> entirely consistent with the theory that it entirely missed being >> caught. I can't find any evidence that there were races. >> >> Go ahead and try it. You have the patch! I mean, I found a bug in >> about 10 minutes in one tiny little VM. > > I tried your patch, but I could not reproduce the problem. I have a > Debian 10 x86_64 with the latest kernel from master and your patch on > top. is there anything I'm missing? which virtual devices are you using? > any particular .config options? I'm using an emulated e1000 which seems to be the device that notices problems most readily. A snippet of the qemu command-line is: qemu-system-x86_64 -enable-kvm -drive id=disk,file=/path,if=none -device ahci,id=ahci -device ide-drive,drive=disk,bus=ahci.0 -net nic,model=e1000 -net user,net=192.168.76.0/24,dhcpstart=192.168.76.44 I'm also using "-cpu host,-pcid" if that makes a difference. It's just a plain Skylake client system: "Intel(R) Core(TM) i5-6300U CPU @ 2.40GHz". > are you using transparent hugepages by any chance? the > infrastructure for inaccessible pages is meant only for small pages, > since on s390x only small pages can ever be used for secure > guests and therefore become inaccessible. Yes, THP is probably enabled and in play. But, I did dump out PageAnon() in some of my traces and it was never set, which rules out THP. The fact that this infrastructure is not designed to play nicely with large pages doesn't bode well for its use on a second architecture.
On 29.04.20 18:07, Dave Hansen wrote: > On 4/28/20 3:50 PM, Claudio Imbrenda wrote: >> If a page is inaccesible and it is used for things like sendfile, then >> the content of the page is not always touched, and can be passed >> directly to a driver, causing issues. >> >> This patch fixes the issue by adding a call to arch_make_page_accessible >> in page_cache_pipe_buf_confirm; this fixes the issue. > > I spent about 5 minutes putting together a patch: > > https://sr71.net/~dave/intel/accessible.patch You only set the page flag for compound pages. that of course leaves a big pile of pages marked a not accessible, thus explaining the sendto trace and all kind of other random traces. What do you see when you also do the SetPageAccessible(page); in the else page of prep_new_page (order == 0). (I do get > 10000 of these non compound page allocs just during boot). > > It adds a page flag ("daccess") which starts out set. It clears the > flag it when the page is added to the page cache or mapped as anonymous. > This are presumably the the two mostly likely kinds of pages to be > problematic. It re-sets the flag when it hits the new hook for s390: > arch_make_page_accessible(). > > It then patches the DMA mapping API. If a page gets to the DMA mapping > API without being accessible, it hits a tracepoint. > > It goes boom shortly after hitting userspace underneath a sys_sendto(). > That code uses lib/iov_iter.c which does get_user_pages_fast() and > apparently does not set FOLL_PIN, so never hits the s390 arch hooks. > > I hacked out the FOLL_PIN check and just universally call the hook for > all gup_pte_range() calls. I think you'll need to do that as well. I > don't think the assumptions about FOLL_PIN always preceding I/O is true > universally. Hacking out FOLL_PIN quiets down the warning spew quite a > bit, but it still hits a few of them. > > Here's one example: > > 0) sd-reso-410 | | /* mm_accessible_error: ... > sd-resolve-410 [000] .... 212.918838: <stack trace> > => trace_event_raw_event_mm_accessible_error > => check_page_accessible > => e1000_xmit_frame > => dev_hard_start_xmit > => sch_direct_xmit > => __qdisc_run > => __dev_queue_xmit > => ip_finish_output2 > => ip_output > => ip_send_skb > => udp_send_skb.isra.59 > => udp_sendmsg > => ____sys_sendmsg > => ___sys_sendmsg > => __sys_sendmmsg > => __x64_sys_sendmmsg > => do_syscall_64 > => entry_SYSCALL_64_after_hwframe > > This is just from booting and sitting on an idle Ubuntu 16.04.6 system. > I think the process in question here is the systemd resolver. >
On 30.04.20 20:12, Christian Borntraeger wrote: > > > On 29.04.20 18:07, Dave Hansen wrote: >> On 4/28/20 3:50 PM, Claudio Imbrenda wrote: >>> If a page is inaccesible and it is used for things like sendfile, then >>> the content of the page is not always touched, and can be passed >>> directly to a driver, causing issues. >>> >>> This patch fixes the issue by adding a call to arch_make_page_accessible >>> in page_cache_pipe_buf_confirm; this fixes the issue. >> >> I spent about 5 minutes putting together a patch: >> >> https://sr71.net/~dave/intel/accessible.patch > > You only set the page flag for compound pages. that of course leaves a big pile > of pages marked a not accessible, thus explaining the sendto trace and all kind > of other random traces. > > > What do you see when you also do the SetPageAccessible(page); > in the else page of prep_new_page (order == 0). > (I do get > 10000 of these non compound page allocs just during boot). > And yes, I think you are right that we should call the callback also for !FOLL_PIN.
On 4/30/20 11:12 AM, Christian Borntraeger wrote: > On 29.04.20 18:07, Dave Hansen wrote: >> On 4/28/20 3:50 PM, Claudio Imbrenda wrote: >>> If a page is inaccesible and it is used for things like sendfile, then >>> the content of the page is not always touched, and can be passed >>> directly to a driver, causing issues. >>> >>> This patch fixes the issue by adding a call to arch_make_page_accessible >>> in page_cache_pipe_buf_confirm; this fixes the issue. >> I spent about 5 minutes putting together a patch: >> >> https://sr71.net/~dave/intel/accessible.patch > You only set the page flag for compound pages. that of course leaves a big pile > of pages marked a not accessible, thus explaining the sendto trace and all kind > of other random traces. Ahh, nice catch! That does explain an oddity or two that I saw. > What do you see when you also do the SetPageAccessible(page); > in the else page of prep_new_page (order == 0). > (I do get > 10000 of these non compound page allocs just during boot). Yes, I see the same thing. I updated the patch and double-checked that it triggers properly with a socket-based sendfile().
On 30.04.20 21:32, Dave Hansen wrote: > On 4/30/20 11:12 AM, Christian Borntraeger wrote: >> On 29.04.20 18:07, Dave Hansen wrote: >>> On 4/28/20 3:50 PM, Claudio Imbrenda wrote: >>>> If a page is inaccesible and it is used for things like sendfile, then >>>> the content of the page is not always touched, and can be passed >>>> directly to a driver, causing issues. >>>> >>>> This patch fixes the issue by adding a call to arch_make_page_accessible >>>> in page_cache_pipe_buf_confirm; this fixes the issue. >>> I spent about 5 minutes putting together a patch: >>> >>> https://sr71.net/~dave/intel/accessible.patch >> You only set the page flag for compound pages. that of course leaves a big pile >> of pages marked a not accessible, thus explaining the sendto trace and all kind >> of other random traces. > > Ahh, nice catch! That does explain an oddity or two that I saw. > >> What do you see when you also do the SetPageAccessible(page); >> in the else page of prep_new_page (order == 0). >> (I do get > 10000 of these non compound page allocs just during boot). > > Yes, I see the same thing. > > I updated the patch and double-checked that it triggers properly with a > socket-based sendfile(). Do you have a calltrace?
On 30.04.20 21:32, Dave Hansen wrote: > On 4/30/20 11:12 AM, Christian Borntraeger wrote: >> On 29.04.20 18:07, Dave Hansen wrote: >>> On 4/28/20 3:50 PM, Claudio Imbrenda wrote: >>>> If a page is inaccesible and it is used for things like sendfile, then >>>> the content of the page is not always touched, and can be passed >>>> directly to a driver, causing issues. >>>> >>>> This patch fixes the issue by adding a call to arch_make_page_accessible >>>> in page_cache_pipe_buf_confirm; this fixes the issue. >>> I spent about 5 minutes putting together a patch: >>> >>> https://sr71.net/~dave/intel/accessible.patch >> You only set the page flag for compound pages. that of course leaves a big pile >> of pages marked a not accessible, thus explaining the sendto trace and all kind >> of other random traces. > > Ahh, nice catch! That does explain an oddity or two that I saw. > >> What do you see when you also do the SetPageAccessible(page); >> in the else page of prep_new_page (order == 0). >> (I do get > 10000 of these non compound page allocs just during boot). > > Yes, I see the same thing. > > I updated the patch and double-checked that it triggers properly with a > socket-based sendfile(). Can you place the updated patch somewhere?
On 30.04.20 21:02, Christian Borntraeger wrote: > On 30.04.20 20:12, Christian Borntraeger wrote: >> >> >> On 29.04.20 18:07, Dave Hansen wrote: >>> On 4/28/20 3:50 PM, Claudio Imbrenda wrote: >>>> If a page is inaccesible and it is used for things like sendfile, then >>>> the content of the page is not always touched, and can be passed >>>> directly to a driver, causing issues. >>>> >>>> This patch fixes the issue by adding a call to arch_make_page_accessible >>>> in page_cache_pipe_buf_confirm; this fixes the issue. >>> >>> I spent about 5 minutes putting together a patch: >>> >>> https://sr71.net/~dave/intel/accessible.patch >> >> You only set the page flag for compound pages. that of course leaves a big pile >> of pages marked a not accessible, thus explaining the sendto trace and all kind >> of other random traces. >> >> >> What do you see when you also do the SetPageAccessible(page); >> in the else page of prep_new_page (order == 0). >> (I do get > 10000 of these non compound page allocs just during boot). >> > > And yes, I think you are right that we should call the callback also for !FOLL_PIN. Thinking again about this I am no longer sure. Adding John Hubbard. Documentation/core-api/pin_user_pages.rst says: -------snip---------- Another way of thinking about these flags is as a progression of restrictions: FOLL_GET is for struct page manipulation, without affecting the data that the struct page refers to. FOLL_PIN is a *replacement* for FOLL_GET, and is for short term pins on pages whose data *will* get accessed. As such, FOLL_PIN is a "more severe" form of pinning. And finally, FOLL_LONGTERM is an even more restrictive case that has FOLL_PIN as a prerequisite: this is for pages that will be pinned longterm, and whose data will be accessed. -------snip---------- So John,is it ok to give a page to an I/O device where the code has used gup with FOLL_GET (or gup fast without pup) or would you consider this a bug?
On 4/30/20 12:38 PM, Christian Borntraeger wrote: >> >>> What do you see when you also do the SetPageAccessible(page); >>> in the else page of prep_new_page (order == 0). >>> (I do get > 10000 of these non compound page allocs just during boot). >> Yes, I see the same thing. >> >> I updated the patch and double-checked that it triggers properly with a >> socket-based sendfile(). > Do you have a calltrace? It triggers with the thread from this patch _not_ applied. I just wanted to point out that it was able to find the real bug and that the patch in question squashed this instance. Here's the call trace I see: > [ 199.566150] WARNING: CPU: 0 PID: 878 at mm/page_alloc.c:8860 check_page_accessible+0x5f/0xb0 > [ 199.567813] Modules linked in: > [ 199.568447] CPU: 0 PID: 878 Comm: server Not tainted 5.7.0-rc3-dirty #6544 > [ 199.569948] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014 > [ 199.571980] RIP: 0010:check_page_accessible+0x5f/0xb0 > [ 199.572737] Code: 4c 01 48 85 db 74 18 48 8b 03 48 8b 7b 08 48 83 c3 18 48 89 ee ff d0 48 8b 03 48 85 c0 75 eb 48 8b 45 00 a9 00 00 20 00 75 bb <0f> 0b 0f 1f 44 00 00 3e 80 4d 02 20 5b 5d c3 65 8b 05 db 72 d3 7e > [ 199.576514] RSP: 0018:ffffc900003f7810 EFLAGS: 00010246 > [ 199.577556] RAX: 000ffff800020016 RBX: ffff88800ff5b068 RCX: 0000000000000000 > [ 199.578964] RDX: 0000000000000003 RSI: 0000000000000008 RDI: ffff88800fc18400 > [ 199.580352] RBP: ffffea0001d5ca40 R08: 00000000000001b5 R09: ffff88800fe74000 > [ 199.581784] R10: ffff88800fe74170 R11: ffff88800fc18400 R12: 0000000000001000 > [ 199.583179] R13: 0000000000000000 R14: 0000000000000000 R15: ffffc900005e52d0 > [ 199.584587] FS: 00007ffff7fe8700(0000) GS:ffff88807ce00000(0000) knlGS:0000000000000000 > [ 199.586045] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [ 199.587200] CR2: 00007ffff7b042b0 CR3: 000000007ab5a000 CR4: 00000000003406f0 > [ 199.588655] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > [ 199.591784] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 > [ 199.593217] Call Trace: > [ 199.593755] ? e1000_xmit_frame+0x50c/0x1040 > [ 199.594626] ? dev_hard_start_xmit+0x8d/0x1e0 > [ 199.596126] ? sch_direct_xmit+0xe8/0x220 > [ 199.597676] ? __qdisc_run+0x13a/0x4e0 > [ 199.598476] ? __dev_queue_xmit+0x2d7/0x710 > [ 199.599767] ? ip_finish_output2+0x2a8/0x550 > [ 199.601195] ? skb_gso_validate_network_len+0x11/0x70 > [ 199.602272] ? ip_output+0x6d/0xe0 > [ 199.602955] ? ip_forward_options.cold.9+0x27/0x27 > [ 199.603918] ? __ip_queue_xmit+0x14f/0x370 > [ 199.604734] ? __tcp_transmit_skb+0x54b/0xad0 > [ 199.605608] ? tcp_write_xmit+0x379/0x10c0 > [ 199.606354] ? do_tcp_sendpages+0x2bc/0x5f0 > [ 199.607206] ? tcp_sendpage_locked+0x44/0x60 > [ 199.608066] ? tcp_sendpage+0x37/0x50 > [ 199.608801] ? inet_sendpage+0x4f/0x80 > [ 199.609554] ? kernel_sendpage+0x17/0x20 > [ 199.610341] ? sock_sendpage+0x20/0x30 > [ 199.610989] ? pipe_to_sendpage+0x60/0xa0 > [ 199.611658] ? __splice_from_pipe+0x9f/0x180 > [ 199.612567] ? generic_pipe_buf_nosteal+0x10/0x10 > [ 199.613509] ? generic_pipe_buf_nosteal+0x10/0x10 > [ 199.614452] ? splice_from_pipe+0x5d/0x90 > [ 199.615258] ? direct_splice_actor+0x32/0x40 > [ 199.616102] ? splice_direct_to_actor+0x101/0x220 > [ 199.617054] ? pipe_to_sendpage+0xa0/0xa0 > [ 199.617858] ? do_splice_direct+0x9a/0xd0 > [ 199.618660] ? do_sendfile+0x1ce/0x3d0 > [ 199.619411] ? __x64_sys_sendfile64+0x5c/0xc0 > [ 199.620343] ? do_syscall_64+0x4a/0x130 > [ 199.621042] ? entry_SYSCALL_64_after_hwframe+0x49/0xb3 > [ 199.622119] ---[ end trace 19796ac5d41cc1f4 ]---
On 30.04.20 22:01, Dave Hansen wrote: > On 4/30/20 12:38 PM, Christian Borntraeger wrote: >>> >>>> What do you see when you also do the SetPageAccessible(page); >>>> in the else page of prep_new_page (order == 0). >>>> (I do get > 10000 of these non compound page allocs just during boot). >>> Yes, I see the same thing. >>> >>> I updated the patch and double-checked that it triggers properly with a >>> socket-based sendfile(). >> Do you have a calltrace? > > It triggers with the thread from this patch _not_ applied. I just > wanted to point out that it was able to find the real bug and that the > patch in question squashed this instance. Ah now it makes sense. Thanks. > > Here's the call trace I see: > >> [ 199.566150] WARNING: CPU: 0 PID: 878 at mm/page_alloc.c:8860 check_page_accessible+0x5f/0xb0 >> [ 199.567813] Modules linked in: >> [ 199.568447] CPU: 0 PID: 878 Comm: server Not tainted 5.7.0-rc3-dirty #6544 >> [ 199.569948] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014 >> [ 199.571980] RIP: 0010:check_page_accessible+0x5f/0xb0 >> [ 199.572737] Code: 4c 01 48 85 db 74 18 48 8b 03 48 8b 7b 08 48 83 c3 18 48 89 ee ff d0 48 8b 03 48 85 c0 75 eb 48 8b 45 00 a9 00 00 20 00 75 bb <0f> 0b 0f 1f 44 00 00 3e 80 4d 02 20 5b 5d c3 65 8b 05 db 72 d3 7e >> [ 199.576514] RSP: 0018:ffffc900003f7810 EFLAGS: 00010246 >> [ 199.577556] RAX: 000ffff800020016 RBX: ffff88800ff5b068 RCX: 0000000000000000 >> [ 199.578964] RDX: 0000000000000003 RSI: 0000000000000008 RDI: ffff88800fc18400 >> [ 199.580352] RBP: ffffea0001d5ca40 R08: 00000000000001b5 R09: ffff88800fe74000 >> [ 199.581784] R10: ffff88800fe74170 R11: ffff88800fc18400 R12: 0000000000001000 >> [ 199.583179] R13: 0000000000000000 R14: 0000000000000000 R15: ffffc900005e52d0 >> [ 199.584587] FS: 00007ffff7fe8700(0000) GS:ffff88807ce00000(0000) knlGS:0000000000000000 >> [ 199.586045] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 >> [ 199.587200] CR2: 00007ffff7b042b0 CR3: 000000007ab5a000 CR4: 00000000003406f0 >> [ 199.588655] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 >> [ 199.591784] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 >> [ 199.593217] Call Trace: >> [ 199.593755] ? e1000_xmit_frame+0x50c/0x1040 >> [ 199.594626] ? dev_hard_start_xmit+0x8d/0x1e0 >> [ 199.596126] ? sch_direct_xmit+0xe8/0x220 >> [ 199.597676] ? __qdisc_run+0x13a/0x4e0 >> [ 199.598476] ? __dev_queue_xmit+0x2d7/0x710 >> [ 199.599767] ? ip_finish_output2+0x2a8/0x550 >> [ 199.601195] ? skb_gso_validate_network_len+0x11/0x70 >> [ 199.602272] ? ip_output+0x6d/0xe0 >> [ 199.602955] ? ip_forward_options.cold.9+0x27/0x27 >> [ 199.603918] ? __ip_queue_xmit+0x14f/0x370 >> [ 199.604734] ? __tcp_transmit_skb+0x54b/0xad0 >> [ 199.605608] ? tcp_write_xmit+0x379/0x10c0 >> [ 199.606354] ? do_tcp_sendpages+0x2bc/0x5f0 >> [ 199.607206] ? tcp_sendpage_locked+0x44/0x60 >> [ 199.608066] ? tcp_sendpage+0x37/0x50 >> [ 199.608801] ? inet_sendpage+0x4f/0x80 >> [ 199.609554] ? kernel_sendpage+0x17/0x20 >> [ 199.610341] ? sock_sendpage+0x20/0x30 >> [ 199.610989] ? pipe_to_sendpage+0x60/0xa0 >> [ 199.611658] ? __splice_from_pipe+0x9f/0x180 >> [ 199.612567] ? generic_pipe_buf_nosteal+0x10/0x10 >> [ 199.613509] ? generic_pipe_buf_nosteal+0x10/0x10 >> [ 199.614452] ? splice_from_pipe+0x5d/0x90 >> [ 199.615258] ? direct_splice_actor+0x32/0x40 >> [ 199.616102] ? splice_direct_to_actor+0x101/0x220 >> [ 199.617054] ? pipe_to_sendpage+0xa0/0xa0 >> [ 199.617858] ? do_splice_direct+0x9a/0xd0 >> [ 199.618660] ? do_sendfile+0x1ce/0x3d0 >> [ 199.619411] ? __x64_sys_sendfile64+0x5c/0xc0 >> [ 199.620343] ? do_syscall_64+0x4a/0x130 >> [ 199.621042] ? entry_SYSCALL_64_after_hwframe+0x49/0xb3 >> [ 199.622119] ---[ end trace 19796ac5d41cc1f4 ]---
On 2020-04-30 12:54, Christian Borntraeger wrote: > On 30.04.20 21:02, Christian Borntraeger wrote: >> On 30.04.20 20:12, Christian Borntraeger wrote: >>> On 29.04.20 18:07, Dave Hansen wrote: >>>> On 4/28/20 3:50 PM, Claudio Imbrenda wrote: >>>>> If a page is inaccesible and it is used for things like sendfile, then >>>>> the content of the page is not always touched, and can be passed >>>>> directly to a driver, causing issues. >>>>> >>>>> This patch fixes the issue by adding a call to arch_make_page_accessible >>>>> in page_cache_pipe_buf_confirm; this fixes the issue. >>>> >>>> I spent about 5 minutes putting together a patch: >>>> >>>> https://sr71.net/~dave/intel/accessible.patch >>> >>> You only set the page flag for compound pages. that of course leaves a big pile >>> of pages marked a not accessible, thus explaining the sendto trace and all kind >>> of other random traces. >>> >>> >>> What do you see when you also do the SetPageAccessible(page); >>> in the else page of prep_new_page (order == 0). >>> (I do get > 10000 of these non compound page allocs just during boot). >>> >> >> And yes, I think you are right that we should call the callback also for !FOLL_PIN. > Disclaimer: I haven't dug into the details of the latest points above, so answers below will be narrowly focused. > > Thinking again about this I am no longer sure. Adding John Hubbard. > > Documentation/core-api/pin_user_pages.rst says: > -------snip---------- > Another way of thinking about these flags is as a progression of restrictions: > FOLL_GET is for struct page manipulation, without affecting the data that the > struct page refers to. FOLL_PIN is a *replacement* for FOLL_GET, and is for > short term pins on pages whose data *will* get accessed. As such, FOLL_PIN is > a "more severe" form of pinning. And finally, FOLL_LONGTERM is an even more > restrictive case that has FOLL_PIN as a prerequisite: this is for pages that > will be pinned longterm, and whose data will be accessed. > -------snip---------- > > So John,is it ok to give a page to an I/O device where the code has used gup > with FOLL_GET (or gup fast without pup) or would you consider this a bug? > Well, it's a bug (or a bug-in-waiting): even though gup/FOLL_GET works just as well (and as badly) as ever, pup/FOLL_PIN is required in order to safely and correctly allow a non-CPU device to operate on a page's data. Core mm and fs code is going to key off of page_maybe_dma_pinned() in order to make critical decisions about writeback and umount, and FOLL_PIN opts into that; FOLL_GET does not. Basically, you'd be creating another set of call sites that someone would have to convert to pup/FOLL_PIN. btw, on the FOLL_LONGTERM documentation above: that's more of an aspiration than a description of current behavior, in some ways. The current FOLL_LONGTERM is a little more quirky than is implied there. Also on a related note, I've been slow in posting patches to implement the remaining call site conversions, and am trying to get back to that asap. There have been some distractions. :) Once every call site is correctly using gup or pup, it will be easier for everyone. thanks, -- John Hubbard NVIDIA
diff --git a/fs/splice.c b/fs/splice.c index 4735defc46ee..f026e0ce9acd 100644 --- a/fs/splice.c +++ b/fs/splice.c @@ -106,6 +106,9 @@ static int page_cache_pipe_buf_confirm(struct pipe_inode_info *pipe, struct page *page = buf->page; int err; + if (arch_make_page_accessible(page)) + return -EIO; + if (!PageUptodate(page)) { lock_page(page);
If a page is inaccesible and it is used for things like sendfile, then the content of the page is not always touched, and can be passed directly to a driver, causing issues. This patch fixes the issue by adding a call to arch_make_page_accessible in page_cache_pipe_buf_confirm; this fixes the issue. Fixes: f28d43636d6f ("mm/gup/writeback: add callbacks for inaccessible pages") Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com> --- fs/splice.c | 3 +++ 1 file changed, 3 insertions(+)