diff mbox series

[v1,1/1] fs/splice: add missing callback for inaccessible pages

Message ID 20200428225043.3091359-1-imbrenda@linux.ibm.com (mailing list archive)
State New, archived
Headers show
Series [v1,1/1] fs/splice: add missing callback for inaccessible pages | expand

Commit Message

Claudio Imbrenda April 28, 2020, 10:50 p.m. UTC
If a page is inaccesible and it is used for things like sendfile, then
the content of the page is not always touched, and can be passed
directly to a driver, causing issues.

This patch fixes the issue by adding a call to arch_make_page_accessible
in page_cache_pipe_buf_confirm; this fixes the issue.

Fixes: f28d43636d6f ("mm/gup/writeback: add callbacks for inaccessible pages")
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
---
 fs/splice.c | 3 +++
 1 file changed, 3 insertions(+)

Comments

Dave Hansen April 29, 2020, 12:25 a.m. UTC | #1
On 4/28/20 3:50 PM, Claudio Imbrenda wrote:
> --- a/fs/splice.c
> +++ b/fs/splice.c
> @@ -106,6 +106,9 @@ static int page_cache_pipe_buf_confirm(struct pipe_inode_info *pipe,
>  	struct page *page = buf->page;
>  	int err;
>  
> +	if (arch_make_page_accessible(page))
> +		return -EIO;
> +
>  	if (!PageUptodate(page)) {
>  		lock_page(page);

This is a cute fix, but doesn't it 100% depend on the internal
implementation detail of page cache sendfile() being implemented with a
pipe?  Depending on that seems rather fragile.  While I'm glad that you
surgically plugged the one single, specific case that I pointed out, I
can't help but suspect there are more of these.

For instance, I tried a file-to-file sendfile, basically:

	fd1 = open("file1");
	fd2 = open("file2");
	sendfile(fd1, fd2, ...);

ftrace showed page_cache_pipe_buf_confirm() getting called for the
source pipe pages but not the receiver.  There were no calls to
arch_make_page_accessible() outside of page_cache_pipe_buf_confirm() (I
put a stub in for it on x86 so I could trace it).

That indicates to me that one side of this might be fixed (the sender),
but the receiver is not.

This also doesn't even have the maintainer of fs/splice.c on cc.  The
changelog about what this is trying to do probably also lacks enough
context to bring Al up to speed about what this is trying to do.
Dave Hansen April 29, 2020, 4:07 p.m. UTC | #2
On 4/28/20 3:50 PM, Claudio Imbrenda wrote:
> If a page is inaccesible and it is used for things like sendfile, then
> the content of the page is not always touched, and can be passed
> directly to a driver, causing issues.
> 
> This patch fixes the issue by adding a call to arch_make_page_accessible
> in page_cache_pipe_buf_confirm; this fixes the issue.

I spent about 5 minutes putting together a patch:

	https://sr71.net/~dave/intel/accessible.patch

It adds a page flag ("daccess") which starts out set.  It clears the
flag it when the page is added to the page cache or mapped as anonymous.
 This are presumably the the two mostly likely kinds of pages to be
problematic.  It re-sets the flag when it hits the new hook for s390:
arch_make_page_accessible().

It then patches the DMA mapping API.  If a page gets to the DMA mapping
API without being accessible, it hits a tracepoint.

It goes boom shortly after hitting userspace underneath a sys_sendto().
 That code uses lib/iov_iter.c which does get_user_pages_fast() and
apparently does not set FOLL_PIN, so never hits the s390 arch hooks.

I hacked out the FOLL_PIN check and just universally call the hook for
all gup_pte_range() calls.  I think you'll need to do that as well.  I
don't think the assumptions about FOLL_PIN always preceding I/O is true
universally.  Hacking out FOLL_PIN quiets down the warning spew quite a
bit, but it still hits a few of them.

Here's one example:

 0)  sd-reso-410   |               |  /* mm_accessible_error: ...
      sd-resolve-410   [000] ....   212.918838: <stack trace>
 => trace_event_raw_event_mm_accessible_error
 => check_page_accessible
 => e1000_xmit_frame
 => dev_hard_start_xmit
 => sch_direct_xmit
 => __qdisc_run
 => __dev_queue_xmit
 => ip_finish_output2
 => ip_output
 => ip_send_skb
 => udp_send_skb.isra.59
 => udp_sendmsg
 => ____sys_sendmsg
 => ___sys_sendmsg
 => __sys_sendmmsg
 => __x64_sys_sendmmsg
 => do_syscall_64
 => entry_SYSCALL_64_after_hwframe

This is just from booting and sitting on an idle Ubuntu 16.04.6 system.
 I think the process in question here is the systemd resolver.
Christian Borntraeger April 29, 2020, 5:31 p.m. UTC | #3
On 29.04.20 18:07, Dave Hansen wrote:
> On 4/28/20 3:50 PM, Claudio Imbrenda wrote:
>> If a page is inaccesible and it is used for things like sendfile, then
>> the content of the page is not always touched, and can be passed
>> directly to a driver, causing issues.
>>
>> This patch fixes the issue by adding a call to arch_make_page_accessible
>> in page_cache_pipe_buf_confirm; this fixes the issue.
> 
> I spent about 5 minutes putting together a patch:
> 
> 	https://sr71.net/~dave/intel/accessible.patch
> 
> It adds a page flag ("daccess") which starts out set.  It clears the
> flag it when the page is added to the page cache or mapped as anonymous.

And that of course does not work. Pages are not made unaccessible at a random
point in time. We do check for several page flags and page count before doing
so and we also do this while with paqe_ref_freeze to avoid several races.
I guess you just hit one of those.
Dave Hansen April 29, 2020, 5:55 p.m. UTC | #4
On 4/29/20 10:31 AM, Christian Borntraeger wrote:
> On 29.04.20 18:07, Dave Hansen wrote:
>> On 4/28/20 3:50 PM, Claudio Imbrenda wrote:
>>> If a page is inaccesible and it is used for things like sendfile, then
>>> the content of the page is not always touched, and can be passed
>>> directly to a driver, causing issues.
>>>
>>> This patch fixes the issue by adding a call to arch_make_page_accessible
>>> in page_cache_pipe_buf_confirm; this fixes the issue.
>> I spent about 5 minutes putting together a patch:
>>
>> 	https://sr71.net/~dave/intel/accessible.patch
>>
>> It adds a page flag ("daccess") which starts out set.  It clears the
>> flag it when the page is added to the page cache or mapped as anonymous.
> And that of course does not work. Pages are not made unaccessible at a random
> point in time. We do check for several page flags and page count before doing
> so and we also do this while with paqe_ref_freeze to avoid several races.
> I guess you just hit one of those.

Actually, that's the problem.  You've gone through all these careful
checks and made the page inaccessible.  *After* that process, how do you
keep the page from being hit by an I/O device before it's made
accessible again?  My patch just assumes that *all* pages have gone
through that process and passed those checks.

I'm pretty sure if I lifted all the checks in
arch/s390/kernel/uv.c::make_secure_pte() and duplicated them at the
sites where I'm doing ClearPageAccessible(), they'd happily pass.

Freezing page refs is a transient thing you do *during* the conversion,
but it doesn't stop future access to the page.  That's what these
incomplete hooks are trying to do.

Anyway, I look forward to seeing the patch for the FOLL_PIN issue I
pointed out, and I hope to see another copy of the fs/splice changes
with a proper changelog and the maintainer on cc.  It's starting to get
late in the rc's.
Claudio Imbrenda April 29, 2020, 10:53 p.m. UTC | #5
On Wed, 29 Apr 2020 10:55:52 -0700
Dave Hansen <dave.hansen@intel.com> wrote:

> On 4/29/20 10:31 AM, Christian Borntraeger wrote:
> > On 29.04.20 18:07, Dave Hansen wrote:  
> >> On 4/28/20 3:50 PM, Claudio Imbrenda wrote:  
> >>> If a page is inaccesible and it is used for things like sendfile,
> >>> then the content of the page is not always touched, and can be
> >>> passed directly to a driver, causing issues.
> >>>
> >>> This patch fixes the issue by adding a call to
> >>> arch_make_page_accessible in page_cache_pipe_buf_confirm; this
> >>> fixes the issue.  
> >> I spent about 5 minutes putting together a patch:
> >>
> >> 	https://sr71.net/~dave/intel/accessible.patch
> >>
> >> It adds a page flag ("daccess") which starts out set.  It clears
> >> the flag it when the page is added to the page cache or mapped as
> >> anonymous.  
> > And that of course does not work. Pages are not made unaccessible
> > at a random point in time. We do check for several page flags and
> > page count before doing so and we also do this while with
> > paqe_ref_freeze to avoid several races. I guess you just hit one of
> > those.  
> 
> Actually, that's the problem.  You've gone through all these careful
> checks and made the page inaccessible.  *After* that process, how do
> you keep the page from being hit by an I/O device before it's made
> accessible again?  My patch just assumes that *all* pages have gone
> through that process and passed those checks.

I don't understand what you are saying here.

we start with all pages accessible, we mark pages as inaccessible when
they are imported in the secure guest (we use the PG_arch_1 bit in
struct page). we then try to catch all I/O on inaccessible pages and
make them accessible so that I/O devices are happy. when we need to
make page inaccessible again, we mark the page, make sure the counters
and the flag in struct page look ok, then we actually make it
inaccessible. this way pages that are undergoing I/O will never
actually transition from from accessible to inaccessible, although they
might briefly marked as inaccessible. this means that the flag we use
to mark a page inaccessible is overindicating, but that is fine.

pages need to be accessible in order to be used for I/O, and need to be
inaccessible in order to be used by protected VMs.

> I'm pretty sure if I lifted all the checks in
> arch/s390/kernel/uv.c::make_secure_pte() and duplicated them at the
> sites where I'm doing ClearPageAccessible(), they'd happily pass.
> 
> Freezing page refs is a transient thing you do *during* the
> conversion, but it doesn't stop future access to the page.  That's
> what these incomplete hooks are trying to do.

we use the PG_arch_1 bit to actually stop access to the page, freezing
is used just for a short moment to make sure nobody is touching the
page and avoid races while we are checking.

> Anyway, I look forward to seeing the patch for the FOLL_PIN issue I
> pointed out, and I hope to see another copy of the fs/splice changes
> with a proper changelog and the maintainer on cc.  It's starting to
> get late in the rc's.

either your quick and dirty patch was too dirty (e.g. not accounting
for possible races between make_accessible/make_inaccessible), or some
of the functions in the trace you provided should do pin_user_page
instead of get_user_page (or both)


our first implementation (before FOLL_PIN was introduced) called
make_page_accessible for each FOLL_GET. Once FOLL_PIN was introduced,
we thought it would be the right place. If you think calling
make_accessible for both FOLL_PIN and FOLL_GET is the right thing, then
I can surely patch it that way.

I'll send out an v2 for fs/splice later on today.
Dave Hansen April 29, 2020, 11:52 p.m. UTC | #6
On 4/29/20 3:53 PM, Claudio Imbrenda wrote:
>> Actually, that's the problem.  You've gone through all these careful
>> checks and made the page inaccessible.  *After* that process, how do
>> you keep the page from being hit by an I/O device before it's made
>> accessible again?  My patch just assumes that *all* pages have gone
>> through that process and passed those checks.
> I don't understand what you are saying here.
> 
> we start with all pages accessible, we mark pages as inaccessible when
> they are imported in the secure guest (we use the PG_arch_1 bit in
> struct page). we then try to catch all I/O on inaccessible pages and
> make them accessible so that I/O devices are happy. 

The catching mechanism is incomplete, that's all I'm saying.

Without looking too hard, and not even having the hardware, I've found
two paths where the "catching" was incomplete:

	1. sendfile(), which you've patched
	2. sendto(), which you haven't patched

> either your quick and dirty patch was too dirty (e.g. not accounting
> for possible races between make_accessible/make_inaccessible), or some
> of the functions in the trace you provided should do pin_user_page
> instead of get_user_page (or both)

I looked in the traces for any races.  For sendto(), at least, the
make_accessible() never happened before the process exited.  That's
entirely consistent with the theory that it entirely missed being
caught.  I can't find any evidence that there were races.

Go ahead and try it.  You have the patch!  I mean, I found a bug in
about 10 minutes in one tiny little VM.

And, yes, you need to get rid of the FOLL_PIN check unless you want to
go change a big swath of the remaining get_user_pages*() sites.
Claudio Imbrenda April 30, 2020, 5:19 p.m. UTC | #7
On Wed, 29 Apr 2020 16:52:46 -0700
Dave Hansen <dave.hansen@intel.com> wrote:

> On 4/29/20 3:53 PM, Claudio Imbrenda wrote:
> >> Actually, that's the problem.  You've gone through all these
> >> careful checks and made the page inaccessible.  *After* that
> >> process, how do you keep the page from being hit by an I/O device
> >> before it's made accessible again?  My patch just assumes that
> >> *all* pages have gone through that process and passed those
> >> checks.  
> > I don't understand what you are saying here.
> > 
> > we start with all pages accessible, we mark pages as inaccessible
> > when they are imported in the secure guest (we use the PG_arch_1
> > bit in struct page). we then try to catch all I/O on inaccessible
> > pages and make them accessible so that I/O devices are happy.   
> 
> The catching mechanism is incomplete, that's all I'm saying.

well, sendto in the end does a copy_from_user or a get_user_pages_fast,
both are covered (once we fix the make_accessible to work on FOLL_GET
too). 

> Without looking too hard, and not even having the hardware, I've found
> two paths where the "catching" was incomplete:
> 
> 	1. sendfile(), which you've patched
> 	2. sendto(), which you haven't patched
> 
> > either your quick and dirty patch was too dirty (e.g. not accounting
> > for possible races between make_accessible/make_inaccessible), or
> > some of the functions in the trace you provided should do
> > pin_user_page instead of get_user_page (or both)  
> 
> I looked in the traces for any races.  For sendto(), at least, the
> make_accessible() never happened before the process exited.  That's
> entirely consistent with the theory that it entirely missed being
> caught.  I can't find any evidence that there were races.
> 
> Go ahead and try it.  You have the patch!  I mean, I found a bug in
> about 10 minutes in one tiny little VM.

I tried your patch, but I could not reproduce the problem. I have a
Debian 10 x86_64 with the latest kernel from master and your patch on
top. is there anything I'm missing? which virtual devices are you using?
any particular .config options?

I could easily get the mm_make_accessible tracepoints, but I never
manage to trigger the mm_accessible_error ones.

are you using transparent hugepages by any chance? the
infrastructure for inaccessible pages is meant only for small pages,
since on s390x only small pages can ever be used for secure
guests and therefore become inaccessible.

> And, yes, you need to get rid of the FOLL_PIN check unless you want to
> go change a big swath of the remaining get_user_pages*() sites.
Dave Hansen April 30, 2020, 5:30 p.m. UTC | #8
On 4/30/20 10:19 AM, Claudio Imbrenda wrote:
> On Wed, 29 Apr 2020 16:52:46 -0700
> Dave Hansen <dave.hansen@intel.com> wrote:
> 
>> On 4/29/20 3:53 PM, Claudio Imbrenda wrote:
>>>> Actually, that's the problem.  You've gone through all these
>>>> careful checks and made the page inaccessible.  *After* that
>>>> process, how do you keep the page from being hit by an I/O device
>>>> before it's made accessible again?  My patch just assumes that
>>>> *all* pages have gone through that process and passed those
>>>> checks.  
>>> I don't understand what you are saying here.
>>>
>>> we start with all pages accessible, we mark pages as inaccessible
>>> when they are imported in the secure guest (we use the PG_arch_1
>>> bit in struct page). we then try to catch all I/O on inaccessible
>>> pages and make them accessible so that I/O devices are happy.   
>>
>> The catching mechanism is incomplete, that's all I'm saying.
> 
> well, sendto in the end does a copy_from_user or a get_user_pages_fast,
> both are covered (once we fix the make_accessible to work on FOLL_GET
> too). 

Agreed.  This is inching forward, increasing the coverage of the hooks.
 My confidence in the hooks being complete at this point is pretty low.

>> Without looking too hard, and not even having the hardware, I've found
>> two paths where the "catching" was incomplete:
>>
>> 	1. sendfile(), which you've patched
>> 	2. sendto(), which you haven't patched
>>
>>> either your quick and dirty patch was too dirty (e.g. not accounting
>>> for possible races between make_accessible/make_inaccessible), or
>>> some of the functions in the trace you provided should do
>>> pin_user_page instead of get_user_page (or both)  
>>
>> I looked in the traces for any races.  For sendto(), at least, the
>> make_accessible() never happened before the process exited.  That's
>> entirely consistent with the theory that it entirely missed being
>> caught.  I can't find any evidence that there were races.
>>
>> Go ahead and try it.  You have the patch!  I mean, I found a bug in
>> about 10 minutes in one tiny little VM.
> 
> I tried your patch, but I could not reproduce the problem. I have a
> Debian 10 x86_64 with the latest kernel from master and your patch on
> top. is there anything I'm missing? which virtual devices are you using?
> any particular .config options?

I'm using an emulated e1000 which seems to be the device that notices
problems most readily.  A snippet of the qemu command-line is:

qemu-system-x86_64 -enable-kvm -drive id=disk,file=/path,if=none -device
ahci,id=ahci -device ide-drive,drive=disk,bus=ahci.0 -net
nic,model=e1000 -net user,net=192.168.76.0/24,dhcpstart=192.168.76.44

I'm also using "-cpu host,-pcid" if that makes a difference.  It's just
a plain Skylake client system: "Intel(R) Core(TM) i5-6300U CPU @ 2.40GHz".

> are you using transparent hugepages by any chance? the
> infrastructure for inaccessible pages is meant only for small pages,
> since on s390x only small pages can ever be used for secure
> guests and therefore become inaccessible.

Yes, THP is probably enabled and in play.  But, I did dump out
PageAnon() in some of my traces and it was never set, which rules out THP.

The fact that this infrastructure is not designed to play nicely with
large pages doesn't bode well for its use on a second architecture.
Christian Borntraeger April 30, 2020, 6:12 p.m. UTC | #9
On 29.04.20 18:07, Dave Hansen wrote:
> On 4/28/20 3:50 PM, Claudio Imbrenda wrote:
>> If a page is inaccesible and it is used for things like sendfile, then
>> the content of the page is not always touched, and can be passed
>> directly to a driver, causing issues.
>>
>> This patch fixes the issue by adding a call to arch_make_page_accessible
>> in page_cache_pipe_buf_confirm; this fixes the issue.
> 
> I spent about 5 minutes putting together a patch:
> 
> 	https://sr71.net/~dave/intel/accessible.patch

You only set the page flag for compound pages. that of course leaves a big pile
of pages marked a not accessible, thus explaining the sendto trace and all kind
of other random traces.


What do you see when you also do the  SetPageAccessible(page);
in the else page of prep_new_page (order == 0).
(I do get > 10000 of these non compound page allocs just during boot).


> 
> It adds a page flag ("daccess") which starts out set.  It clears the
> flag it when the page is added to the page cache or mapped as anonymous.
>  This are presumably the the two mostly likely kinds of pages to be
> problematic.  It re-sets the flag when it hits the new hook for s390:
> arch_make_page_accessible().
> 
> It then patches the DMA mapping API.  If a page gets to the DMA mapping
> API without being accessible, it hits a tracepoint.
> 
> It goes boom shortly after hitting userspace underneath a sys_sendto().
>  That code uses lib/iov_iter.c which does get_user_pages_fast() and
> apparently does not set FOLL_PIN, so never hits the s390 arch hooks.
> 
> I hacked out the FOLL_PIN check and just universally call the hook for
> all gup_pte_range() calls.  I think you'll need to do that as well.  I
> don't think the assumptions about FOLL_PIN always preceding I/O is true
> universally.  Hacking out FOLL_PIN quiets down the warning spew quite a
> bit, but it still hits a few of them.
> 
> Here's one example:
> 
>  0)  sd-reso-410   |               |  /* mm_accessible_error: ...
>       sd-resolve-410   [000] ....   212.918838: <stack trace>
>  => trace_event_raw_event_mm_accessible_error
>  => check_page_accessible
>  => e1000_xmit_frame
>  => dev_hard_start_xmit
>  => sch_direct_xmit
>  => __qdisc_run
>  => __dev_queue_xmit
>  => ip_finish_output2
>  => ip_output
>  => ip_send_skb
>  => udp_send_skb.isra.59
>  => udp_sendmsg
>  => ____sys_sendmsg
>  => ___sys_sendmsg
>  => __sys_sendmmsg
>  => __x64_sys_sendmmsg
>  => do_syscall_64
>  => entry_SYSCALL_64_after_hwframe
> 
> This is just from booting and sitting on an idle Ubuntu 16.04.6 system.
>  I think the process in question here is the systemd resolver.
>
Christian Borntraeger April 30, 2020, 7:02 p.m. UTC | #10
On 30.04.20 20:12, Christian Borntraeger wrote:
> 
> 
> On 29.04.20 18:07, Dave Hansen wrote:
>> On 4/28/20 3:50 PM, Claudio Imbrenda wrote:
>>> If a page is inaccesible and it is used for things like sendfile, then
>>> the content of the page is not always touched, and can be passed
>>> directly to a driver, causing issues.
>>>
>>> This patch fixes the issue by adding a call to arch_make_page_accessible
>>> in page_cache_pipe_buf_confirm; this fixes the issue.
>>
>> I spent about 5 minutes putting together a patch:
>>
>> 	https://sr71.net/~dave/intel/accessible.patch
> 
> You only set the page flag for compound pages. that of course leaves a big pile
> of pages marked a not accessible, thus explaining the sendto trace and all kind
> of other random traces.
> 
> 
> What do you see when you also do the  SetPageAccessible(page);
> in the else page of prep_new_page (order == 0).
> (I do get > 10000 of these non compound page allocs just during boot).
> 

And yes, I think you are right that we should call the callback also for !FOLL_PIN.
Dave Hansen April 30, 2020, 7:32 p.m. UTC | #11
On 4/30/20 11:12 AM, Christian Borntraeger wrote:
> On 29.04.20 18:07, Dave Hansen wrote:
>> On 4/28/20 3:50 PM, Claudio Imbrenda wrote:
>>> If a page is inaccesible and it is used for things like sendfile, then
>>> the content of the page is not always touched, and can be passed
>>> directly to a driver, causing issues.
>>>
>>> This patch fixes the issue by adding a call to arch_make_page_accessible
>>> in page_cache_pipe_buf_confirm; this fixes the issue.
>> I spent about 5 minutes putting together a patch:
>>
>> 	https://sr71.net/~dave/intel/accessible.patch
> You only set the page flag for compound pages. that of course leaves a big pile
> of pages marked a not accessible, thus explaining the sendto trace and all kind
> of other random traces.

Ahh, nice catch!  That does explain an oddity or two that I saw.

> What do you see when you also do the  SetPageAccessible(page);
> in the else page of prep_new_page (order == 0).
> (I do get > 10000 of these non compound page allocs just during boot).

Yes, I see the same thing.

I updated the patch and double-checked that it triggers properly with a
socket-based sendfile().
Christian Borntraeger April 30, 2020, 7:38 p.m. UTC | #12
On 30.04.20 21:32, Dave Hansen wrote:
> On 4/30/20 11:12 AM, Christian Borntraeger wrote:
>> On 29.04.20 18:07, Dave Hansen wrote:
>>> On 4/28/20 3:50 PM, Claudio Imbrenda wrote:
>>>> If a page is inaccesible and it is used for things like sendfile, then
>>>> the content of the page is not always touched, and can be passed
>>>> directly to a driver, causing issues.
>>>>
>>>> This patch fixes the issue by adding a call to arch_make_page_accessible
>>>> in page_cache_pipe_buf_confirm; this fixes the issue.
>>> I spent about 5 minutes putting together a patch:
>>>
>>> 	https://sr71.net/~dave/intel/accessible.patch
>> You only set the page flag for compound pages. that of course leaves a big pile
>> of pages marked a not accessible, thus explaining the sendto trace and all kind
>> of other random traces.
> 
> Ahh, nice catch!  That does explain an oddity or two that I saw.
> 
>> What do you see when you also do the  SetPageAccessible(page);
>> in the else page of prep_new_page (order == 0).
>> (I do get > 10000 of these non compound page allocs just during boot).
> 
> Yes, I see the same thing.
> 
> I updated the patch and double-checked that it triggers properly with a
> socket-based sendfile().

Do you have a calltrace?
Christian Borntraeger April 30, 2020, 7:45 p.m. UTC | #13
On 30.04.20 21:32, Dave Hansen wrote:
> On 4/30/20 11:12 AM, Christian Borntraeger wrote:
>> On 29.04.20 18:07, Dave Hansen wrote:
>>> On 4/28/20 3:50 PM, Claudio Imbrenda wrote:
>>>> If a page is inaccesible and it is used for things like sendfile, then
>>>> the content of the page is not always touched, and can be passed
>>>> directly to a driver, causing issues.
>>>>
>>>> This patch fixes the issue by adding a call to arch_make_page_accessible
>>>> in page_cache_pipe_buf_confirm; this fixes the issue.
>>> I spent about 5 minutes putting together a patch:
>>>
>>> 	https://sr71.net/~dave/intel/accessible.patch
>> You only set the page flag for compound pages. that of course leaves a big pile
>> of pages marked a not accessible, thus explaining the sendto trace and all kind
>> of other random traces.
> 
> Ahh, nice catch!  That does explain an oddity or two that I saw.
> 
>> What do you see when you also do the  SetPageAccessible(page);
>> in the else page of prep_new_page (order == 0).
>> (I do get > 10000 of these non compound page allocs just during boot).
> 
> Yes, I see the same thing.
> 
> I updated the patch and double-checked that it triggers properly with a
> socket-based sendfile().

Can you place the updated patch somewhere?
Christian Borntraeger April 30, 2020, 7:54 p.m. UTC | #14
On 30.04.20 21:02, Christian Borntraeger wrote:
> On 30.04.20 20:12, Christian Borntraeger wrote:
>>
>>
>> On 29.04.20 18:07, Dave Hansen wrote:
>>> On 4/28/20 3:50 PM, Claudio Imbrenda wrote:
>>>> If a page is inaccesible and it is used for things like sendfile, then
>>>> the content of the page is not always touched, and can be passed
>>>> directly to a driver, causing issues.
>>>>
>>>> This patch fixes the issue by adding a call to arch_make_page_accessible
>>>> in page_cache_pipe_buf_confirm; this fixes the issue.
>>>
>>> I spent about 5 minutes putting together a patch:
>>>
>>> 	https://sr71.net/~dave/intel/accessible.patch
>>
>> You only set the page flag for compound pages. that of course leaves a big pile
>> of pages marked a not accessible, thus explaining the sendto trace and all kind
>> of other random traces.
>>
>>
>> What do you see when you also do the  SetPageAccessible(page);
>> in the else page of prep_new_page (order == 0).
>> (I do get > 10000 of these non compound page allocs just during boot).
>>
> 
> And yes, I think you are right that we should call the callback also for !FOLL_PIN.


Thinking again about this I am no longer sure. Adding John Hubbard.

Documentation/core-api/pin_user_pages.rst says:
-------snip----------
Another way of thinking about these flags is as a progression of restrictions:
FOLL_GET is for struct page manipulation, without affecting the data that the
struct page refers to. FOLL_PIN is a *replacement* for FOLL_GET, and is for
short term pins on pages whose data *will* get accessed. As such, FOLL_PIN is
a "more severe" form of pinning. And finally, FOLL_LONGTERM is an even more
restrictive case that has FOLL_PIN as a prerequisite: this is for pages that
will be pinned longterm, and whose data will be accessed.
-------snip----------

So John,is it ok to give a page to an I/O device where the code has used gup 
with FOLL_GET (or gup fast without pup) or would you consider this a bug?
Dave Hansen April 30, 2020, 8:01 p.m. UTC | #15
On 4/30/20 12:38 PM, Christian Borntraeger wrote:
>>
>>> What do you see when you also do the  SetPageAccessible(page);
>>> in the else page of prep_new_page (order == 0).
>>> (I do get > 10000 of these non compound page allocs just during boot).
>> Yes, I see the same thing.
>>
>> I updated the patch and double-checked that it triggers properly with a
>> socket-based sendfile().
> Do you have a calltrace? 

It triggers with the thread from this patch _not_ applied.  I just
wanted to point out that it was able to find the real bug and that the
patch in question squashed this instance.

Here's the call trace I see:

> [  199.566150] WARNING: CPU: 0 PID: 878 at mm/page_alloc.c:8860 check_page_accessible+0x5f/0xb0
> [  199.567813] Modules linked in:
> [  199.568447] CPU: 0 PID: 878 Comm: server Not tainted 5.7.0-rc3-dirty #6544
> [  199.569948] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
> [  199.571980] RIP: 0010:check_page_accessible+0x5f/0xb0
> [  199.572737] Code: 4c 01 48 85 db 74 18 48 8b 03 48 8b 7b 08 48 83 c3 18 48 89 ee ff d0 48 8b 03 48 85 c0 75 eb 48 8b 45 00 a9 00 00 20 00 75 bb <0f> 0b 0f 1f 44 00 00 3e 80 4d 02 20 5b 5d c3 65 8b 05 db 72 d3 7e
> [  199.576514] RSP: 0018:ffffc900003f7810 EFLAGS: 00010246
> [  199.577556] RAX: 000ffff800020016 RBX: ffff88800ff5b068 RCX: 0000000000000000
> [  199.578964] RDX: 0000000000000003 RSI: 0000000000000008 RDI: ffff88800fc18400
> [  199.580352] RBP: ffffea0001d5ca40 R08: 00000000000001b5 R09: ffff88800fe74000
> [  199.581784] R10: ffff88800fe74170 R11: ffff88800fc18400 R12: 0000000000001000
> [  199.583179] R13: 0000000000000000 R14: 0000000000000000 R15: ffffc900005e52d0
> [  199.584587] FS:  00007ffff7fe8700(0000) GS:ffff88807ce00000(0000) knlGS:0000000000000000
> [  199.586045] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  199.587200] CR2: 00007ffff7b042b0 CR3: 000000007ab5a000 CR4: 00000000003406f0
> [  199.588655] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [  199.591784] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [  199.593217] Call Trace:
> [  199.593755]  ? e1000_xmit_frame+0x50c/0x1040
> [  199.594626]  ? dev_hard_start_xmit+0x8d/0x1e0
> [  199.596126]  ? sch_direct_xmit+0xe8/0x220
> [  199.597676]  ? __qdisc_run+0x13a/0x4e0
> [  199.598476]  ? __dev_queue_xmit+0x2d7/0x710
> [  199.599767]  ? ip_finish_output2+0x2a8/0x550
> [  199.601195]  ? skb_gso_validate_network_len+0x11/0x70
> [  199.602272]  ? ip_output+0x6d/0xe0
> [  199.602955]  ? ip_forward_options.cold.9+0x27/0x27
> [  199.603918]  ? __ip_queue_xmit+0x14f/0x370
> [  199.604734]  ? __tcp_transmit_skb+0x54b/0xad0
> [  199.605608]  ? tcp_write_xmit+0x379/0x10c0
> [  199.606354]  ? do_tcp_sendpages+0x2bc/0x5f0
> [  199.607206]  ? tcp_sendpage_locked+0x44/0x60
> [  199.608066]  ? tcp_sendpage+0x37/0x50
> [  199.608801]  ? inet_sendpage+0x4f/0x80
> [  199.609554]  ? kernel_sendpage+0x17/0x20
> [  199.610341]  ? sock_sendpage+0x20/0x30
> [  199.610989]  ? pipe_to_sendpage+0x60/0xa0
> [  199.611658]  ? __splice_from_pipe+0x9f/0x180
> [  199.612567]  ? generic_pipe_buf_nosteal+0x10/0x10
> [  199.613509]  ? generic_pipe_buf_nosteal+0x10/0x10
> [  199.614452]  ? splice_from_pipe+0x5d/0x90
> [  199.615258]  ? direct_splice_actor+0x32/0x40
> [  199.616102]  ? splice_direct_to_actor+0x101/0x220
> [  199.617054]  ? pipe_to_sendpage+0xa0/0xa0
> [  199.617858]  ? do_splice_direct+0x9a/0xd0
> [  199.618660]  ? do_sendfile+0x1ce/0x3d0
> [  199.619411]  ? __x64_sys_sendfile64+0x5c/0xc0
> [  199.620343]  ? do_syscall_64+0x4a/0x130
> [  199.621042]  ? entry_SYSCALL_64_after_hwframe+0x49/0xb3
> [  199.622119] ---[ end trace 19796ac5d41cc1f4 ]---
Christian Borntraeger April 30, 2020, 8:03 p.m. UTC | #16
On 30.04.20 22:01, Dave Hansen wrote:
> On 4/30/20 12:38 PM, Christian Borntraeger wrote:
>>>
>>>> What do you see when you also do the  SetPageAccessible(page);
>>>> in the else page of prep_new_page (order == 0).
>>>> (I do get > 10000 of these non compound page allocs just during boot).
>>> Yes, I see the same thing.
>>>
>>> I updated the patch and double-checked that it triggers properly with a
>>> socket-based sendfile().
>> Do you have a calltrace? 
> 
> It triggers with the thread from this patch _not_ applied.  I just
> wanted to point out that it was able to find the real bug and that the
> patch in question squashed this instance.


Ah now it makes sense. Thanks. 
> 
> Here's the call trace I see:
> 
>> [  199.566150] WARNING: CPU: 0 PID: 878 at mm/page_alloc.c:8860 check_page_accessible+0x5f/0xb0
>> [  199.567813] Modules linked in:
>> [  199.568447] CPU: 0 PID: 878 Comm: server Not tainted 5.7.0-rc3-dirty #6544
>> [  199.569948] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
>> [  199.571980] RIP: 0010:check_page_accessible+0x5f/0xb0
>> [  199.572737] Code: 4c 01 48 85 db 74 18 48 8b 03 48 8b 7b 08 48 83 c3 18 48 89 ee ff d0 48 8b 03 48 85 c0 75 eb 48 8b 45 00 a9 00 00 20 00 75 bb <0f> 0b 0f 1f 44 00 00 3e 80 4d 02 20 5b 5d c3 65 8b 05 db 72 d3 7e
>> [  199.576514] RSP: 0018:ffffc900003f7810 EFLAGS: 00010246
>> [  199.577556] RAX: 000ffff800020016 RBX: ffff88800ff5b068 RCX: 0000000000000000
>> [  199.578964] RDX: 0000000000000003 RSI: 0000000000000008 RDI: ffff88800fc18400
>> [  199.580352] RBP: ffffea0001d5ca40 R08: 00000000000001b5 R09: ffff88800fe74000
>> [  199.581784] R10: ffff88800fe74170 R11: ffff88800fc18400 R12: 0000000000001000
>> [  199.583179] R13: 0000000000000000 R14: 0000000000000000 R15: ffffc900005e52d0
>> [  199.584587] FS:  00007ffff7fe8700(0000) GS:ffff88807ce00000(0000) knlGS:0000000000000000
>> [  199.586045] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [  199.587200] CR2: 00007ffff7b042b0 CR3: 000000007ab5a000 CR4: 00000000003406f0
>> [  199.588655] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> [  199.591784] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>> [  199.593217] Call Trace:
>> [  199.593755]  ? e1000_xmit_frame+0x50c/0x1040
>> [  199.594626]  ? dev_hard_start_xmit+0x8d/0x1e0
>> [  199.596126]  ? sch_direct_xmit+0xe8/0x220
>> [  199.597676]  ? __qdisc_run+0x13a/0x4e0
>> [  199.598476]  ? __dev_queue_xmit+0x2d7/0x710
>> [  199.599767]  ? ip_finish_output2+0x2a8/0x550
>> [  199.601195]  ? skb_gso_validate_network_len+0x11/0x70
>> [  199.602272]  ? ip_output+0x6d/0xe0
>> [  199.602955]  ? ip_forward_options.cold.9+0x27/0x27
>> [  199.603918]  ? __ip_queue_xmit+0x14f/0x370
>> [  199.604734]  ? __tcp_transmit_skb+0x54b/0xad0
>> [  199.605608]  ? tcp_write_xmit+0x379/0x10c0
>> [  199.606354]  ? do_tcp_sendpages+0x2bc/0x5f0
>> [  199.607206]  ? tcp_sendpage_locked+0x44/0x60
>> [  199.608066]  ? tcp_sendpage+0x37/0x50
>> [  199.608801]  ? inet_sendpage+0x4f/0x80
>> [  199.609554]  ? kernel_sendpage+0x17/0x20
>> [  199.610341]  ? sock_sendpage+0x20/0x30
>> [  199.610989]  ? pipe_to_sendpage+0x60/0xa0
>> [  199.611658]  ? __splice_from_pipe+0x9f/0x180
>> [  199.612567]  ? generic_pipe_buf_nosteal+0x10/0x10
>> [  199.613509]  ? generic_pipe_buf_nosteal+0x10/0x10
>> [  199.614452]  ? splice_from_pipe+0x5d/0x90
>> [  199.615258]  ? direct_splice_actor+0x32/0x40
>> [  199.616102]  ? splice_direct_to_actor+0x101/0x220
>> [  199.617054]  ? pipe_to_sendpage+0xa0/0xa0
>> [  199.617858]  ? do_splice_direct+0x9a/0xd0
>> [  199.618660]  ? do_sendfile+0x1ce/0x3d0
>> [  199.619411]  ? __x64_sys_sendfile64+0x5c/0xc0
>> [  199.620343]  ? do_syscall_64+0x4a/0x130
>> [  199.621042]  ? entry_SYSCALL_64_after_hwframe+0x49/0xb3
>> [  199.622119] ---[ end trace 19796ac5d41cc1f4 ]---
John Hubbard April 30, 2020, 10:26 p.m. UTC | #17
On 2020-04-30 12:54, Christian Borntraeger wrote:
> On 30.04.20 21:02, Christian Borntraeger wrote:
>> On 30.04.20 20:12, Christian Borntraeger wrote:
>>> On 29.04.20 18:07, Dave Hansen wrote:
>>>> On 4/28/20 3:50 PM, Claudio Imbrenda wrote:
>>>>> If a page is inaccesible and it is used for things like sendfile, then
>>>>> the content of the page is not always touched, and can be passed
>>>>> directly to a driver, causing issues.
>>>>>
>>>>> This patch fixes the issue by adding a call to arch_make_page_accessible
>>>>> in page_cache_pipe_buf_confirm; this fixes the issue.
>>>>
>>>> I spent about 5 minutes putting together a patch:
>>>>
>>>> 	https://sr71.net/~dave/intel/accessible.patch
>>>
>>> You only set the page flag for compound pages. that of course leaves a big pile
>>> of pages marked a not accessible, thus explaining the sendto trace and all kind
>>> of other random traces.
>>>
>>>
>>> What do you see when you also do the  SetPageAccessible(page);
>>> in the else page of prep_new_page (order == 0).
>>> (I do get > 10000 of these non compound page allocs just during boot).
>>>
>>
>> And yes, I think you are right that we should call the callback also for !FOLL_PIN.
> 


Disclaimer: I haven't dug into the details of the latest points above,
so answers below will be narrowly focused.


> 
> Thinking again about this I am no longer sure. Adding John Hubbard.
> 
> Documentation/core-api/pin_user_pages.rst says:
> -------snip----------
> Another way of thinking about these flags is as a progression of restrictions:
> FOLL_GET is for struct page manipulation, without affecting the data that the
> struct page refers to. FOLL_PIN is a *replacement* for FOLL_GET, and is for
> short term pins on pages whose data *will* get accessed. As such, FOLL_PIN is
> a "more severe" form of pinning. And finally, FOLL_LONGTERM is an even more
> restrictive case that has FOLL_PIN as a prerequisite: this is for pages that
> will be pinned longterm, and whose data will be accessed.
> -------snip----------
> 
> So John,is it ok to give a page to an I/O device where the code has used gup
> with FOLL_GET (or gup fast without pup) or would you consider this a bug?
> 

Well, it's a bug (or a bug-in-waiting): even though gup/FOLL_GET works
just as well (and as badly) as ever, pup/FOLL_PIN is required in order
to safely and correctly allow a non-CPU device to operate on a page's
data. Core mm and fs code is going to key off of page_maybe_dma_pinned()
in order to make critical decisions about writeback and umount, and
FOLL_PIN opts into that; FOLL_GET does not.

Basically, you'd be creating another set of call sites that someone
would have to convert to pup/FOLL_PIN.

btw, on the FOLL_LONGTERM documentation above: that's more of an
aspiration than a description of current behavior, in some ways.
The current FOLL_LONGTERM is a little more quirky than is implied
there.

Also on a related note, I've been slow in posting patches to implement
the remaining call site conversions, and am trying to get back to that
asap. There have been some distractions. :) Once every call site is
correctly using gup or pup, it will be easier for everyone.


thanks,
--
John Hubbard
NVIDIA
diff mbox series

Patch

diff --git a/fs/splice.c b/fs/splice.c
index 4735defc46ee..f026e0ce9acd 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -106,6 +106,9 @@  static int page_cache_pipe_buf_confirm(struct pipe_inode_info *pipe,
 	struct page *page = buf->page;
 	int err;
 
+	if (arch_make_page_accessible(page))
+		return -EIO;
+
 	if (!PageUptodate(page)) {
 		lock_page(page);