Message ID | 20191103112113.8256-1-hdanton@sina.com (mailing list archive)
---|---
State | New, archived
Series | [RFC] mm: gup: add helper page_try_gup_pin(page)
On 11/3/19 3:21 AM, Hillf Danton wrote:
>
> A helper is added for mitigating the gup issue described at
> https://lwn.net/Articles/784574/. It is unsafe to write out
> a dirty page that is already gup pinned for DMA.
>
> In the current writeback context, dirty pages are written out without
> detecting whether they have been gup pinned, and without any mark to
> keep gupers off. In the gup context, file pages can be pinned while
> other gupers exist and writeback is ignored.
>
> The fact that no room, supposedly even one bit, in the current
> page struct can be used for tracking gupers makes the issue harder
> to tackle.

Well, as long as we're counting bits, I've taken 21 bits (!) to track
"gupers". :) More accurately, I'm sharing 31 bits with get_page()...please
see my recently posted patchset for tracking dma-pinned pages:

https://lore.kernel.org/r/20191030224930.3990755-1-jhubbard@nvidia.com

Once that is merged, you will have this available:

    static inline bool page_dma_pinned(struct page *page);

...which will reliably track dma-pinned pages.

After that, we still need to convert some more call sites (block/bio
in particular) to the new pin_user_pages()/put_user_page() system, in
order for filesystems to take advantage of it, but this approach has
the advantage of actually tracking such pages, rather than faking it by
hoping that there is only one gup caller at a time.

> The approach here is, because it makes no sense to allow a file page
> to have multiple gupers at the same time, looking to make gupers

ohhh...no, that's definitely not a claim you can make.

> mutually exclusive, and then the guper's singularity helps to tell if
> a guper exists by staring at the change in page count.
>
> The result of that singularity is not yet 100% correct, but something
> of "best effort", as the effect of random get_page() is perhaps also
> folded into it.
>
> It is assumed the best effort is feasible/acceptable in practice
> without the cost of growing the page struct size by one bit,
> given that something similar has been applied to the page migrate
> and reclaim contexts for a while.
>
> With the helper in place, we skip writing out a dirty page if a
> guper is detected; on gupping, we give up pinning a file page due
> to writeback, or on losing the race to become a guper.
>
> The end result is, no gup-pinned page will be put under writeback.

I think you must have missed the many contentious debates about the
tension between gup-pinned pages and writeback. File systems can't
just ignore writeback in all cases. This patch leads to either
system hangs or filesystem corruption, in the presence of long-lasting
gup pins.

Really, this won't work. Sorry.

thanks,

John Hubbard
NVIDIA
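The bias-based tracking John describes can be modeled in plain userspace C. This is a hedged sketch, not the kernel code: the mock `struct page`, the helper names, and the bias value of 1024 are assumptions drawn only from the description above (each pin shares the 31-bit refcount, and a bias of 2^10 leaves about 2^21 trackable pins, consistent with the "21 bits" remark):

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Minimal stand-in for the kernel's struct page refcount. */
struct page {
    atomic_int _refcount;
};

/*
 * Each dma pin adds a large bias instead of 1. With a bias of 1024,
 * a 31-bit counter leaves about 2^21 distinguishable pins.
 * (Hypothetical userspace model of the scheme described above.)
 */
#define GUP_PIN_COUNTING_BIAS 1024

static void get_page(struct page *page)
{
    atomic_fetch_add(&page->_refcount, 1);
}

static void pin_user_page(struct page *page)
{
    atomic_fetch_add(&page->_refcount, GUP_PIN_COUNTING_BIAS);
}

static void put_user_page(struct page *page)
{
    atomic_fetch_sub(&page->_refcount, GUP_PIN_COUNTING_BIAS);
}

/*
 * True if at least one dma pin is outstanding. Deliberately
 * approximate: ~1024 piled-up plain get_page() references would
 * read as a false positive.
 */
static bool page_dma_pinned(struct page *page)
{
    return atomic_load(&page->_refcount) >= GUP_PIN_COUNTING_BIAS;
}
```

The point of the bias is that the flusher side needs only a threshold test, at the cost of a bounded false-positive rate, rather than a separate page-struct bit.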
On Sun, 3 Nov 2019 12:20:10 -0800 John Hubbard wrote:
> On 11/3/19 3:21 AM, Hillf Danton wrote:
> >
> > A helper is added for mitigating the gup issue described at
> > https://lwn.net/Articles/784574/. It is unsafe to write out
> > a dirty page that is already gup pinned for DMA.
> > ...
> > The fact that no room, supposedly even one bit, in the current
> > page struct can be used for tracking gupers makes the issue harder
> > to tackle.
>
> Well, as long as we're counting bits, I've taken 21 bits (!) to track
> "gupers". :) More accurately, I'm sharing 31 bits with get_page()...please

Would you please specify the reasoning behind tracking multiple gupers
for a dirty page? Do you mean that it is all fine for guper-A to add
changes to guper-B's data without warning, and vice versa?

> see my recently posted patchset for tracking dma-pinned pages:
>
> https://lore.kernel.org/r/20191030224930.3990755-1-jhubbard@nvidia.com
>
> Once that is merged, you will have this available:
>
>     static inline bool page_dma_pinned(struct page *page);
>
> ...which will reliably track dma-pinned pages.
>
> After that, we still need to convert some more call sites (block/bio
> in particular) to the new pin_user_pages()/put_user_page() system, in
> order for filesystems to take advantage of it, but this approach has
> the advantage of actually tracking such pages, rather than faking it by
> hoping that there is only one gup caller at a time.
>
> > The approach here is, because it makes no sense to allow a file page
> > to have multiple gupers at the same time, looking to make gupers
>
> ohhh...no, that's definitely not a claim you can make.
>
> > mutually exclusive, and then the guper's singularity helps to tell if
> > a guper exists by staring at the change in page count.
> > ...
> > The end result is, no gup-pinned page will be put under writeback.
>
> I think you must have missed the many contentious debates about the
> tension between gup-pinned pages and writeback. File systems can't
> just ignore writeback in all cases. This patch leads to either
> system hangs or filesystem corruption, in the presence of long-lasting
> gup pins.

The current risk of data corruption, due to writeback ignoring all
long-lived gup references, is zeroed out by detecting gup-pinned dirty
pages and skipping them; that may lead to the problems you mention
above.

Though I doubt anything helpful about it can be expected from fs in the
near future, we have options, for instance that gupers periodically
release their references and re-pin pages after data sync, the same way
the current flusher does.

> Really, this won't work. Sorry.
>
> thanks,
>
> John Hubbard
> NVIDIA
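For contrast, the RFC's exclusive-pin idea (the patch body itself is not quoted in this thread) might look roughly like the following userspace model. The name `page_try_gup_pin` comes from the series title, but the `expected_refs` parameter, the `writeback` flag, and the compare-and-swap are hypothetical reconstructions of "check PageWriteback, then detect a guper by staring at the change in page count":

```c
#include <stdatomic.h>
#include <stdbool.h>

struct page {
    atomic_int _refcount;
    atomic_bool writeback;   /* stand-in for PageWriteback() */
};

/*
 * Hypothetical sketch of the RFC's single-guper scheme: pinning
 * succeeds only when the page is not under writeback and the page
 * count sits at an expected baseline (i.e. no other guper yet).
 * Best effort only: a concurrent plain get_page() is
 * indistinguishable from a guper and makes the attempt fail.
 */
static bool page_try_gup_pin(struct page *page, int expected_refs)
{
    if (atomic_load(&page->writeback))
        return false;       /* give up: page is being written out */

    int old = expected_refs;
    /* Become the one and only guper, or lose the race. */
    return atomic_compare_exchange_strong(&page->_refcount,
                                          &old, expected_refs + 1);
}
```

This makes the weakness John points at concrete: any second pinner, legitimate or not, is refused, and any stray reference defeats the detection.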
On 11/3/19 8:34 PM, Hillf Danton wrote:
...
>>
>> Well, as long as we're counting bits, I've taken 21 bits (!) to track
>> "gupers". :) More accurately, I'm sharing 31 bits with get_page()...please
>
> Would you please specify the reasoning behind tracking multiple gupers
> for a dirty page? Do you mean that it is all fine for guper-A to add
> changes to guper-B's data without warning, and vice versa?

It's generally OK to call get_user_pages() on a page more than once.
And even though we are seeing some work to reduce the number of places
in the kernel that call get_user_pages(), there are still lots of call
sites. That means lots of combinations and situations that could result
in more than one gup call per page.

Furthermore, there is no mechanism, convention, documentation, nor
anything at all that attempts to enforce "for each page,
get_user_pages() may only be called once."

...
>>
>> I think you must have missed the many contentious debates about the
>> tension between gup-pinned pages and writeback. File systems can't
>> just ignore writeback in all cases. This patch leads to either
>> system hangs or filesystem corruption, in the presence of long-lasting
>> gup pins.
>
> The current risk of data corruption, due to writeback ignoring all
> long-lived gup references, is zeroed out by detecting gup-pinned dirty
> pages and skipping them; that may lead to the problems you mention
> above.

Here, I believe you're pointing out that the current situation in the
kernel is already broken, with respect to fs interactions (especially
writeback) with gup. Yes, you are correct, there is a problem.

> Though I doubt anything helpful about it can be expected from fs in the
> near

Actually, fs and mm folks are working together to solve this.

> future, we have options, for instance that gupers periodically release
> their references and re-pin pages after data sync, the same way the
> current flusher does.

That's one idea. I don't see it as viable, given the behavior of, say,
a compute process running OpenCL jobs on a GPU that is connected via
a network or InfiniBand card--the idea of "pause" really looks more like
"tear down the complicated multi-driver connection, writeback, then set
it all up again", I suspect. (And if we could easily interrupt the job,
we'd probably really be running with a page-fault-capable GPU plus an IB
card that does ODP, plus HMM, and we wouldn't need to gup-pin anyway...)

Anyway, this is not amenable to quick fixes, because the problem is
a couple of missing design pieces. Which we're working on putting in.
But meanwhile, smaller changes such as this one are just going to move
the problems to different places, rather than solving them. So it's best
not to do that.

thanks,
On Sun 03-11-19 22:09:03, John Hubbard wrote:
> On 11/3/19 8:34 PM, Hillf Danton wrote:
> > future, we have options, for instance that gupers periodically release
> > their references and re-pin pages after data sync, the same way the
> > current flusher does.
>
> That's one idea. I don't see it as viable, given the behavior of, say,
> a compute process running OpenCL jobs on a GPU that is connected via
> a network or InfiniBand card--the idea of "pause" really looks more like
> "tear down the complicated multi-driver connection, writeback, then set
> it all up again", I suspect. (And if we could easily interrupt the job,
> we'd probably really be running with a page-fault-capable GPU plus an IB
> card that does ODP, plus HMM, and we wouldn't need to gup-pin anyway...)
>
> Anyway, this is not amenable to quick fixes, because the problem is
> a couple of missing design pieces. Which we're working on putting in.
> But meanwhile, smaller changes such as this one are just going to move
> the problems to different places, rather than solving them. So it's best
> not to do that.

Yeah, fully agreed here. Quick half-baked fixes will make the current
messy situation even worse...

Honza
On Sun, 3 Nov 2019 22:09:03 -0800 John Hubbard wrote:
> On 11/3/19 8:34 PM, Hillf Danton wrote:
> ...
> >>
> >> Well, as long as we're counting bits, I've taken 21 bits (!) to track
> >> "gupers". :) More accurately, I'm sharing 31 bits with get_page()...please
> >
> > Would you please specify the reasoning behind tracking multiple gupers
> > for a dirty page? Do you mean that it is all fine for guper-A to add
> > changes to guper-B's data without warning, and vice versa?
>
> It's generally OK to call get_user_pages() on a page more than once.

Does this explain that it's generally OK to gup pin a page under
writeback and then start DMA to it behind the flusher's back, without
warning?

> And even though we are seeing some work to reduce the number of places
> in the kernel that call get_user_pages(), there are still lots of call
> sites. That means lots of combinations and situations that could result
> in more than one gup call per page.
>
> Furthermore, there is no mechanism, convention, documentation, nor
> anything at all that attempts to enforce "for each page,
> get_user_pages() may only be called once."

What sense does this make wrt the data corruption resulting
specifically from multiple gup references?

> ...
> >>
> >> I think you must have missed the many contentious debates about the
> >> tension between gup-pinned pages and writeback. File systems can't
> >> just ignore writeback in all cases. This patch leads to either
> >> system hangs or filesystem corruption, in the presence of long-lasting
> >> gup pins.
> >
> > The current risk of data corruption, due to writeback ignoring all
> > long-lived gup references, is zeroed out by detecting gup-pinned dirty
> > pages and skipping them; that may lead to the problems you mention
> > above.
>
> Here, I believe you're pointing out that the current situation in the
> kernel is already broken, with respect to fs interactions (especially
> writeback) with gup. Yes, you are correct, there is a problem.
>
> > Though I doubt anything helpful about it can be expected from fs in the
> > near
>
> Actually, fs and mm folks are working together to solve this.
>
> > future, we have options, for instance that gupers periodically release
> > their references and re-pin pages after data sync, the same way the
> > current flusher does.
>
> That's one idea. I don't see it as viable, given the behavior of, say,
> a compute process running OpenCL jobs on a GPU that is connected via
> a network or InfiniBand card--the idea of "pause" really looks more like
> "tear down the complicated multi-driver connection, writeback, then set
> it all up again", I suspect. (And if we could easily interrupt the job,
> we'd probably really be running with a page-fault-capable GPU plus an IB
> card that does ODP, plus HMM, and we wouldn't need to gup-pin anyway...)

Well, is it OK to shorten the behavior above to "data corruption in
writeback is tolerable in practice because of the expensive cost of
data sync"?

What is the point of writeback? Why can the writeback of long-lived
gup-pinned pages not be skipped, while data sync can be entirely
ignored?

> Anyway, this is not amenable to quick fixes, because the problem is
> a couple of missing design pieces. Which we're working on putting in.
> But meanwhile, smaller changes such as this one are just going to move
> the problems to different places, rather than solving them. So it's best
> not to do that.
>
> thanks,
> --
> John Hubbard
> NVIDIA
On Mon, Nov 04, 2019 at 06:20:50PM +0800, Hillf Danton wrote:
>
> On Sun, 3 Nov 2019 22:09:03 -0800 John Hubbard wrote:
> > On 11/3/19 8:34 PM, Hillf Danton wrote:
> > ...
> > > Would you please specify the reasoning behind tracking multiple gupers
> > > for a dirty page? Do you mean that it is all fine for guper-A to add
> > > changes to guper-B's data without warning, and vice versa?
> >
> > It's generally OK to call get_user_pages() on a page more than once.
>
> Does this explain that it's generally OK to gup pin a page under
> writeback and then start DMA to it behind the flusher's back, without
> warning?

It can happen today; is it OK ... well, no, but we live in an imperfect
world. GUP has been abused by a few device drivers over the years, and
those never checked what it meant to use it, so now we are left with
existing device drivers, doing the wrong thing, that we can not break.

I personally think that we should use a bounce page for writeback, so
that writeback can still happen if a page is GUPed. John's patchset is
the first step to being able to identify GUPed pages and maybe
special-case them.

> > And even though we are seeing some work to reduce the number of places
> > in the kernel that call get_user_pages(), there are still lots of call
> > sites. That means lots of combinations and situations that could result
> > in more than one gup call per page.
> >
> > Furthermore, there is no mechanism, convention, documentation, nor
> > anything at all that attempts to enforce "for each page,
> > get_user_pages() may only be called once."
>
> What sense does this make wrt the data corruption resulting
> specifically from multiple gup references?

Multiple GUP references do not imply corruption. Only one or more
devices writing to the page while writeback is happening is a cause of
corruption. Multiple devices writing to the same page concurrently is
like multiple CPU threads doing the same. Either the application/device
drivers are doing this rightfully, on purpose, or the application has a
bug. Either way it is not our problem (note, here I am talking about
the userspace portion of the device driver).

> > ...
> > > The current risk of data corruption, due to writeback ignoring all
> > > long-lived gup references, is zeroed out by detecting gup-pinned dirty
> > > pages and skipping them; that may lead to the problems you mention
> > > above.
> >
> > Here, I believe you're pointing out that the current situation in the
> > kernel is already broken, with respect to fs interactions (especially
> > writeback) with gup. Yes, you are correct, there is a problem.
> > ...
>
> Well, is it OK to shorten the behavior above to "data corruption in
> writeback is tolerable in practice because of the expensive cost of
> data sync"?
>
> What is the point of writeback? Why can the writeback of long-lived
> gup-pinned pages not be skipped, while data sync can be entirely
> ignored?

I do not think we want that (skip writeback on a GUPed page). I think
what we should do is use a bounce page, i.e. take a snapshot of the
page and start writeback with the snapshot. We need a snapshot because
fs code expects stable page content for things like encryption or
hashing or other crazy fs features :)

Cheers,
Jérôme
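Jérôme's bounce-page suggestion can be sketched as a userspace model (the function name `writeback_bounce` and the fixed page size are illustrative assumptions, not kernel API): snapshot the page once into a private buffer, then let writeback proceed from the stable copy even while a device keeps DMA-ing into the original.

```c
#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE 4096

/*
 * Snapshot a (possibly GUP-pinned) page into a freshly allocated
 * bounce page, so writeback can run from stable content while the
 * device keeps scribbling on the original. The caller frees the
 * bounce page after I/O completes. (Illustrative model only.)
 */
static unsigned char *writeback_bounce(const unsigned char *page)
{
    unsigned char *bounce = malloc(PAGE_SIZE);
    if (!bounce)
        return NULL;
    memcpy(bounce, page, PAGE_SIZE);  /* one memory copy per page */
    return bounce;
}
```

The trade-off is visible in the `memcpy`: stability for fs features like encryption or checksumming is bought with one copy per written-back page.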
On Mon, 4 Nov 2019 14:03:55 -0500 Jerome Glisse wrote:
>
> On Mon, Nov 04, 2019 at 06:20:50PM +0800, Hillf Danton wrote:
> > ...
> > Does this explain that it's generally OK to gup pin a page under
> > writeback and then start DMA to it behind the flusher's back, without
> > warning?
>
> It can happen today; is it OK ... well, no, but we live in an imperfect
> world. GUP has been abused by a few device drivers over the years, and
> those never checked what it meant to use it, so now we are left with
> existing device drivers, doing the wrong thing, that we can not break.

See your point :)

> I personally think that we should use a bounce page for writeback, so
> that writeback can still happen if a page is GUPed.

Gup can be prevented from falling foul of writeback, IMHO, if the page
under writeback, gup-pinned or not, remains stable until it is no
longer dirty.

For that stability, either we can check PageWriteback on gup pinning,
for instance as the RFC does, or drivers can set a gup-pinned page
dirty only after DMA and start no more DMA until it is clean again.

As long as that stability is ensured, writeback will no longer need to
take care of gup pins, long-lived or transient.

That does not seem too strict a request to meet in practice wrt data
corruption, and a bounce page for writeback sounds promising. Does it
need to do a memory copy?

> John's patchset is the
> first step to being able to identify GUPed pages and maybe
> special-case them.
> ...
On 04.11.19 20:03, Jerome Glisse wrote:
> On Mon, Nov 04, 2019 at 06:20:50PM +0800, Hillf Danton wrote:
...
>> What sense does this make wrt the data corruption resulting
>> specifically from multiple gup references?
>
> Multiple GUP references do not imply corruption. Only one or more
> devices writing to the page while writeback is happening is a cause of
> corruption. Multiple devices writing to the same page concurrently is
> like multiple CPU threads doing the same. Either the application/device
> drivers are doing this rightfully, on purpose, or the application has a
> bug. Either way it is not our problem (note, here I am talking about
> the userspace portion of the device driver).

If I'm not completely off, we can have multiple GUP references easily
by using KVM+VFIO.
On Tue, Nov 05, 2019 at 12:27:55PM +0800, Hillf Danton wrote:
>
> On Mon, 4 Nov 2019 14:03:55 -0500 Jerome Glisse wrote:
> > ...
> > I personally think that we should use a bounce page for writeback, so
> > that writeback can still happen if a page is GUPed.
>
> Gup can be prevented from falling foul of writeback, IMHO, if the page
> under writeback, gup-pinned or not, remains stable until it is no
> longer dirty.
>
> For that stability, either we can check PageWriteback on gup pinning,
> for instance as the RFC does, or drivers can set a gup-pinned page
> dirty only after DMA and start no more DMA until it is clean again.
>
> As long as that stability is ensured, writeback will no longer need to
> take care of gup pins, long-lived or transient.
>
> That does not seem too strict a request to meet in practice wrt data
> corruption, and a bounce page for writeback sounds promising. Does it
> need to do a memory copy?

Once a driver has GUPed a page, it does not check and re-check the
struct page, so there is no synchronization whatsoever after GUP has
happened. In fact, for some drivers you can not synchronize anything
once the device has been programmed. Many devices are not just simple
DMA engines you can start and stop at will (network, GPUs, ...).

So once a page is GUPed there is just no way to guarantee its
stability; hence the best thing we can do is snapshot it to a bounce
page.

Cheers,
Jérôme
On Tue, 5 Nov 2019 10:54:15 -0500 Jerome Glisse wrote:
>
> On Tue, Nov 05, 2019 at 12:27:55PM +0800, Hillf Danton wrote:
> > ...
> > That does not seem too strict a request to meet in practice wrt data
> > corruption, and a bounce page for writeback sounds promising. Does it
> > need to do a memory copy?
>
> Once a driver has GUPed a page, it does not check and re-check the
> struct page, so there is no synchronization whatsoever after GUP has
> happened. In fact, for some drivers you can not synchronize anything
> once the device has been programmed. Many devices are not just simple
> DMA engines you can start and stop at will (network, GPUs, ...).

Because "there is no synchronization whatsoever after GUP has
happened," we need to take another close look at the reasoning for
tracking multiple gupers, if the chance of their mutual data corruption
exists in the wild. (If any sync mechanism sits between them to avoid
data corruption, then it seems a single pin is enough.)

> So once a page is GUPed there is just no way to guarantee its
> stability; hence the best thing we can do is snapshot it to a bounce
> page.

It becomes clearer, OTOH, that we are more likely than not moving in
the wrong direction, in cases like how to detect gupers and what to do
for writeback if a page is gup-pinned, without a clear picture of the
bounce page in the first place. Any plan to post a patch, just to show
the idea?

Hillf
On Wed, Nov 06, 2019 at 05:22:40PM +0800, Hillf Danton wrote:
>
> On Tue, 5 Nov 2019 10:54:15 -0500 Jerome Glisse wrote:
> >
> > On Tue, Nov 05, 2019 at 12:27:55PM +0800, Hillf Danton wrote:
> > >
> > > On Mon, 4 Nov 2019 14:03:55 -0500 Jerome Glisse wrote:
> > > >
> > > > On Mon, Nov 04, 2019 at 06:20:50PM +0800, Hillf Danton wrote:
> > > > >
> > > > > On Sun, 3 Nov 2019 22:09:03 -0800 John Hubbard wrote:
> > > > > > On 11/3/19 8:34 PM, Hillf Danton wrote:
> > > > > > ...
> > > > > > >>
> > > > > > >> Well, as long as we're counting bits, I've taken 21 bits (!) to track
> > > > > > >> "gupers". :) More accurately, I'm sharing 31 bits with get_page()...please
> > > > > > >
> > > > > > > Would you please specify the reasoning of tracking multiple gupers
> > > > > > > for a dirty page? Do you mean that it is all fine for guper-A to add
> > > > > > > changes to guper-B's data without warning and vice versa?
> > > > > >
> > > > > > It's generally OK to call get_user_pages() on a page more than once.
> > > > >
> > > > > Does this explain that it's generally OK to gup pin a page under
> > > > > writeback and then start DMA to it behind the flusher's back without
> > > > > warning?
> > > >
> > > > It can happen today; is it OK ... well, no, but we live in an imperfect
> > > > world. GUP has been abused by a few device drivers over the years, and
> > > > those never checked what it meant to use it, so now we are left with
> > > > existing device drivers that we cannot break and that do the wrong thing.
> > >
> > > See your point :)
> > >
> > > > I personally think that we should use a bounce page for writeback so
> > > > that writeback can still happen if a page is GUPed.
> > >
> > > Gup can be prevented from falling foul of writeback IMHO if the page
> > > under writeback, gup pinned or not, remains stable until it is no
> > > longer dirty.
> > >
> > > For that stability, either we can check PageWriteback on gup pinning,
> > > for instance as the RFC does, or drivers can set a gup-pinned page
> > > dirty only after DMA and start no more DMA until it is clean again.
> > >
> > > As long as that stability is ensured, writeback will no longer need to
> > > take care of gup pins, long-lived or transient.
> > >
> > > It does not seem like too strict a request to meet in practice wrt data
> > > corruption, and a bounce page for writeback sounds promising. Does it
> > > need to do a memory copy?
> >
> > Once a driver has GUPed a page it does not check and re-check the struct
> > page, so there is no synchronization whatsoever after GUP happened. In
> > fact for some drivers you cannot synchronize anything once the device
> > has been programmed. Many devices are not just simple DMA engines you
> > can start and stop at will (network, GPUs, ...).
>
> Because "there is no synchronization whatsoever after GUP happened,"
> we need to take another close look at the reasoning for tracking
> multiple gupers if the chance of their mutual data corruptions exists
> in the wild. (If any sync mechanism sits between them to avoid data
> corruption, then it seems a single pin is enough.)

It does exist in the wild, but the userspace application would be either
doing something stupid or something terribly clever. For instance you
can have 2 network interfaces writing to the same GUPed page, but that is
because the application made the same request over two NICs and both
end up writing the same thing.

You can also have 2 GUPers each writing to different parts of the page
and never stepping on each other.

The point really is that from the kernel's point of view there is just no
way to know if the application is doing something wrong or if it is just
perfectly fine. This is exactly the same thing as CPU threads: you do
not ask the kernel to ascertain whether what application threads are
doing is wrong or right.
So we have to live with the fact that we can have multiple GUPers and
that it is not our problem if that happens; we can do nothing about it.

Note that we are removing GUP from some of those drivers, ones where
the device can abide by mmu notifiers. But that is just something
orthogonal to all this.

> > So once a page is GUPed there is just no way to guarantee its stability,
> > hence the best thing we can do is snapshot it to a bounce page.
>
> It becomes clearer OTOH that we are more likely than not moving in
> the incorrect direction, in cases like how to detect gupers and what
> to do for writeback if a page is gup pinned, without a clear picture
> of the bounce page in the first place. Any plan to post a patch just
> for an idea show?

The agreement so far is that we need to be able to identify GUPed
pages, and this is what John's patchset does. Once we have that piece
we can discuss what to do in respect of write-back. That is
still something where there is no agreement, as far as I remember the
outcome of the last discussion we had. I expect this will be a topic
at the next LSF/MM, or maybe something we can flush out before.

In any case my opinion is that the bounce page is the best thing we can
do: from the application and FS point of view it mimics the characteristics
of regular write-back, just as if the write-protection window of the
written-back page was infinitely short.

Cheers,
Jérôme
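[The bounce-page scheme described above can be sketched in user space. This is a toy model, not kernel code; `writeback_bounce` and the `PAGE_SIZE` constant are illustrative stand-ins. Writeback snapshots the page into a private buffer and submits the snapshot, so DMA landing in the original page after the snapshot cannot corrupt the data in flight.]

```c
#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE 4096

/*
 * Snapshot @page into a freshly allocated bounce buffer; the caller
 * would submit the bounce buffer to the block layer instead of @page,
 * so concurrent DMA into @page cannot touch the data being written out.
 * Hypothetical helper, loosely modeled on the discussion above.
 */
static unsigned char *writeback_bounce(const unsigned char *page)
{
	unsigned char *bounce = malloc(PAGE_SIZE);

	if (!bounce)
		return NULL;
	/* the "infinitely short" stable window: one memcpy */
	memcpy(bounce, page, PAGE_SIZE);
	return bounce;
}
```

[The snapshot gives writeback a stable image even while the GUPed page itself keeps changing, which is what makes the write-protection window look infinitely short to the filesystem.]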
On Wed, 6 Nov 2019 10:46:29 -0500 Jerome Glisse wrote:
>
> On Wed, Nov 06, 2019 at 05:22:40PM +0800, Hillf Danton wrote:
[...]
> > >
> > > Once a driver has GUPed a page it does not check and re-check the struct
> > > page, so there is no synchronization whatsoever after GUP happened. In
> > > fact for some drivers you cannot synchronize anything once the device
> > > has been programmed. Many devices are not just simple DMA engines you
> > > can start and stop at will (network, GPUs, ...).
> >
> > Because "there is no synchronization whatsoever after GUP happened,"
> > we need to take another close look at the reasoning for tracking
> > multiple gupers if the chance of their mutual data corruptions exists
> > in the wild. (If any sync mechanism sits between them to avoid data
> > corruption, then it seems a single pin is enough.)
>
> It does exist in the wild, but the userspace application would be either
> doing something stupid or something terribly clever. For instance you
> can have 2 network interfaces writing to the same GUPed page, but that is
> because the application made the same request over two NICs and both
> end up writing the same thing.
>
> You can also have 2 GUPers each writing to different parts of the page
> and never stepping on each other.
>
> The point really is that from the kernel's point of view there is just no
> way to know if the application is doing something wrong or if it is just
> perfectly fine. This is exactly the same thing as CPU threads: you do
> not ask the kernel to ascertain whether what application threads are
> doing is wrong or right.
>
> So we have to live with the fact that we can have multiple GUPers and
> that it is not our problem if that happens; we can do nothing about it.

Ok. Multiple gupers are a must-have, and perhaps their mutual data
corruptions as well.

> Note that we are removing GUP from some of those drivers, ones where
> the device can abide by mmu notifiers. But that is just something
> orthogonal to all this.
>
> > > So once a page is GUPed there is just no way to guarantee its stability,
> > > hence the best thing we can do is snapshot it to a bounce page.
> >
> > It becomes clearer OTOH that we are more likely than not moving in
> > the incorrect direction, in cases like how to detect gupers and what
> > to do for writeback if a page is gup pinned, without a clear picture
> > of the bounce page in the first place. Any plan to post a patch just
> > for an idea show?
>
> The agreement so far is that we need to be able to identify GUPed
> pages, and this is what John's patchset does. Once we have that piece

Oh they are there, and barely trot away in case of a long-lived pin.

> we can discuss what to do in respect of write-back. That is

Nobody seems to care what to do in the absence of a gup pin.

> still something where there is no agreement, as far as I remember the
> outcome of the last discussion we had. I expect this will be a topic
> at the next LSF/MM, or maybe something we can flush out before.

These are the restraints we know:

A, multiple gup pins
B, mutual data corruptions
C, no break of existing use cases
D, zero copy
E, feel free to add

then what is preventing an agreement like the bounce page?

Because page migrate and reclaim have been working for a while with
gup pins taken into account, detecting them has no priority in any form
over the agreement on how to make a writeback page stable.

What seems more important, restriction B above makes C hard to meet
in any feasible approach trying to keep a writeback page stable, and
zero-copy makes it harder AFAICS.

> In any case my opinion is that the bounce page is the best thing we can
> do: from the application and FS point of view it mimics the characteristics
> of regular write-back, just as if the write-protection window of the
> written-back page was infinitely short.

A 100-line patch tells more than a 200-line explanation can and helps
to shorten the discussion prior to reaching an agreement.

Hillf
On Thu, Nov 07, 2019 at 05:50:17PM +0800, Hillf Danton wrote:
>
> On Wed, 6 Nov 2019 10:46:29 -0500 Jerome Glisse wrote:
> >
> > On Wed, Nov 06, 2019 at 05:22:40PM +0800, Hillf Danton wrote:
> [...]
> > > >
> > > > Once a driver has GUPed a page it does not check and re-check the struct
> > > > page, so there is no synchronization whatsoever after GUP happened. In
> > > > fact for some drivers you cannot synchronize anything once the device
> > > > has been programmed. Many devices are not just simple DMA engines you
> > > > can start and stop at will (network, GPUs, ...).
> > >
> > > Because "there is no synchronization whatsoever after GUP happened,"
> > > we need to take another close look at the reasoning for tracking
> > > multiple gupers if the chance of their mutual data corruptions exists
> > > in the wild. (If any sync mechanism sits between them to avoid data
> > > corruption, then it seems a single pin is enough.)
> >
> > It does exist in the wild, but the userspace application would be either
> > doing something stupid or something terribly clever. For instance you
> > can have 2 network interfaces writing to the same GUPed page, but that is
> > because the application made the same request over two NICs and both
> > end up writing the same thing.
> >
> > You can also have 2 GUPers each writing to different parts of the page
> > and never stepping on each other.
> >
> > The point really is that from the kernel's point of view there is just no
> > way to know if the application is doing something wrong or if it is just
> > perfectly fine. This is exactly the same thing as CPU threads: you do
> > not ask the kernel to ascertain whether what application threads are
> > doing is wrong or right.
> >
> > So we have to live with the fact that we can have multiple GUPers and
> > that it is not our problem if that happens; we can do nothing about it.
>
> Ok. Multiple gupers are a must-have, and perhaps their mutual data
> corruptions as well.
>
> > Note that we are removing GUP from some of those drivers, ones where
> > the device can abide by mmu notifiers. But that is just something
> > orthogonal to all this.
> >
> > > > So once a page is GUPed there is just no way to guarantee its stability,
> > > > hence the best thing we can do is snapshot it to a bounce page.
> > >
> > > It becomes clearer OTOH that we are more likely than not moving in
> > > the incorrect direction, in cases like how to detect gupers and what
> > > to do for writeback if a page is gup pinned, without a clear picture
> > > of the bounce page in the first place. Any plan to post a patch just
> > > for an idea show?
> >
> > The agreement so far is that we need to be able to identify GUPed
> > pages, and this is what John's patchset does. Once we have that piece
>
> Oh they are there, and barely trot away in case of a long-lived pin.
>
> > we can discuss what to do in respect of write-back. That is
>
> Nobody seems to care what to do in the absence of a gup pin.

I am not sure I follow. Today we cannot differentiate between GUP
and regular get_page(); if you use some combination of specific fs
and hardware you might get some BUG_ON() thrown at you, depending on
how lucky/unlucky you are. We cannot solve this without being able
to differentiate between GUP and regular get_page(). Hence why John's
patchset is the first step in the right direction.

If there is no GUP on a page then regular writeback happens as it has
for years now, so in the absence of GUP I do not see any issue.

> > still something where there is no agreement, as far as I remember the
> > outcome of the last discussion we had. I expect this will be a topic
> > at the next LSF/MM, or maybe something we can flush out before.
>
> These are the restraints we know:
>
> A, multiple gup pins
> B, mutual data corruptions
> C, no break of existing use cases
> D, zero copy

? What do you mean by zero copy?

> E, feel free to add
>
> then what is preventing an agreement like the bounce page?
There are 2 sides (AFAIR):
  - do not write back GUPed pages and wait until the GUP goes away to
    write them. But a GUP can last as long as the uptime and we can
    lose data on power failure.
  - use a bounce page so that there is a chance we have some data
    on power failure.

> Because page migrate and reclaim have been working for a while with
> gup pins taken into account, detecting them has no priority in any form
> over the agreement on how to make a writeback page stable.

migrate just ignores GUPed pages and thus there is no issue with migrate.
writeback is a special case here because some filesystems need stable
page content and also we need to inhibit some fs-specific things that
trigger BUG_ON() in set_page_dirty*().

> What seems more important, restriction B above makes C hard to meet
> in any feasible approach trying to keep a writeback page stable, and
> zero-copy makes it harder AFAICS.

writeback can use a bounce page; zero copy, ie not having to use a bounce
page, is not an issue; in fact in some cases we already use bounce pages
(at the block device level).

> > In any case my opinion is that the bounce page is the best thing we can
> > do: from the application and FS point of view it mimics the characteristics
> > of regular write-back, just as if the write-protection window of the
> > written-back page was infinitely short.
>
> A 100-line patch tells more than a 200-line explanation can and helps
> to shorten the discussion prior to reaching an agreement.

It is not that trivial: you need to make sure every layer from the fs down
to the block device driver properly behaves in front of a bounce page. We
have such a mechanism for bio, but it is at the bio level; maybe it can be
moved up one level.

Cheers,
Jérôme
On Thu, 7 Nov 2019 09:57:48 -0500 Jerome Glisse wrote:
>
> I am not sure I follow. Today we cannot differentiate between GUP
> and regular get_page(); if you use some combination of specific fs
> and hardware you might get some BUG_ON() thrown at you, depending on
> how lucky/unlucky you are. We cannot solve this without being able
> to differentiate between GUP and regular get_page(). Hence why John's
> patchset is the first step in the right direction.
>
What is the second one? And when? By who?

> If there is no GUP on a page then regular writeback happens as it has
> for years now, so in the absence of GUP I do not see any issue.
>
> > > still something where there is no agreement, as far as I remember the
> > > outcome of the last discussion we had. I expect this will be a topic
> > > at the next LSF/MM, or maybe something we can flush out before.
> >
> > These are the restraints we know:
> >
> > A, multiple gup pins
> > B, mutual data corruptions
> > C, no break of existing use cases
> > D, zero copy
>
> ? What do you mean by zero copy?
>
Snippet that can be found at https://lwn.net/Articles/784574/

"get_user_pages() is a way to map user-space memory into the kernel's
address space; it will ensure that all of the requested pages have
been faulted into RAM (and locked there) and provide a kernel mapping
that, in turn, can be used for direct access by the kernel or (more
often) to set up zero-copy I/O operations."

> > E, feel free to add
> >
> > then what is preventing an agreement like the bounce page?
>
> There are 2 sides (AFAIR):
>   - do not write back GUPed pages and wait until the GUP goes away to
>     write them. But a GUP can last as long as the uptime and we can
>     lose data on power failure.
>   - use a bounce page so that there is a chance we have some data
>     on power failure.
>
> > Because page migrate and reclaim have been working for a while with
> > gup pins taken into account, detecting them has no priority in any form
> > over the agreement on how to make a writeback page stable.
>
> migrate just ignores GUPed pages and thus there is no issue with migrate.
> writeback is a special case here because some filesystems need stable
> page content and also we need to inhibit some fs-specific things that
> trigger BUG_ON() in set_page_dirty*().
>
Which drivers so far have been snared by the BUG_ON()? Is there any
chance to fix them one after another? Otherwise what is making them
special (long-lived pins)?

After setting a page dirty, is there any pending DMA transfer to the
dirty page? If yes, what is the point of doing writeback for corrupted
data? If no, what is preventing the gup pin from being released?

> > What seems more important, restriction B above makes C hard to meet
> > in any feasible approach trying to keep a writeback page stable, and
> > zero-copy makes it harder AFAICS.
>
> writeback can use a bounce page; zero copy, ie not having to use a bounce
> page, is not an issue; in fact in some cases we already use bounce pages
> (at the block device level).
>
> > > In any case my opinion is that the bounce page is the best thing we can
> > > do: from the application and FS point of view it mimics the characteristics
> > > of regular write-back, just as if the write-protection window of the
> > > written-back page was infinitely short.
> >
> > A 100-line patch tells more than a 200-line explanation can and helps
> > to shorten the discussion prior to reaching an agreement.
>
> It is not that trivial: you need to make sure every layer from the fs down
> to the block device driver properly behaves in front of a bounce page. We
> have such a mechanism for bio, but it is at the bio level; maybe it can be
> moved up one level.
On Fri, Nov 08, 2019 at 05:38:37PM +0800, Hillf Danton wrote:
>
> On Thu, 7 Nov 2019 09:57:48 -0500 Jerome Glisse wrote:
> >
> > I am not sure I follow. Today we cannot differentiate between GUP
> > and regular get_page(); if you use some combination of specific fs
> > and hardware you might get some BUG_ON() thrown at you, depending on
> > how lucky/unlucky you are. We cannot solve this without being able
> > to differentiate between GUP and regular get_page(). Hence why John's
> > patchset is the first step in the right direction.
> >
> What is the second one? And when? By who?

Fix the current BUG_ON()s by not releasing buffers after write-back.
Decide what to do for write-back when a page is pinned. It will happen
once we are done with the pin changes and the infrastructure is in
place. It will be done by John or others.

> > If there is no GUP on a page then regular writeback happens as it has
> > for years now, so in the absence of GUP I do not see any issue.
> >
> > > > still something where there is no agreement, as far as I remember the
> > > > outcome of the last discussion we had. I expect this will be a topic
> > > > at the next LSF/MM, or maybe something we can flush out before.
> > >
> > > These are the restraints we know:
> > >
> > > A, multiple gup pins
> > > B, mutual data corruptions
> > > C, no break of existing use cases
> > > D, zero copy
> >
> > ? What do you mean by zero copy?
> >
> Snippet that can be found at https://lwn.net/Articles/784574/
>
> "get_user_pages() is a way to map user-space memory into the kernel's
> address space; it will ensure that all of the requested pages have
> been faulted into RAM (and locked there) and provide a kernel mapping
> that, in turn, can be used for direct access by the kernel or (more
> often) to set up zero-copy I/O operations."
>
> > > E, feel free to add
> > >
> > > then what is preventing an agreement like the bounce page?
>
> > There are 2 sides (AFAIR):
> >   - do not write back GUPed pages and wait until the GUP goes away to
> >     write them. But a GUP can last as long as the uptime and we can
> >     lose data on power failure.
> >   - use a bounce page so that there is a chance we have some data
> >     on power failure.
> >
> > > Because page migrate and reclaim have been working for a while with
> > > gup pins taken into account, detecting them has no priority in any form
> > > over the agreement on how to make a writeback page stable.
> >
> > migrate just ignores GUPed pages and thus there is no issue with migrate.
> > writeback is a special case here because some filesystems need stable
> > page content and also we need to inhibit some fs-specific things that
> > trigger BUG_ON() in set_page_dirty*().
> >
> Which drivers so far have been snared by the BUG_ON()? Is there any
> chance to fix them one after another? Otherwise what is making them
> special (long-lived pins)?

It is a race thing, so it does not necessarily happen; the longer the
pin, the higher the risk. It can only happen with some fs (I forgot
which ones but you can go read the previous threads). It is easy to
fix: all we need to do is not release some fs structures of pinned
pages after write-back.

> After setting a page dirty, is there any pending DMA transfer to the
> dirty page? If yes, what is the point of doing writeback for corrupted
> data? If no, what is preventing the gup pin from being released?

When users of GUP call set_page_dirty() they _must_ be done with using
the page, and that is the case today AFAICT, so no pending DMA. The GUP
pin is released after set_page_dirty() by all current users. Note that
current users all do that once they are done, and they can hold the
pages for an _indefinite_ amount of time, ie forever. They dirty pages
in their teardown code path. Hence the question that we need to answer
is what to do for dirty pages while they are GUPed.
Note that a page can be set dirty while GUPed because CPU access can
still happen, and thus the regular dirtiness tracking mechanisms do
operate on such a page. So a page can go through:
  - GUPed by someone
  - written by CPU, marked as dirty
  - regular write-back kicks in
  - page is marked clean and the fs might release data structures
  ... any amount of time ... what to do here if more CPU writes?
  - GUP user is done and, before put_page(), calls set_page_dirty();
    this might BUG_ON() inside fs code for some fs if the page was
    left clean on the CPU side since the last writeback

I would strongly advise reading the previous threads where this was
discussed at length.

Cheers,
Jérôme
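[The timeline above can be modeled as a tiny user-space state machine. All names are made up for illustration; real filesystems keep this state in page flags and fs-private structures such as buffer heads.]

```c
#include <stdbool.h>

/* Toy model of the race: not kernel code, just the state transitions. */
struct toy_page {
	bool dirty;		/* models PageDirty */
	bool fs_private;	/* models fs bookkeeping (e.g. buffers) */
};

/* CPU write through a mapping: regular dirty tracking marks the page. */
static void cpu_write(struct toy_page *p)
{
	p->dirty = true;
}

/*
 * Regular write-back: cleans the page and lets the fs drop its
 * bookkeeping, since nothing tells it a GUP user still holds the page.
 */
static void writeback(struct toy_page *p)
{
	p->dirty = false;
	p->fs_private = false;
}

/*
 * GUP user's teardown: returns false where the kernel would BUG_ON(),
 * because set_page_dirty() finds the fs state already released.
 */
static bool gup_teardown_set_dirty(struct toy_page *p)
{
	if (!p->fs_private)
		return false;	/* the BUG_ON() scenario */
	p->dirty = true;
	return true;
}
```

[Running the transitions in the order listed above leaves the page clean with its fs state released, so the GUP user's late set_page_dirty() finds nothing to attach the dirtying to, which is the BUG_ON() case.]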
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1055,6 +1055,29 @@ static inline void put_page(struct page
 		__put_page(page);
 }
 
+/*
+ * @page must be pagecache page
+ */
+static inline bool page_try_gup_pin(struct page *page)
+{
+	int count;
+
+	page = compound_head(page);
+	count = page_ref_count(page);
+	smp_mb__after_atomic();
+
+	if (!count || count > page_mapcount(page) + 1 +
+			page_has_private(page))
+		return false;
+
+	if (page_ref_inc_return(page) == count + 1) {
+		smp_mb__after_atomic();
+		return true;
+	}
+	put_page(page);
+	return false;
+}
+
 /**
  * put_user_page() - release a gup-pinned page
  * @page: pointer to page to be released
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -253,7 +253,11 @@ retry:
 	}
 
 	if (flags & FOLL_GET) {
-		if (unlikely(!try_get_page(page))) {
+		if (page_is_file_cache(page)) {
+			if (PageWriteback(page) || !page_try_gup_pin(page))
+				goto pin_fail;
+		} else if (unlikely(!try_get_page(page))) {
+pin_fail:
 			page = ERR_PTR(-ENOMEM);
 			goto out;
 		}
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2202,6 +2202,9 @@ int write_cache_pages(struct address_spa
 
 			done_index = page->index;
 
+			if (!page_try_gup_pin(page))
+				continue;
+
 			lock_page(page);
 
 			/*
@@ -2215,6 +2218,7 @@ int write_cache_pages(struct address_spa
 			if (unlikely(page->mapping != mapping)) {
 continue_unlock:
 				unlock_page(page);
+				put_page(page);
 				continue;
 			}
 
@@ -2236,6 +2240,11 @@ continue_unlock:
 
 			trace_wbc_writepage(wbc, inode_to_bdi(mapping->host));
 			error = (*writepage)(page, wbc, data);
+			/*
+			 * protection of gup pin is no longer needed after
+			 * putting page under writeback
+			 */
+			put_page(page);
 			if (unlikely(error)) {
 				/*
 				 * Handle errors according to the type of
A helper is added for mitigating the gup issue described at
https://lwn.net/Articles/784574/. It is unsafe to write out
a dirty page that is already gup pinned for DMA.

In the current writeback context, dirty pages are written out with
no detection of whether they have been gup pinned, nor any mark to
keep gupers off. In the gup context, file pages can be pinned with
other gupers and writeback ignored.

The fact that no room, not even one bit, in the current page struct
can be used for tracking gupers makes the issue harder to tackle.

The approach here is, because it makes no sense to allow a file page
to have multiple gupers at the same time, to make gupers mutually
exclusive; the guper's singularity then helps to tell whether a
guper exists by looking at the change in page count.

The result of that singularity is not yet 100% correct but something
of "best effort", as the effect of random get_page() is perhaps also
folded into it. It is assumed the best effort is feasible/acceptable
in practice without the cost of growing the page struct size by one
bit, given that something similar has been applied to the page
migrate and reclaim contexts for a while.

With the helper in place, we skip writing out a dirty page if a guper
is detected; on gupping, we give up pinning a file page due to
writeback or losing the race to become a guper. The end result is,
no gup-pinned page will be put under writeback.

It is based on next-20191031.

Signed-off-by: Hillf Danton <hdanton@sina.com>
---
--
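[For illustration, the counting heuristic in page_try_gup_pin() above can be mimicked with C11 atomics in user space. `struct pin_page` and its fields are stand-ins for the struct page queries the patch uses; this is a sketch of the idea, not the kernel code.]

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Toy stand-in for the struct page fields the helper looks at. */
struct pin_page {
	atomic_int refcount;	/* models page_ref_count() */
	int mapcount;		/* models page_mapcount() */
	int has_private;	/* models page_has_private(): 0 or 1 */
};

/*
 * Mimics the heuristic in page_try_gup_pin(): refuse when the current
 * reference count cannot be explained by mappings plus the pagecache
 * ref plus private state (an extra ref means a likely existing guper),
 * then take a reference and re-check that nobody raced with us.
 * Best effort: a random get_page() between the read and the increment
 * also makes the attempt fail.
 */
static bool pin_page_try_gup_pin(struct pin_page *p)
{
	int count = atomic_load(&p->refcount);

	if (!count || count > p->mapcount + 1 + p->has_private)
		return false;	/* someone already holds an extra ref */

	if (atomic_fetch_add(&p->refcount, 1) == count)
		return true;	/* count unchanged: we are the only new ref */

	atomic_fetch_sub(&p->refcount, 1);	/* raced: drop our ref */
	return false;
}
```

[A second caller then sees a count one higher than mappings plus private state can explain and is refused, which is the mutual exclusion the cover letter describes; a transient get_page() elsewhere also raises the count and makes a legitimate pin attempt fail, the price of not spending a dedicated bit in struct page.]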