Message ID | 3351099.1675077249@warthog.procyon.org.uk (mailing list archive)
---|---
State | New
Series | [GIT,PULL] iov_iter: Improve page extraction (pin or just list)
On 1/30/23 4:14 AM, David Howells wrote:
> Hi Jens,
>
> Could you consider pulling this patchset into the block tree? I think that
> Al's fears wrt pinned pages being removed from page tables causing deadlock
> have been answered. Granted, there is still the issue of how to handle
> vmsplice and a bunch of other places to fix, not least skbuff handling.
>
> I also have patches to fix cifs in a separate branch that I would also like
> to push in this merge window - and that requires the first two patches from
> this series, so would it be possible for you to merge at least those two
> rather than manually applying them?

I've pulled this into a separate branch, but based on the block branch,
for-6.3/iov-extract. It's added to for-next as well.
Jens Axboe <axboe@kernel.dk> wrote:
> I've pulled this into a separate branch, but based on the block branch,
> for-6.3/iov-extract. It's added to for-next as well.

Many thanks!

David
On 1/30/23 2:33 PM, Jens Axboe wrote:
> On 1/30/23 4:14 AM, David Howells wrote:
>> Could you consider pulling this patchset into the block tree?
>> [...]
>
> I've pulled this into a separate branch, but based on the block branch,
> for-6.3/iov-extract. It's added to for-next as well.

This does cause about a 2.7% regression for me, using O_DIRECT on a raw
block device. Looking at a perf diff, here's the top:

  +2.71%  [kernel.vmlinux]  [k] mod_node_page_state
  +2.22%  [kernel.vmlinux]  [k] iov_iter_extract_pages

and these two are gone:

   2.14%  [kernel.vmlinux]  [k] __iov_iter_get_pages_alloc
   1.53%  [kernel.vmlinux]  [k] iov_iter_get_pages

rest is mostly in the noise, but mod_node_page_state() sticks out like
a sore thumb. They seem to be caused by the node stat accounting done
in gup.c for FOLL_PIN. Hmm?
On 1/30/23 2:55 PM, Jens Axboe wrote:
> On 1/30/23 2:33 PM, Jens Axboe wrote:
>> [...]
>
> This does cause about a 2.7% regression for me, using O_DIRECT on a raw
> block device. Looking at a perf diff, here's the top:
>
>   +2.71%  [kernel.vmlinux]  [k] mod_node_page_state
>   +2.22%  [kernel.vmlinux]  [k] iov_iter_extract_pages
>
> and these two are gone:
>
>    2.14%  [kernel.vmlinux]  [k] __iov_iter_get_pages_alloc
>    1.53%  [kernel.vmlinux]  [k] iov_iter_get_pages
>
> rest is mostly in the noise, but mod_node_page_state() sticks out like
> a sore thumb. They seem to be caused by the node stat accounting done
> in gup.c for FOLL_PIN.

Confirmed just disabling the node_stat bits in mm/gup.c and now the
performance is back to the same levels as before.

An almost 3% regression is a bit hard to swallow...
On 1/30/23 13:57, Jens Axboe wrote:
>> This does cause about a 2.7% regression for me, using O_DIRECT on a raw
>> block device. Looking at a perf diff, here's the top:
>>
>>   +2.71%  [kernel.vmlinux]  [k] mod_node_page_state
>>   +2.22%  [kernel.vmlinux]  [k] iov_iter_extract_pages
>>
>> and these two are gone:
>>
>>    2.14%  [kernel.vmlinux]  [k] __iov_iter_get_pages_alloc
>>    1.53%  [kernel.vmlinux]  [k] iov_iter_get_pages
>>
>> rest is mostly in the noise, but mod_node_page_state() sticks out like
>> a sore thumb. They seem to be caused by the node stat accounting done
>> in gup.c for FOLL_PIN.
>
> Confirmed just disabling the node_stat bits in mm/gup.c and now the
> performance is back to the same levels as before.
>
> An almost 3% regression is a bit hard to swallow...

This is something that we saw when adding pin_user_pages_fast(), yes. I
doubt that I can quickly find the email thread, but we measured it and
weren't immediately able to come up with a way to make it faster.

At this point, it's a good time to consider if there is any way to
speed it up. But I wanted to confirm that you're absolutely right: the
measurement sounds about right, and that's also the hotspot that we
saw, too.

thanks,
On 1/30/23 3:02 PM, John Hubbard wrote:
> On 1/30/23 13:57, Jens Axboe wrote:
>>> [...]
>>> rest is mostly in the noise, but mod_node_page_state() sticks out like
>>> a sore thumb. They seem to be caused by the node stat accounting done
>>> in gup.c for FOLL_PIN.
>>
>> Confirmed just disabling the node_stat bits in mm/gup.c and now the
>> performance is back to the same levels as before.
>>
>> An almost 3% regression is a bit hard to swallow...
>
> This is something that we saw when adding pin_user_pages_fast(),
> yes. I doubt that I can quickly find the email thread, but we
> measured it and weren't immediately able to come up with a way
> to make it faster.
>
> At this point, it's a good time to consider if there is any
> way to speed it up. But I wanted to confirm that you're absolutely
> right: the measurement sounds about right, and that's also the
> hotspot that we saw, too.

From spending all of 5 minutes on this, it must be due to exceeding the
pcp stat_threshold, as we then end up doing two atomic_long_adds().
Looking at proc, it looks like the threshold is 108. And with this test,
we're hitting that slow path ~80k/second. Uhm...
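The mechanism Jens is poking at here can be sketched in a few lines. Below is
a minimal userspace model of the per-CPU folding - not the kernel code itself
(that lives in __mod_node_page_state() in mm/vmstat.c); the struct, helper
names and the hard-coded threshold are illustrative:

#include <stdatomic.h>
#include <stdlib.h>

#define STAT_THRESHOLD 108	/* the per-CPU stat_threshold seen in proc above */

static atomic_long node_stat;	/* stand-in for the per-node counter */
static atomic_long total_stat;	/* stand-in for the global counter */

struct pcp_stat {
	signed char diff;	/* the kernel likewise uses an s8 per counter per CPU */
};

/* cheap while |diff| stays under the threshold; two atomic RMWs otherwise */
static void mod_node_page_state(struct pcp_stat *pcp, long delta)
{
	long x = pcp->diff + delta;

	if (labs(x) > STAT_THRESHOLD) {
		/* the slow path: the two atomic_long_adds() in the profile */
		atomic_fetch_add(&node_stat, x);
		atomic_fetch_add(&total_stat, x);
		x = 0;
	}
	pcp->diff = (signed char)x;
}

Since NR_FOLL_PIN_ACQUIRED and NR_FOLL_PIN_RELEASED each only ever move in one
direction, a steady stream of pins crosses the threshold roughly once per 108
pages per CPU - which at the I/O rate above lands in the fold path ~80k times
a second.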
John Hubbard <jhubbard@nvidia.com> wrote:
> This is something that we saw when adding pin_user_pages_fast(),
> yes. I doubt that I can quickly find the email thread, but we
> measured it and weren't immediately able to come up with a way
> to make it faster.

percpu counters maybe - add them up at the point of viewing?

David
On 1/30/23 3:12 PM, David Howells wrote:
> John Hubbard <jhubbard@nvidia.com> wrote:
>
>> This is something that we saw when adding pin_user_pages_fast(),
>> yes. I doubt that I can quickly find the email thread, but we
>> measured it and weren't immediately able to come up with a way
>> to make it faster.
>
> percpu counters maybe - add them up at the point of viewing?

They are percpu, see my last email. But for every 108 changes (on
my system), they will do two atomic_long_adds(). So not very
useful for anything but low frequency modifications.
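For contrast, what David Howells seems to be suggesting is closer to
lib/percpu_counter: keep a full long per CPU and defer all the cost to the
reader, which sums the slots at the point of viewing. A rough sketch of that
idea (hypothetical names, not the kernel API):

#define NR_CPUS 64	/* illustrative */

struct summed_counter {
	long cpu_slot[NR_CPUS];	/* stand-in for __percpu storage */
};

/* write fast path: no threshold, no atomics, nothing ever flushes */
static void summed_add(struct summed_counter *c, int cpu, long delta)
{
	c->cpu_slot[cpu] += delta;
}

/* read slow path: O(nr_cpus) walk, paid only at the point of viewing */
static long summed_read(const struct summed_counter *c)
{
	long sum = 0;

	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		sum += c->cpu_slot[cpu];
	return sum;
}

The trade-off, and presumably why the node stats don't work this way, is
footprint and read cost: the node stats pack an s8 per counter per CPU across
dozens of counters, whereas this needs a full long per counter per CPU plus an
every-CPU walk on each read.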
On 30.01.23 23:15, Jens Axboe wrote:
> On 1/30/23 3:12 PM, David Howells wrote:
>> John Hubbard <jhubbard@nvidia.com> wrote:
>>
>>> This is something that we saw when adding pin_user_pages_fast(),
>>> yes. I doubt that I can quickly find the email thread, but we
>>> measured it and weren't immediately able to come up with a way
>>> to make it faster.
>>
>> percpu counters maybe - add them up at the point of viewing?
>
> They are percpu, see my last email. But for every 108 changes (on
> my system), they will do two atomic_long_adds(). So not very
> useful for anything but low frequency modifications.

Can we just treat the whole acquired/released accounting as a debug
mechanism to detect missing releases and do it only for debug kernels?

The pcpu counter is an s8, so we have to flush on a regular basis and
cannot really defer it any longer ... but I'm curious whether it would
be of any help to only have a single PINNED counter that goes in both
directions (inc/dec on pin/release), to reduce the flushing.

Of course, once we pin/release more than ~108 pages in one go or we
switch CPUs frequently it won't be that much of a help ...
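Sketching David Hildenbrand's single-counter idea against the folding model
above (illustrative only): pin and unpin push the same per-CPU diff in
opposite directions, so a pin that is released soon afterwards on the same CPU
cancels out before |diff| ever reaches the flush threshold.

/* one bidirectional NR_FOLL_PINNED-style counter instead of two */
static void foll_pin_note_acquire(struct pcp_stat *pcp, long refs)
{
	mod_node_page_state(pcp, refs);		/* +refs on pin */
}

static void foll_pin_note_release(struct pcp_stat *pcp, long refs)
{
	mod_node_page_state(pcp, -refs);	/* -refs on unpin */
}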
On Tue 31-01-23 09:32:27, David Hildenbrand wrote:
> On 30.01.23 23:15, Jens Axboe wrote:
> > On 1/30/23 3:12 PM, David Howells wrote:
> > > John Hubbard <jhubbard@nvidia.com> wrote:
> > >
> > > > This is something that we saw when adding pin_user_pages_fast(),
> > > > yes. I doubt that I can quickly find the email thread, but we
> > > > measured it and weren't immediately able to come up with a way
> > > > to make it faster.
> > >
> > > percpu counters maybe - add them up at the point of viewing?
> >
> > They are percpu, see my last email. But for every 108 changes (on
> > my system), they will do two atomic_long_adds(). So not very
> > useful for anything but low frequency modifications.
>
> Can we just treat the whole acquired/released accounting as a debug
> mechanism to detect missing releases and do it only for debug kernels?

Yes, AFAIK it is just a debug mechanism for helping to find out issues
with page pinning conversions. So I think we can put this behind some
debugging ifdef. John?

								Honza
David Hildenbrand <david@redhat.com> wrote:
>>> percpu counters maybe - add them up at the point of viewing?
>> They are percpu, see my last email. But for every 108 changes (on
>> my system), they will do two atomic_long_adds(). So not very
>> useful for anything but low frequency modifications.
>
> Can we just treat the whole acquired/released accounting as a debug
> mechanism to detect missing releases and do it only for debug kernels?
>
> The pcpu counter is an s8, so we have to flush on a regular basis and
> cannot really defer it any longer ... but I'm curious whether it would
> be of any help to only have a single PINNED counter that goes in both
> directions (inc/dec on pin/release), to reduce the flushing.
>
> Of course, once we pin/release more than ~108 pages in one go or we
> switch CPUs frequently it won't be that much of a help ...

What are the stats actually used for? Is it just debugging, or do we
actually have users for them (control groups spring to mind)?

David
On 31.01.23 14:41, David Howells wrote:
> David Hildenbrand <david@redhat.com> wrote:
>> [...]
>
> What are the stats actually used for? Is it just debugging, or do we
> actually have users for them (control groups spring to mind)?

As it's really just "how many pinning events" vs. "how many unpinning
events", I assume it's only for debugging.

For example, if you pin the same page twice it would not get accounted
as "a single page is pinned".
On 1/31/23 6:48 AM, David Hildenbrand wrote:
> On 31.01.23 14:41, David Howells wrote:
>> [...]
>> What are the stats actually used for? Is it just debugging, or do we
>> actually have users for them (control groups spring to mind)?
>
> As it's really just "how many pinning events" vs. "how many unpinning
> events", I assume it's only for debugging.
>
> For example, if you pin the same page twice it would not get accounted
> as "a single page is pinned".

How about something like the below then? I can send it out as a real
patch, will run a sanity check on it first, but I'd be surprised if
this doesn't fix it.

diff --git a/mm/gup.c b/mm/gup.c
index f45a3a5be53a..41abb16286ec 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -168,7 +168,9 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
 	 */
 	smp_mb__after_atomic();
 
+#ifdef CONFIG_DEBUG_VM
 	node_stat_mod_folio(folio, NR_FOLL_PIN_ACQUIRED, refs);
+#endif
 
 	return folio;
 }
@@ -180,7 +182,9 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
 static void gup_put_folio(struct folio *folio, int refs, unsigned int flags)
 {
 	if (flags & FOLL_PIN) {
+#ifdef CONFIG_DEBUG_VM
 		node_stat_mod_folio(folio, NR_FOLL_PIN_RELEASED, refs);
+#endif
 		if (folio_test_large(folio))
 			atomic_sub(refs, folio_pincount_ptr(folio));
 		else
@@ -236,8 +240,9 @@ int __must_check try_grab_page(struct page *page, unsigned int flags)
 		} else {
 			folio_ref_add(folio, GUP_PIN_COUNTING_BIAS);
 		}
-
+#ifdef CONFIG_DEBUG_VM
 		node_stat_mod_folio(folio, NR_FOLL_PIN_ACQUIRED, 1);
+#endif
 	}
 
 	return 0;
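An alternative spelling of the same guard would keep the calls compiled and
type-checked in every config and let the compiler discard the dead branch.
This is a sketch, not the patch as posted, and it only builds as long as the
NR_FOLL_PIN_* enum entries remain defined unconditionally:

	/* dead-code-eliminated when CONFIG_DEBUG_VM is not set */
	if (IS_ENABLED(CONFIG_DEBUG_VM))
		node_stat_mod_folio(folio, NR_FOLL_PIN_ACQUIRED, refs);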
On 31.01.23 15:50, Jens Axboe wrote:
> On 1/31/23 6:48 AM, David Hildenbrand wrote:
>> [...]
>> For example, if you pin the same page twice it would not get accounted
>> as "a single page is pinned".
>
> How about something like the below then? I can send it out as a real
> patch, will run a sanity check on it first, but I'd be surprised if
> this doesn't fix it.
>
> [mm/gup.c patch snipped]

We might want to hide the counters completely by defining them only
with CONFIG_DEBUG_VM.
On 1/31/23 8:02 AM, David Hildenbrand wrote:
> On 31.01.23 15:50, Jens Axboe wrote:
>> [...]
>> How about something like the below then? I can send it out as a real
>> patch, will run a sanity check on it first, but I'd be surprised if
>> this doesn't fix it.
>>
>> [mm/gup.c patch snipped]
>
> We might want to hide the counters completely by defining them only
> with CONFIG_DEBUG_VM.

Are all of them debug aids only? If so, yes, we should just have
node_stat_* under CONFIG_DEBUG_VM.
On 31.01.23 16:04, Jens Axboe wrote:
> On 1/31/23 8:02 AM, David Hildenbrand wrote:
>> [...]
>> We might want to hide the counters completely by defining them only
>> with CONFIG_DEBUG_VM.
>
> Are all of them debug aids only? If so, yes, we should just have
> node_stat_* under CONFIG_DEBUG_VM.

Rather only these 2. Something like:

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 815c7c2edf45..a526964b65ce 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -196,8 +196,10 @@ enum node_stat_item {
 	NR_WRITTEN,		/* page writings since bootup */
 	NR_THROTTLED_WRITTEN,	/* NR_WRITTEN while reclaim throttled */
 	NR_KERNEL_MISC_RECLAIMABLE,	/* reclaimable non-slab kernel pages */
+#ifdef CONFIG_DEBUG_VM
 	NR_FOLL_PIN_ACQUIRED,	/* via: pin_user_page(), gup flag: FOLL_PIN */
 	NR_FOLL_PIN_RELEASED,	/* pages returned via unpin_user_page() */
+#endif
 	NR_KERNEL_STACK_KB,	/* measured in KiB */
 #if IS_ENABLED(CONFIG_SHADOW_CALL_STACK)
 	NR_KERNEL_SCS_KB,	/* measured in KiB */
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 1ea6a5ce1c41..5cbd9a1924bf 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1227,8 +1227,10 @@ const char * const vmstat_text[] = {
 	"nr_written",
 	"nr_throttled_written",
 	"nr_kernel_misc_reclaimable",
+#ifdef CONFIG_DEBUG_VM
 	"nr_foll_pin_acquired",
 	"nr_foll_pin_released",
+#endif
 	"nr_kernel_stack",
 #if IS_ENABLED(CONFIG_SHADOW_CALL_STACK)
 	"nr_shadow_call_stack",
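Two things are worth noting about this variant: /proc/vmstat prints
vmstat_text positionally against the node_stat_item enum, so the two #ifdef
blocks above have to stay in lockstep; and once the enum entries themselves
are conditional, the call sites in mm/gup.c need matching #ifdef guards as in
Jens's earlier patch (an IS_ENABLED() check would no longer compile there).
On non-debug kernels, nr_foll_pin_acquired and nr_foll_pin_released then
simply disappear from /proc/vmstat.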
On 1/31/23 8:10 AM, David Hildenbrand wrote:
> On 31.01.23 16:04, Jens Axboe wrote:
>> [...]
>> Are all of them debug aids only? If so, yes, we should just have
>> node_stat_* under CONFIG_DEBUG_VM.
>
> Rather only these 2. Something like:
>
> [mmzone.h/vmstat.c patch snipped]

Ah gotcha, that makes more sense to me. Will update the patch and send
it out.
On 1/31/23 04:28, Jan Kara wrote:
> On Tue 31-01-23 09:32:27, David Hildenbrand wrote:
>> [...]
>> Can we just treat the whole acquired/released accounting as a debug
>> mechanism to detect missing releases and do it only for debug kernels?
>
> Yes, AFAIK it is just a debug mechanism for helping to find out issues
> with page pinning conversions. So I think we can put this behind some
> debugging ifdef. John?

Yes, just for debugging. I wrote a little note just now in response to
the patch about how we ended up here: "yes, it's time to hide these
behind an ifdef".

thanks,