Message ID | 152066493247.40260.10849841915366086021.stgit@dwillia2-desk3.amr.corp.intel.com |
---|---|
State | Deferred, archived |
On Fri, Mar 09, 2018 at 10:55:32PM -0800, Dan Williams wrote:
> Add a generic facility for awaiting an atomic_t to reach a value of 1.
>
> Page reference counts typically need to reach 0 to be considered a
> free / inactive page. However, ZONE_DEVICE pages allocated via
> devm_memremap_pages() are never 'onlined', i.e. the put_page() typically
> done at init time to assign pages to the page allocator is skipped.
>
> These pages will have their reference count elevated > 1 by
> get_user_pages() when they are under DMA. In order to coordinate DMA to
> these pages vs filesystem operations like hole-punch and truncate, the
> filesystem-dax implementation needs to capture the DMA-idle event (i.e.
> the 2 to 1 count transition).
>
> For now, this implementation does not have functional behavior change,
> follow-on patches will add waiters for these page-idle events.

Argh, no no no.. That whole wait_for_atomic_t thing is a giant
trainwreck already and now you're making it worse still.

Please have a look here:

https://lkml.kernel.org/r/20171101190644.chwhfpoz3ywxx2m7@hirez.programming.kicks-ass.net

-- 
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
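To make the quoted 2-to-1 transition concrete, here is a minimal userspace C11 sketch. It is not kernel code: `struct dev_page`, `dev_page_init()`, `dev_page_pin()` and `dev_page_unpin()` are hypothetical stand-ins for a ZONE_DEVICE page, get_user_pages() and put_page(), and the wakeup is modeled as a counter rather than a real waitqueue. It shows that because the init-time put_page() is skipped, an idle page sits at a count of 1, so "DMA idle" is the transition from 2 to 1 rather than to 0.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model of a ZONE_DEVICE page: only the refcount matters here. */
struct dev_page {
	atomic_int refcount;
	int idle_wakeups;	/* how many times a page-idle waiter would be woken */
};

/* Hypothetical: the init-time put_page() is skipped, so idle means 1, not 0. */
static void dev_page_init(struct dev_page *page)
{
	atomic_init(&page->refcount, 1);
	page->idle_wakeups = 0;
}

/* Hypothetical stand-in for get_user_pages() pinning the page for DMA. */
static void dev_page_pin(struct dev_page *page)
{
	atomic_fetch_add(&page->refcount, 1);
}

/*
 * Hypothetical stand-in for put_page() after DMA completes.
 * atomic_fetch_sub() returns the old value, so an old value of 2 means this
 * call performed the 2 -> 1 (DMA-idle) transition and is the one that must
 * wake waiters (the point where the patch calls wake_up_atomic_one()).
 */
static bool dev_page_unpin(struct dev_page *page)
{
	int old = atomic_fetch_sub(&page->refcount, 1);

	assert(old > 1);	/* the base reference is never dropped here */
	if (old == 2) {
		page->idle_wakeups++;
		return true;
	}
	return false;
}
```

With two overlapping DMA pins, only the second unpin reports the idle transition; the first merely drops the count from 3 to 2.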
On Sun, Mar 11, 2018 at 4:27 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, Mar 09, 2018 at 10:55:32PM -0800, Dan Williams wrote:
>> Add a generic facility for awaiting an atomic_t to reach a value of 1.
>>
>> Page reference counts typically need to reach 0 to be considered a
>> free / inactive page. However, ZONE_DEVICE pages allocated via
>> devm_memremap_pages() are never 'onlined', i.e. the put_page() typically
>> done at init time to assign pages to the page allocator is skipped.
>>
>> These pages will have their reference count elevated > 1 by
>> get_user_pages() when they are under DMA. In order to coordinate DMA to
>> these pages vs filesystem operations like hole-punch and truncate, the
>> filesystem-dax implementation needs to capture the DMA-idle event (i.e.
>> the 2 to 1 count transition).
>>
>> For now, this implementation does not have functional behavior change,
>> follow-on patches will add waiters for these page-idle events.
>
> Argh, no no no.. That whole wait_for_atomic_t thing is a giant
> trainwreck already and now you're making it worse still.
>
> Please have a look here:
>
> https://lkml.kernel.org/r/20171101190644.chwhfpoz3ywxx2m7@hirez.programming.kicks-ass.net

That thread seems to be worried about the object disappearing the
moment its reference count reaches a target. That isn't the case with
the memmap / struct page objects for ZONE_DEVICE pages. I understand
wait_for_atomic_one() is broken in the general case, but as far as I
can see it works fine specifically for ZONE_DEVICE page busy tracking,
just not generic object lifetime.
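The lifetime argument turns on what the wakeup path touches. Below is a minimal, single-threaded userspace C11 model of a hashed waitqueue like the one the patch relies on; every name in it (`struct waiter`, `NR_BUCKETS`, `bucket_of()`, `add_waiter()`, `wake_counter()`) is hypothetical, and locking and real sleeping are omitted. Because unrelated counters can share a bucket, the waker has to filter entries by address and by target value, which means it reads the counter after the transition. As far as this model goes, that read is only safe while the counter's memory (for ZONE_DEVICE, a long-lived struct page) still exists.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NR_BUCKETS 4	/* hypothetical, deliberately tiny so counters collide */

/* Hypothetical model of one wait_bit_queue_entry: which counter, which value. */
struct waiter {
	atomic_int *addr;
	int target;
	bool woken;
	struct waiter *next;
};

static struct waiter *buckets[NR_BUCKETS];

/* Hypothetical analogue of bit_waitqueue(p, 0): hash the address to a bucket. */
static size_t bucket_of(const atomic_int *addr)
{
	return ((uintptr_t)addr >> 2) % NR_BUCKETS;
}

/* Hypothetical: register a waiter for *addr reaching target. */
static void add_waiter(struct waiter *w, atomic_int *addr, int target)
{
	size_t b = bucket_of(addr);

	w->addr = addr;
	w->target = target;
	w->woken = false;
	w->next = buckets[b];
	buckets[b] = w;
}

/*
 * Hypothetical analogue of __wake_atomic_t_function() applied to a bucket:
 * skip entries for other counters or other targets, wake and unlink the rest.
 * Note the atomic_load(addr): the counter must still exist at wake time.
 */
static int wake_counter(atomic_int *addr)
{
	struct waiter **pp = &buckets[bucket_of(addr)];
	int nr_woken = 0;

	while (*pp) {
		struct waiter *w = *pp;

		if (w->addr == addr && atomic_load(addr) == w->target) {
			w->woken = true;
			*pp = w->next;	/* like autoremove_wake_function() */
			nr_woken++;
		} else {
			pp = &w->next;
		}
	}
	return nr_woken;
}
```

Waking one counter leaves waiters on other counters untouched even when they share a bucket, and a wake issued before the target value is reached wakes nobody.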
On Sun, Mar 11, 2018 at 10:15 AM, Dan Williams <dan.j.williams@intel.com> wrote:
> On Sun, Mar 11, 2018 at 4:27 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>> On Fri, Mar 09, 2018 at 10:55:32PM -0800, Dan Williams wrote:
>>> Add a generic facility for awaiting an atomic_t to reach a value of 1.
>>>
>>> Page reference counts typically need to reach 0 to be considered a
>>> free / inactive page. However, ZONE_DEVICE pages allocated via
>>> devm_memremap_pages() are never 'onlined', i.e. the put_page() typically
>>> done at init time to assign pages to the page allocator is skipped.
>>>
>>> These pages will have their reference count elevated > 1 by
>>> get_user_pages() when they are under DMA. In order to coordinate DMA to
>>> these pages vs filesystem operations like hole-punch and truncate, the
>>> filesystem-dax implementation needs to capture the DMA-idle event (i.e.
>>> the 2 to 1 count transition).
>>>
>>> For now, this implementation does not have functional behavior change,
>>> follow-on patches will add waiters for these page-idle events.
>>
>> Argh, no no no.. That whole wait_for_atomic_t thing is a giant
>> trainwreck already and now you're making it worse still.
>>
>> Please have a look here:
>>
>> https://lkml.kernel.org/r/20171101190644.chwhfpoz3ywxx2m7@hirez.programming.kicks-ass.net
>
> That thread seems to be worried about the object disappearing the
> moment its reference count reaches a target. That isn't the case with
> the memmap / struct page objects for ZONE_DEVICE pages. I understand
> wait_for_atomic_one() is broken in the general case, but as far as I
> can see it works fine specifically for ZONE_DEVICE page busy tracking,
> just not generic object lifetime.

Ok, that thread is also concerned with cleaning up the wait_for_atomic_*
pattern to do something more idiomatic with wait_event().
I agree that would be better, but I'm running short of time to go
refactor this api for 4.17 inclusion, especially as I expect another
couple rounds of review on this more urgent data corruption fix series
that depends on this new api. I think the addition of
wait_for_atomic_one() makes it clear that we need a way to pass a
conditional expression rather than create a variant api for each
different condition. Can you help me out with an attempt of your own,
or at least point me in a direction that you would accept for solving
the "Except the current wait_event() doesn't do the whole key part that
makes the hash-table 'work'." problem that you highlighted?
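The idea of passing a conditional expression, in the style of wait_event(), instead of adding one API per target value can be sketched as a userspace C macro. `WAIT_COND`, `drop_one_pin()` and `interrupted()` are hypothetical names; the sketch is single-threaded, so each `action` call stands in for sleeping and simulates either a pin being dropped or a signal arriving. It deliberately does not solve the keyed hash-table wakeup problem quoted above; it only shows a caller-supplied condition replacing a hard-coded target.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>

/*
 * Hypothetical wait_event()-style helper: `cond` is any C expression and is
 * re-evaluated after every wait step. `action(ctx)` stands in for sleeping;
 * a nonzero return aborts the wait, like a wait_atomic_t_action_f.
 * The result (0 or the action's error) is stored in `ret`.
 */
#define WAIT_COND(ret, cond, action, ctx)		\
	do {						\
		(ret) = 0;				\
		while (!(cond)) {			\
			(ret) = (action)(ctx);		\
			if ((ret) != 0)			\
				break;			\
		}					\
	} while (0)

/* Hypothetical action: pretend one in-flight DMA completes per wait step. */
static int drop_one_pin(void *ctx)
{
	atomic_fetch_sub((atomic_int *)ctx, 1);
	return 0;
}

/* Hypothetical action: pretend a signal arrived, aborting the wait. */
static int interrupted(void *ctx)
{
	(void)ctx;
	return -EINTR;
}
```

The same macro then covers `== 0`, `== 1` or `<= 1` without new entry points, e.g. `WAIT_COND(ret, atomic_load(&refs) <= 1, drop_one_pin, &refs);`.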
diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 619b1ed6434c..7e10fa3460e2 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -167,7 +167,7 @@ struct dax_device {
 #if IS_ENABLED(CONFIG_FS_DAX)
 static void generic_dax_pagefree(struct page *page, void *data)
 {
-	/* TODO: wakeup page-idle waiters */
+	wake_up_atomic_one(&page->_refcount);
 }
 
 struct dax_device *fs_dax_claim_bdev(struct block_device *bdev, void *owner)
diff --git a/include/linux/wait_bit.h b/include/linux/wait_bit.h
index 61b39eaf7cad..564c9a0141cd 100644
--- a/include/linux/wait_bit.h
+++ b/include/linux/wait_bit.h
@@ -33,10 +33,15 @@ int __wait_on_bit(struct wait_queue_head *wq_head, struct wait_bit_queue_entry *
 int __wait_on_bit_lock(struct wait_queue_head *wq_head, struct wait_bit_queue_entry *wbq_entry, wait_bit_action_f *action, unsigned int mode);
 void wake_up_bit(void *word, int bit);
 void wake_up_atomic_t(atomic_t *p);
+static inline void wake_up_atomic_one(atomic_t *p)
+{
+	wake_up_atomic_t(p);
+}
 int out_of_line_wait_on_bit(void *word, int, wait_bit_action_f *action, unsigned int mode);
 int out_of_line_wait_on_bit_timeout(void *word, int, wait_bit_action_f *action, unsigned int mode, unsigned long timeout);
 int out_of_line_wait_on_bit_lock(void *word, int, wait_bit_action_f *action, unsigned int mode);
 int out_of_line_wait_on_atomic_t(atomic_t *p, wait_atomic_t_action_f action, unsigned int mode);
+int out_of_line_wait_on_atomic_one(atomic_t *p, wait_atomic_t_action_f action, unsigned int mode);
 struct wait_queue_head *bit_waitqueue(void *word, int bit);
 extern void __init wait_bit_init(void);
 
@@ -262,4 +267,12 @@ int wait_on_atomic_t(atomic_t *val, wait_atomic_t_action_f action, unsigned mode
 	return out_of_line_wait_on_atomic_t(val, action, mode);
 }
 
+static inline
+int wait_on_atomic_one(atomic_t *val, wait_atomic_t_action_f action, unsigned mode)
+{
+	might_sleep();
+	if (atomic_read(val) == 1)
+		return 0;
+	return out_of_line_wait_on_atomic_one(val, action, mode);
+}
 #endif /* _LINUX_WAIT_BIT_H */
diff --git a/kernel/sched/wait_bit.c b/kernel/sched/wait_bit.c
index 84cb3acd9260..8739b1e50df5 100644
--- a/kernel/sched/wait_bit.c
+++ b/kernel/sched/wait_bit.c
@@ -162,28 +162,47 @@ static inline wait_queue_head_t *atomic_t_waitqueue(atomic_t *p)
 	return bit_waitqueue(p, 0);
 }
 
-static int wake_atomic_t_function(struct wait_queue_entry *wq_entry, unsigned mode, int sync,
-				  void *arg)
+static struct wait_bit_queue_entry *to_wait_bit_q(
+		struct wait_queue_entry *wq_entry)
+{
+	return container_of(wq_entry, struct wait_bit_queue_entry, wq_entry);
+}
+
+static int __wake_atomic_t_function(struct wait_queue_entry *wq_entry,
+		unsigned mode, int sync, void *arg, int target)
 {
 	struct wait_bit_key *key = arg;
-	struct wait_bit_queue_entry *wait_bit = container_of(wq_entry, struct wait_bit_queue_entry, wq_entry);
+	struct wait_bit_queue_entry *wait_bit = to_wait_bit_q(wq_entry);
 	atomic_t *val = key->flags;
 
 	if (wait_bit->key.flags != key->flags ||
 	    wait_bit->key.bit_nr != key->bit_nr ||
-	    atomic_read(val) != 0)
+	    atomic_read(val) != target)
 		return 0;
 	return autoremove_wake_function(wq_entry, mode, sync, key);
 }
 
+static int wake_atomic_t_function(struct wait_queue_entry *wq_entry,
+		unsigned mode, int sync, void *arg)
+{
+	return __wake_atomic_t_function(wq_entry, mode, sync, arg, 0);
+}
+
+static int wake_atomic_one_function(struct wait_queue_entry *wq_entry,
+		unsigned mode, int sync, void *arg)
+{
+	return __wake_atomic_t_function(wq_entry, mode, sync, arg, 1);
+}
+
 /*
  * To allow interruptible waiting and asynchronous (i.e. nonblocking) waiting,
  * the actions of __wait_on_atomic_t() are permitted return codes. Nonzero
  * return codes halt waiting and return.
  */
 static __sched
-int __wait_on_atomic_t(struct wait_queue_head *wq_head, struct wait_bit_queue_entry *wbq_entry,
-		       wait_atomic_t_action_f action, unsigned int mode)
+int __wait_on_atomic_t(struct wait_queue_head *wq_head,
+		struct wait_bit_queue_entry *wbq_entry,
+		wait_atomic_t_action_f action, unsigned int mode, int target)
 {
 	atomic_t *val;
 	int ret = 0;
@@ -191,10 +210,10 @@ int __wait_on_atomic_t(struct wait_queue_head *wq_head, struct wait_bit_queue_en
 	do {
 		prepare_to_wait(wq_head, &wbq_entry->wq_entry, mode);
 		val = wbq_entry->key.flags;
-		if (atomic_read(val) == 0)
+		if (atomic_read(val) == target)
 			break;
 		ret = (*action)(val, mode);
-	} while (!ret && atomic_read(val) != 0);
+	} while (!ret && atomic_read(val) != target);
 	finish_wait(wq_head, &wbq_entry->wq_entry);
 	return ret;
 }
@@ -210,6 +229,17 @@ int __wait_on_atomic_t(struct wait_queue_head *wq_head, struct wait_bit_queue_en
 		},							\
 	}
 
+#define DEFINE_WAIT_ATOMIC_ONE(name, p)					\
+	struct wait_bit_queue_entry name = {				\
+		.key = __WAIT_ATOMIC_T_KEY_INITIALIZER(p),		\
+		.wq_entry = {						\
+			.private	= current,			\
+			.func		= wake_atomic_one_function,	\
+			.entry		=				\
+				LIST_HEAD_INIT((name).wq_entry.entry),	\
+		},							\
+	}
+
 __sched int out_of_line_wait_on_atomic_t(atomic_t *p,
 					 wait_atomic_t_action_f action,
 					 unsigned int mode)
@@ -217,7 +247,7 @@ __sched int out_of_line_wait_on_atomic_t(atomic_t *p,
 	struct wait_queue_head *wq_head = atomic_t_waitqueue(p);
 	DEFINE_WAIT_ATOMIC_T(wq_entry, p);
 
-	return __wait_on_atomic_t(wq_head, &wq_entry, action, mode);
+	return __wait_on_atomic_t(wq_head, &wq_entry, action, mode, 0);
 }
 EXPORT_SYMBOL(out_of_line_wait_on_atomic_t);
 
@@ -230,6 +260,17 @@ __sched int atomic_t_wait(atomic_t *counter, unsigned int mode)
 }
 EXPORT_SYMBOL(atomic_t_wait);
 
+__sched int out_of_line_wait_on_atomic_one(atomic_t *p,
+					   wait_atomic_t_action_f action,
+					   unsigned int mode)
+{
+	struct wait_queue_head *wq_head = atomic_t_waitqueue(p);
+	DEFINE_WAIT_ATOMIC_ONE(wq_entry, p);
+
+	return __wait_on_atomic_t(wq_head, &wq_entry, action, mode, 1);
+}
+EXPORT_SYMBOL(out_of_line_wait_on_atomic_one);
+
 /**
  * wake_up_atomic_t - Wake up a waiter on a atomic_t
  * @p: The atomic_t being waited on, a kernel virtual address
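The core of the patch is that `__wait_on_atomic_t()` and the wake filter take a `target` instead of hard-coding 0, with thin wrappers for the 0 and 1 cases. Below is a single-threaded userspace C11 analogue of that loop and of the inline fast path. All names (`wait_action_fn`, `wait_on_atomic_target()`, `wait_on_zero()`, `wait_on_one()`, `drop_ref()`, `abort_wait()`) are hypothetical; prepare_to_wait()/finish_wait() and scheduling are omitted, and the action callback simulates a reference being dropped or an abort instead of sleeping.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>

/* Hypothetical counterpart of wait_atomic_t_action_f; nonzero aborts. */
typedef int (*wait_action_fn)(atomic_int *val);

/*
 * Hypothetical analogue of the patched __wait_on_atomic_t(): the loop is the
 * same for every target, only the comparison value changes.
 */
static int wait_on_atomic_target(atomic_int *val, wait_action_fn action,
				 int target)
{
	int ret = 0;

	do {
		if (atomic_load(val) == target)
			break;
		ret = action(val);
	} while (!ret && atomic_load(val) != target);
	return ret;
}

/* Hypothetical wrapper with a fast path, like wait_on_atomic_t(). */
static int wait_on_zero(atomic_int *val, wait_action_fn action)
{
	if (atomic_load(val) == 0)
		return 0;
	return wait_on_atomic_target(val, action, 0);
}

/* Hypothetical wrapper with a fast path, like wait_on_atomic_one(). */
static int wait_on_one(atomic_int *val, wait_action_fn action)
{
	if (atomic_load(val) == 1)
		return 0;
	return wait_on_atomic_target(val, action, 1);
}

/* Hypothetical action: each simulated "sleep" ends with one ref dropped. */
static int drop_ref(atomic_int *val)
{
	atomic_fetch_sub(val, 1);
	return 0;
}

/* Hypothetical action: always aborts, as a pending signal would. */
static int abort_wait(atomic_int *val)
{
	(void)val;
	return -EINTR;
}
```

Each additional target value still needs its own wrapper here (and, in the kernel version, its own wake function and DEFINE_WAIT_* initializer), which is the per-condition API growth discussed in the thread above.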