[driver-core,v6,2/9] async: Add support for queueing on specific NUMA node

Message ID: 154170041079.12967.13132220574997822111.stgit@ahduyck-desk1.jf.intel.com
State: Superseded
Series: Add NUMA aware async_schedule calls

Commit Message

Alexander Duyck Nov. 8, 2018, 6:06 p.m. UTC
Introduce four new variants of the async_schedule_ functions that allow
scheduling on a specific NUMA node.

The first two functions, async_schedule_node and
async_schedule_node_domain, map to async_schedule and
async_schedule_domain but provide NUMA node specific functionality. They
replace the original functions, which have been converted to inline
wrappers that call the new functions while passing NUMA_NO_NODE.

The second two functions, async_schedule_dev and
async_schedule_dev_domain, provide NUMA specific functionality when the
data argument is a device and that device has a NUMA node other than
NUMA_NO_NODE.

The main motivation behind this is the need to schedule device specific
init work on specific NUMA nodes in order to improve the performance of
memory initialization.
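
To illustrate, a minimal usage sketch of the interface (not part of the
patch; the demo_* names are made up) showing how a driver might hand its
long-running init off to a CPU near the device's node:

    #include <linux/async.h>
    #include <linux/device.h>

    /* async_func_t callback: runs on a CPU near the device's node. */
    static void demo_dev_init(void *data, async_cookie_t cookie)
    {
            struct device *dev = data;

            dev_info(dev, "async init on node %d\n", dev_to_node(dev));
    }

    static int demo_probe(struct device *dev)
    {
            /*
             * The node hint is best effort: if dev_to_node(dev) is
             * NUMA_NO_NODE, or the node has no CPUs, the work may run
             * on any CPU.
             */
            async_schedule_dev(demo_dev_init, dev);
            return 0;
    }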

Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
---
 include/linux/async.h |   82 +++++++++++++++++++++++++++++++++++++++++++++++--
 kernel/async.c        |   53 +++++++++++++++++---------------
 2 files changed, 108 insertions(+), 27 deletions(-)

Comments

Bart Van Assche Nov. 8, 2018, 11:36 p.m. UTC | #1
On Thu, 2018-11-08 at 10:06 -0800, Alexander Duyck wrote:
> Introduce four new variants of the async_schedule_ functions that allow
> scheduling on a specific NUMA node.
> 
> The first two functions, async_schedule_node and
> async_schedule_node_domain, map to async_schedule and
> async_schedule_domain but provide NUMA node specific functionality. They
> replace the original functions, which have been converted to inline
> wrappers that call the new functions while passing NUMA_NO_NODE.
>
> The second two functions, async_schedule_dev and
> async_schedule_dev_domain, provide NUMA specific functionality when the
> data argument is a device and that device has a NUMA node other than
> NUMA_NO_NODE.
>
> The main motivation behind this is the need to schedule device specific
> init work on specific NUMA nodes in order to improve the performance of
> memory initialization.

Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Greg KH Nov. 11, 2018, 7:32 p.m. UTC | #2
On Thu, Nov 08, 2018 at 10:06:50AM -0800, Alexander Duyck wrote:
> Introduce four new variants of the async_schedule_ functions that allow
> scheduling on a specific NUMA node.
> 
> The first two functions, async_schedule_node and
> async_schedule_node_domain, map to async_schedule and
> async_schedule_domain but provide NUMA node specific functionality. They
> replace the original functions, which have been converted to inline
> wrappers that call the new functions while passing NUMA_NO_NODE.
>
> The second two functions, async_schedule_dev and
> async_schedule_dev_domain, provide NUMA specific functionality when the
> data argument is a device and that device has a NUMA node other than
> NUMA_NO_NODE.
>
> The main motivation behind this is the need to schedule device specific
> init work on specific NUMA nodes in order to improve the performance of
> memory initialization.
> 
> Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
> ---

No one else from Intel has reviewed/verified this code at all?

Please take advantage of the resources you have that most people do
not; get reviews from your coworkers before you send this out again,
as they can give you valuable help before the community has to review
the code...

thanks,

greg k-h
Dan Williams Nov. 11, 2018, 7:53 p.m. UTC | #3
On Sun, Nov 11, 2018 at 11:32 AM Greg KH <gregkh@linuxfoundation.org> wrote:
>
> On Thu, Nov 08, 2018 at 10:06:50AM -0800, Alexander Duyck wrote:
> > Introduce four new variants of the async_schedule_ functions that allow
> > scheduling on a specific NUMA node.
> >
> > The first two functions, async_schedule_node and
> > async_schedule_node_domain, map to async_schedule and
> > async_schedule_domain but provide NUMA node specific functionality. They
> > replace the original functions, which have been converted to inline
> > wrappers that call the new functions while passing NUMA_NO_NODE.
> >
> > The second two functions, async_schedule_dev and
> > async_schedule_dev_domain, provide NUMA specific functionality when the
> > data argument is a device and that device has a NUMA node other than
> > NUMA_NO_NODE.
> >
> > The main motivation behind this is the need to schedule device specific
> > init work on specific NUMA nodes in order to improve the performance of
> > memory initialization.
> >
> > Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
> > ---
>
> No one else from Intel has reviewed/verified this code at all?
>
> Please take advantage of the resources you have that most people do
> not; get reviews from your coworkers before you send this out again,
> as they can give you valuable help before the community has to review
> the code...

I tend to be suspicious of code that arrives on the mailing list
day-one with a series of company-internal reviewed-by tags. Sometimes
there is preliminary work that can be done internally, but I think we
should prefer to do review in the open as much as possible where it
does not waste community time. Alex and I did reach a general internal
consensus to send this out and get community feedback, but I intended
to do the bulk of the review in parallel with everyone else. That said,
I think it's fine to ask for some other acks before you take a look,
but let's do that in the open.
Greg KH Nov. 11, 2018, 8:33 p.m. UTC | #4
On Sun, Nov 11, 2018 at 08:59:03PM +0100, Pavel Machek wrote:
> On Sun 2018-11-11 11:32:08, Greg KH wrote:
> > On Thu, Nov 08, 2018 at 10:06:50AM -0800, Alexander Duyck wrote:
> > > Introduce four new variants of the async_schedule_ functions that allow
> > > scheduling on a specific NUMA node.
> > > 
> > > The first two functions, async_schedule_node and
> > > async_schedule_node_domain, map to async_schedule and
> > > async_schedule_domain but provide NUMA node specific functionality. They
> > > replace the original functions, which have been converted to inline
> > > wrappers that call the new functions while passing NUMA_NO_NODE.
> > >
> > > The second two functions, async_schedule_dev and
> > > async_schedule_dev_domain, provide NUMA specific functionality when the
> > > data argument is a device and that device has a NUMA node other than
> > > NUMA_NO_NODE.
> > >
> > > The main motivation behind this is the need to schedule device specific
> > > init work on specific NUMA nodes in order to improve the performance of
> > > memory initialization.
> > > 
> > > Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
> > > ---
> > 
> > No one else from Intel has reviewed/verified this code at all?
> > 
> > Please take advantage of the resources you have that most people do
> > not; get reviews from your coworkers before you send this out again,
> > as they can give you valuable help before the community has to review
> > the code...
> 
> We always said to companies we want to see code as soon as
> possible. You don't have to review their code, but discouraging the
> posting seems wrong.

I have a long history of Intel using me for their basic "find the
obvious bugs" review process for new driver subsystems and core changes.
When I see new major patches show up from an Intel developer without
_any_ other review from anyone else, directed at me, I get suspicious it
is happening again.

If you note, Intel is the _only_ company whose developers I say this to,
because of this problem combined with the fact that they have a whole
load of developers that they should be running things by first.

And yes, to answer Dan's point, we do want to do review in public.  But
this is v6 of a core patchset and there has been NO review from anyone
else at Intel on this.  So if that review was going to happen, one would
have thought it would have by now, instead of relying on me to do it.

And yes, I am grumpy, but I am grumpy because of the history here.  I am
not trying to discourage anything, only to take ADVANTAGE of resources
that almost no other company provides.

Hope this helps explain my statement here.

thanks,

greg k-h
Greg KH Nov. 11, 2018, 8:35 p.m. UTC | #5
On Sun, Nov 11, 2018 at 11:53:20AM -0800, Dan Williams wrote:
> On Sun, Nov 11, 2018 at 11:32 AM Greg KH <gregkh@linuxfoundation.org> wrote:
> >
> > On Thu, Nov 08, 2018 at 10:06:50AM -0800, Alexander Duyck wrote:
> > > Introduce four new variants of the async_schedule_ functions that allow
> > > scheduling on a specific NUMA node.
> > >
> > > The first two functions, async_schedule_node and
> > > async_schedule_node_domain, map to async_schedule and
> > > async_schedule_domain but provide NUMA node specific functionality. They
> > > replace the original functions, which have been converted to inline
> > > wrappers that call the new functions while passing NUMA_NO_NODE.
> > >
> > > The second two functions, async_schedule_dev and
> > > async_schedule_dev_domain, provide NUMA specific functionality when the
> > > data argument is a device and that device has a NUMA node other than
> > > NUMA_NO_NODE.
> > >
> > > The main motivation behind this is the need to schedule device specific
> > > init work on specific NUMA nodes in order to improve the performance of
> > > memory initialization.
> > >
> > > Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
> > > ---
> >
> > No one else from Intel has reviewed/verified this code at all?
> >
> > Please take advantage of the resources you have that most people do
> > not; get reviews from your coworkers before you send this out again,
> > as they can give you valuable help before the community has to review
> > the code...
> 
> I tend to be suspicious of code that arrives on the mailing list
> day-one with a series of company-internal reviewed-by tags. Sometimes
> there is preliminary work that can be done internally, but I think we
> should prefer to do review in the open as much as possible where it
> does not waste community time. Alex and I did reach a general internal
> consensus to send this out and get community feedback, but I assumed
> to do the bulk of the review in parallel with everyone else. That said
> I think it's fine to ask for some other acks before you take a look,
> but let's do that in the open.

Doing it in the open is great, see my response to Pavel for the history
of why I am normally suspicious of this, and why I wrote the above.

Also this patchset has had a long history of me asking for things, and
not seeing the changes happen (hint, where are the benchmark numbers I
asked for a long time ago?)  Touching the driver core like this is
tricky, and without others helping in review and test, it makes me
suspicious that it is not happening.

This would be a great time for some other people to do that review :)

thanks,

greg k-h
Bart Van Assche Nov. 11, 2018, 9:24 p.m. UTC | #6
On 11/11/18 12:33 PM, Greg KH wrote:
> When I see new major patches show up from an Intel developer without
> _any_ other review from anyone else, directed at me, I get suspicious it
> is happening again.

Hi Greg,

Please note that I reviewed this patch four days ago. See also
https://lore.kernel.org/lkml/1541720197.196084.231.camel@acm.org/.

Thanks,

Bart.
Dan Williams Nov. 11, 2018, 10:17 p.m. UTC | #7
On Sun, Nov 11, 2018 at 12:35 PM Greg KH <gregkh@linuxfoundation.org> wrote:
>
> On Sun, Nov 11, 2018 at 11:53:20AM -0800, Dan Williams wrote:
> > On Sun, Nov 11, 2018 at 11:32 AM Greg KH <gregkh@linuxfoundation.org> wrote:
[..]
> This would be a great time for some other people to do that review :)

Message received, I'll block some time.
Alexander Duyck Nov. 11, 2018, 11:27 p.m. UTC | #8
On 11/11/2018 12:35 PM, Greg KH wrote:
> On Sun, Nov 11, 2018 at 11:53:20AM -0800, Dan Williams wrote:
>> On Sun, Nov 11, 2018 at 11:32 AM Greg KH <gregkh@linuxfoundation.org> wrote:
>>>
>>> On Thu, Nov 08, 2018 at 10:06:50AM -0800, Alexander Duyck wrote:
>>>> Introduce four new variants of the async_schedule_ functions that allow
>>>> scheduling on a specific NUMA node.
>>>>
>>>> The first two functions, async_schedule_node and
>>>> async_schedule_node_domain, map to async_schedule and
>>>> async_schedule_domain but provide NUMA node specific functionality. They
>>>> replace the original functions, which have been converted to inline
>>>> wrappers that call the new functions while passing NUMA_NO_NODE.
>>>>
>>>> The second two functions, async_schedule_dev and
>>>> async_schedule_dev_domain, provide NUMA specific functionality when the
>>>> data argument is a device and that device has a NUMA node other than
>>>> NUMA_NO_NODE.
>>>>
>>>> The main motivation behind this is the need to schedule device specific
>>>> init work on specific NUMA nodes in order to improve the performance of
>>>> memory initialization.
>>>>
>>>> Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
>>>> ---
>>>
>>> No one else from Intel has reviewed/verified this code at all?
>>>
>>> Please take advantage of the resources you have that most people do
>>> not; get reviews from your coworkers before you send this out again,
>>> as they can give you valuable help before the community has to review
>>> the code...
>>
>> I tend to be suspicious of code that arrives on the mailing list
>> day-one with a series of company-internal reviewed-by tags. Sometimes
>> there is preliminary work that can be done internally, but I think we
>> should prefer to do review in the open as much as possible where it
>> does not waste community time. Alex and I did reach a general internal
>> consensus to send this out and get community feedback, but I assumed
>> to do the bulk of the review in parallel with everyone else. That said
>> I think it's fine to ask for some other acks before you take a look,
>> but let's do that in the open.
> 
> Doing it in the open is great, see my response to Pavel for the history
> of why I am normally suspicious of this, and why I wrote the above.
> 
> Also this patchset has had a long history of me asking for things, and
> not seeing the changes happen (hint, where are the benchmark numbers I
> asked for a long time ago?)  Touching the driver core like this is
> tricky, and without others helping in review and test, it makes me
> suspicious that it is not happening.
> 
> This would be a great time for some other people to do that review :)
> 
> thanks,
> 
> greg k-h

Is there any specific benchmark test you were wanting me to run? As far
as crude numbers go, this patch set started out specifically focused on
patch 9/9, but I thought it best to apply it more generically as I found
we could still run into the issue if we enabled async_probe.
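
For reference, the async_probe case above is the one where a driver
opts in to asynchronous probing, so its probe path runs through this
same async machinery; a hypothetical sketch (the demo_* names are made
up):

    #include <linux/platform_device.h>

    static int demo_probe(struct platform_device *pdev)
    {
            return 0;
    }

    /* Probing is deferred to the async queue instead of running
     * synchronously at registration time. */
    static struct platform_driver demo_driver = {
            .probe = demo_probe,
            .driver = {
                    .name = "demo",
                    .probe_type = PROBE_PREFER_ASYNCHRONOUS,
            },
    };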

What I have seen on several systems is a pretty significant improvement 
in initialization time for persistent memory. In the case of 3TB of 
memory being initialized on a single node the improvement in the worst 
case was from about 36s down to 26s for total initialization time.

I plan to resubmit this set after Plumbers, since there were a few typos
and bits of comment left over in a patch description that needed to be
sorted out. I will make certain to include any benchmark data I have
with the set the next time I put it out.

Thanks.

- Alex
Dan Williams Nov. 27, 2018, 1:10 a.m. UTC | #9
On Thu, Nov 8, 2018 at 10:06 AM Alexander Duyck
<alexander.h.duyck@linux.intel.com> wrote:
>
> Introduce four new variants of the async_schedule_ functions that allow
> scheduling on a specific NUMA node.
>
> The first two functions, async_schedule_node and
> async_schedule_node_domain, map to async_schedule and
> async_schedule_domain but provide NUMA node specific functionality. They
> replace the original functions, which have been converted to inline
> wrappers that call the new functions while passing NUMA_NO_NODE.
>
> The second two functions, async_schedule_dev and
> async_schedule_dev_domain, provide NUMA specific functionality when the
> data argument is a device and that device has a NUMA node other than
> NUMA_NO_NODE.
>
> The main motivation behind this is the need to schedule device specific
> init work on specific NUMA nodes in order to improve the performance of
> memory initialization.

What Andrew tends to do when an enhancement is spread over multiple
patches is to duplicate the motivation in each patch. So here you
could include the few sentences you wrote about the performance
benefits of this work:

"What I have seen on several systems is a pretty significant improvement
in initialization time for persistent memory. In the case of 3TB of
memory being initialized on a single node the improvement in the worst
case was from about 36s down to 26s for total initialization time."

...and conclude that the data shows a general benefit for affinitizing
async work to a specific NUMA node.

With that changelog clarification:

Reviewed-by: Dan Williams <dan.j.williams@intel.com>

Patch

diff --git a/include/linux/async.h b/include/linux/async.h
index 6b0226bdaadc..f81d6dbffe68 100644
--- a/include/linux/async.h
+++ b/include/linux/async.h
@@ -14,6 +14,8 @@ 
 
 #include <linux/types.h>
 #include <linux/list.h>
+#include <linux/numa.h>
+#include <linux/device.h>
 
 typedef u64 async_cookie_t;
 typedef void (*async_func_t) (void *data, async_cookie_t cookie);
@@ -37,9 +39,83 @@  struct async_domain {
 	struct async_domain _name = { .pending = LIST_HEAD_INIT(_name.pending), \
 				      .registered = 0 }
 
-extern async_cookie_t async_schedule(async_func_t func, void *data);
-extern async_cookie_t async_schedule_domain(async_func_t func, void *data,
-					    struct async_domain *domain);
+async_cookie_t async_schedule_node(async_func_t func, void *data,
+				   int node);
+async_cookie_t async_schedule_node_domain(async_func_t func, void *data,
+					  int node,
+					  struct async_domain *domain);
+
+/**
+ * async_schedule - schedule a function for asynchronous execution
+ * @func: function to execute asynchronously
+ * @data: data pointer to pass to the function
+ *
+ * Returns an async_cookie_t that may be used for checkpointing later.
+ * Note: This function may be called from atomic or non-atomic contexts.
+ */
+static inline async_cookie_t async_schedule(async_func_t func, void *data)
+{
+	return async_schedule_node(func, data, NUMA_NO_NODE);
+}
+
+/**
+ * async_schedule_domain - schedule a function for asynchronous execution within a certain domain
+ * @func: function to execute asynchronously
+ * @data: data pointer to pass to the function
+ * @domain: the domain
+ *
+ * Returns an async_cookie_t that may be used for checkpointing later.
+ * @domain may be used in the async_synchronize_*_domain() functions to
+ * wait within a certain synchronization domain rather than globally.
+ * Note: This function may be called from atomic or non-atomic contexts.
+ */
+static inline async_cookie_t
+async_schedule_domain(async_func_t func, void *data,
+		      struct async_domain *domain)
+{
+	return async_schedule_node_domain(func, data, NUMA_NO_NODE, domain);
+}
+
+/**
+ * async_schedule_dev - A device specific version of async_schedule
+ * @func: function to execute asynchronously
+ * @dev: device argument to be passed to function
+ *
+ * Returns an async_cookie_t that may be used for checkpointing later.
+ * @dev is used as both the argument for the function and to provide NUMA
+ * context for where to run the function. By doing this we can try to
+ * provide for the best possible outcome by operating on the device on the
+ * CPUs closest to the device.
+ * Note: This function may be called from atomic or non-atomic contexts.
+ */
+static inline async_cookie_t
+async_schedule_dev(async_func_t func, struct device *dev)
+{
+	return async_schedule_node(func, dev, dev_to_node(dev));
+}
+
+/**
+ * async_schedule_dev_domain - A device specific version of async_schedule_domain
+ * @func: function to execute asynchronously
+ * @dev: device argument to be passed to function
+ * @domain: the domain
+ *
+ * Returns an async_cookie_t that may be used for checkpointing later.
+ * @dev is used as both the argument for the function and to provide NUMA
+ * context for where to run the function. By doing this we can try to
+ * provide for the best possible outcome by operating on the device on the
+ * CPUs closest to the device.
+ * @domain may be used in the async_synchronize_*_domain() functions to
+ * wait within a certain synchronization domain rather than globally.
+ * Note: This function may be called from atomic or non-atomic contexts.
+ */
+static inline async_cookie_t
+async_schedule_dev_domain(async_func_t func, struct device *dev,
+			  struct async_domain *domain)
+{
+	return async_schedule_node_domain(func, dev, dev_to_node(dev), domain);
+}
+
 void async_unregister_domain(struct async_domain *domain);
 extern void async_synchronize_full(void);
 extern void async_synchronize_full_domain(struct async_domain *domain);
diff --git a/kernel/async.c b/kernel/async.c
index a893d6170944..f6bd0d9885e1 100644
--- a/kernel/async.c
+++ b/kernel/async.c
@@ -149,7 +149,25 @@  static void async_run_entry_fn(struct work_struct *work)
 	wake_up(&async_done);
 }
 
-static async_cookie_t __async_schedule(async_func_t func, void *data, struct async_domain *domain)
+/**
+ * async_schedule_node_domain - NUMA specific version of async_schedule_domain
+ * @func: function to execute asynchronously
+ * @data: data pointer to pass to the function
+ * @node: NUMA node that we want to schedule this on or close to
+ * @domain: the domain
+ *
+ * Returns an async_cookie_t that may be used for checkpointing later.
+ * @domain may be used in the async_synchronize_*_domain() functions to
+ * wait within a certain synchronization domain rather than globally.
+ *
+ * Note: This function may be called from atomic or non-atomic contexts.
+ *
+ * The node requested will be honored on a best effort basis. If the node
+ * has no CPUs associated with it then the work is distributed among all
+ * available CPUs.
+ */
+async_cookie_t async_schedule_node_domain(async_func_t func, void *data,
+					  int node, struct async_domain *domain)
 {
 	struct async_entry *entry;
 	unsigned long flags;
@@ -195,43 +213,30 @@  static async_cookie_t __async_schedule(async_func_t func, void *data, struct asy
 	current->flags |= PF_USED_ASYNC;
 
 	/* schedule for execution */
-	queue_work(system_unbound_wq, &entry->work);
+	queue_work_node(node, system_unbound_wq, &entry->work);
 
 	return newcookie;
 }
+EXPORT_SYMBOL_GPL(async_schedule_node_domain);
 
 /**
- * async_schedule - schedule a function for asynchronous execution
+ * async_schedule_node - NUMA specific version of async_schedule
  * @func: function to execute asynchronously
  * @data: data pointer to pass to the function
+ * @node: NUMA node that we want to schedule this on or close to
  *
  * Returns an async_cookie_t that may be used for checkpointing later.
  * Note: This function may be called from atomic or non-atomic contexts.
- */
-async_cookie_t async_schedule(async_func_t func, void *data)
-{
-	return __async_schedule(func, data, &async_dfl_domain);
-}
-EXPORT_SYMBOL_GPL(async_schedule);
-
-/**
- * async_schedule_domain - schedule a function for asynchronous execution within a certain domain
- * @func: function to execute asynchronously
- * @data: data pointer to pass to the function
- * @domain: the domain
  *
- * Returns an async_cookie_t that may be used for checkpointing later.
- * @domain may be used in the async_synchronize_*_domain() functions to
- * wait within a certain synchronization domain rather than globally.  A
- * synchronization domain is specified via @domain.  Note: This function
- * may be called from atomic or non-atomic contexts.
+ * The node requested will be honored on a best effort basis. If the node
+ * has no CPUs associated with it then the work is distributed among all
+ * available CPUs.
  */
-async_cookie_t async_schedule_domain(async_func_t func, void *data,
-				     struct async_domain *domain)
+async_cookie_t async_schedule_node(async_func_t func, void *data, int node)
 {
-	return __async_schedule(func, data, domain);
+	return async_schedule_node_domain(func, data, node, &async_dfl_domain);
 }
-EXPORT_SYMBOL_GPL(async_schedule_domain);
+EXPORT_SYMBOL_GPL(async_schedule_node);
 
 /**
  * async_synchronize_full - synchronize all asynchronous function calls