diff mbox series

[2/5] genirq/affinity: allow driver to setup managed IRQ's affinity

Message ID 20190125095347.17950-3-ming.lei@redhat.com (mailing list archive)
State New, archived
Delegated to: Bjorn Helgaas
Headers show
Series genirq/affinity: introduce .setup_affinity to support allocating interrupt sets | expand

Commit Message

Ming Lei Jan. 25, 2019, 9:53 a.m. UTC
This patch introduces callback of .setup_affinity into 'struct
irq_affinity', so that:

1) allow drivers to customize the affinity for managed IRQ, for
example, now NVMe has special requirement for read queues & poll
queues

2) 6da4b3ab9a6e9 ("genirq/affinity: Add support for allocating interrupt sets")
makes pci_alloc_irq_vectors_affinity() a bit difficult to use for
allocating interrupt sets: 'max_vecs' is required to same with 'min_vecs'.

With this patch, driver can implement their own .setup_affinity to
customize the affinity, then the above thing can be solved easily.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 include/linux/interrupt.h | 26 +++++++++++++++++---------
 kernel/irq/affinity.c     |  6 ++++++
 2 files changed, 23 insertions(+), 9 deletions(-)

Comments

Bjorn Helgaas Feb. 7, 2019, 10:21 p.m. UTC | #1
On Fri, Jan 25, 2019 at 05:53:44PM +0800, Ming Lei wrote:
> This patch introduces callback of .setup_affinity into 'struct
> irq_affinity', so that:
> 
> 1) allow drivers to customize the affinity for managed IRQ, for
> example, now NVMe has special requirement for read queues & poll
> queues
> 
> 2) 6da4b3ab9a6e9 ("genirq/affinity: Add support for allocating interrupt sets")
> makes pci_alloc_irq_vectors_affinity() a bit difficult to use for
> allocating interrupt sets: 'max_vecs' is required to same with 'min_vecs'.

s/is required to same with/is required to be equal to/

> With this patch, driver can implement their own .setup_affinity to
> customize the affinity, then the above thing can be solved easily.

s/driver/drivers/

> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  include/linux/interrupt.h | 26 +++++++++++++++++---------
>  kernel/irq/affinity.c     |  6 ++++++
>  2 files changed, 23 insertions(+), 9 deletions(-)
> 
> diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
> index c672f34235e7..f6cea778cf50 100644
> --- a/include/linux/interrupt.h
> +++ b/include/linux/interrupt.h
> @@ -242,30 +242,38 @@ struct irq_affinity_notify {
>  };
>  
>  /**
> + * struct irq_affinity_desc - Interrupt affinity descriptor
> + * @mask:	cpumask to hold the affinity assignment
> + */
> +struct irq_affinity_desc {
> +	struct cpumask	mask;
> +	unsigned int	is_managed : 1;
> +};

I was going to comment that "managed" already has a common usage
related to the devm_*() functions, but I don't think that's what you
mean here.  But then I noticed that you're only *moving* this code,
so you couldn't change it anyway.

But I still wonder what "managed" means here.

> +
> +/**
>   * struct irq_affinity - Description for automatic irq affinity assignements
>   * @pre_vectors:	Don't apply affinity to @pre_vectors at beginning of
>   *			the MSI(-X) vector space
>   * @post_vectors:	Don't apply affinity to @post_vectors at end of
>   *			the MSI(-X) vector space
> + * @setup_affinity:	Use driver's method to setup irq vectors affinity,
> + * 			and driver has to handle pre_vectors & post_vectors
> + * 			correctly, set 'is_managed' flag correct too

s/irq vectors/irq vector/
s/correct/correctly/

In general I don't think "correctly" is very useful in changelogs
and comments.  Usually it just means "how *I* think it should be
done", but it doesn't tell anybody else exactly *how* it should be
done.

What does it mean for a driver to handle pre_vectors & post_vectors
"correctly"?  The driver's .setup_affinity() method receives an array
of struct irq_affinity; maybe it means that method should set the
cpumask for each element as it desires.  For @pre_vectors and
@post_vectors, I suppose that means their cpumask would be
irq_default_affinity?

But I guess the .setup_affinity() method means the driver would have
complete flexibility for each vector, and it could use
irq_default_affinity for arbitrary vectors, not just the first few
(@pre_vectors) and the last few (@post_vectors)?

What's the rule for how a driver sets "is_managed"?

> + * @priv:		Private data of @setup_affinity
>   * @nr_sets:		Length of passed in *sets array
>   * @sets:		Number of affinitized sets
>   */
>  struct irq_affinity {
>  	int	pre_vectors;
>  	int	post_vectors;
> +	int	(*setup_affinity)(const struct irq_affinity *,
> +				  struct irq_affinity_desc *,
> +				  unsigned int);
> +	void    *priv;
>  	int	nr_sets;
>  	int	*sets;
>  };
>  
> -/**
> - * struct irq_affinity_desc - Interrupt affinity descriptor
> - * @mask:	cpumask to hold the affinity assignment
> - */
> -struct irq_affinity_desc {
> -	struct cpumask	mask;
> -	unsigned int	is_managed : 1;
> -};
> -
>  #if defined(CONFIG_SMP)
>  
>  extern cpumask_var_t irq_default_affinity;
> diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
> index 118b66d64a53..7b77cbdf739c 100644
> --- a/kernel/irq/affinity.c
> +++ b/kernel/irq/affinity.c
> @@ -257,6 +257,12 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
>  	if (!masks)
>  		return NULL;
>  
> +	if (affd->setup_affinity) {
> +		if (affd->setup_affinity(affd, masks, nvecs))
> +			return NULL;
> +		return masks;
> +	}

>  	/* Fill out vectors at the beginning that don't need affinity */
>  	for (curvec = 0; curvec < affd->pre_vectors; curvec++)
>  		cpumask_copy(&masks[curvec].mask, irq_default_affinity);
> -- 
> 2.9.5
>
Ming Lei Feb. 10, 2019, 9:22 a.m. UTC | #2
On Thu, Feb 07, 2019 at 04:21:30PM -0600, Bjorn Helgaas wrote:
> On Fri, Jan 25, 2019 at 05:53:44PM +0800, Ming Lei wrote:
> > This patch introduces callback of .setup_affinity into 'struct
> > irq_affinity', so that:
> > 
> > 1) allow drivers to customize the affinity for managed IRQ, for
> > example, now NVMe has special requirement for read queues & poll
> > queues
> > 
> > 2) 6da4b3ab9a6e9 ("genirq/affinity: Add support for allocating interrupt sets")
> > makes pci_alloc_irq_vectors_affinity() a bit difficult to use for
> > allocating interrupt sets: 'max_vecs' is required to same with 'min_vecs'.
> 
> s/is required to same with/is required to be equal to/
> 
> > With this patch, driver can implement their own .setup_affinity to
> > customize the affinity, then the above thing can be solved easily.
> 
> s/driver/drivers/
> 
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > ---
> >  include/linux/interrupt.h | 26 +++++++++++++++++---------
> >  kernel/irq/affinity.c     |  6 ++++++
> >  2 files changed, 23 insertions(+), 9 deletions(-)
> > 
> > diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
> > index c672f34235e7..f6cea778cf50 100644
> > --- a/include/linux/interrupt.h
> > +++ b/include/linux/interrupt.h
> > @@ -242,30 +242,38 @@ struct irq_affinity_notify {
> >  };
> >  
> >  /**
> > + * struct irq_affinity_desc - Interrupt affinity descriptor
> > + * @mask:	cpumask to hold the affinity assignment
> > + */
> > +struct irq_affinity_desc {
> > +	struct cpumask	mask;
> > +	unsigned int	is_managed : 1;
> > +};
> 
> I was going to comment that "managed" already has a common usage
> related to the devm_*() functions, but I don't think that's what you
> mean here.  But then I noticed that you're only *moving* this code,
> so you couldn't change it anyway.
> 
> But I still wonder what "managed" means here.

'managed' irq's affinity is managed by kernel, and user space can't change
it any more via /proc/irq/...

> 
> > +
> > +/**
> >   * struct irq_affinity - Description for automatic irq affinity assignements
> >   * @pre_vectors:	Don't apply affinity to @pre_vectors at beginning of
> >   *			the MSI(-X) vector space
> >   * @post_vectors:	Don't apply affinity to @post_vectors at end of
> >   *			the MSI(-X) vector space
> > + * @setup_affinity:	Use driver's method to setup irq vectors affinity,
> > + * 			and driver has to handle pre_vectors & post_vectors
> > + * 			correctly, set 'is_managed' flag correct too
> 
> s/irq vectors/irq vector/
> s/correct/correctly/
> 
> In general I don't think "correctly" is very useful in changelogs
> and comments.  Usually it just means "how *I* think it should be
> done", but it doesn't tell anybody else exactly *how* it should be
> done.

That is one nice question.

So far there are at least two rules related with correctness:

1) 'is_managed' flag needs to be set

2) for all managed irq vectors in one interrupt set of one device,
all possible CPUs should be covered in the setup affinities, meantime
one CPU can't be allocated to two irq vectors. The interrupt set is
unique for NVMe actually now, such as, all READ queues' interrupts
belong to one set, and other queues belong to another set. For other
device, we can think there is only one set for one device.

> 
> What does it mean for a driver to handle pre_vectors & post_vectors
> "correctly"?  The driver's .setup_affinity() method receives an array
> of struct irq_affinity; maybe it means that method should set the
> cpumask for each element as it desires.  For @pre_vectors and
> @post_vectors, I suppose that means their cpumask would be
> irq_default_affinity?

So far the default affinity for pre_vectors & post_vectors is to use
irq_default_affinity, and this patch changes it to cpu_possible_mask,
and this change won't be one big deal from driver's view.

> 
> But I guess the .setup_affinity() method means the driver would have
> complete flexibility for each vector, and it could use

Yeah, together with simplification in both driver and genirq/affinity
code.

> irq_default_affinity for arbitrary vectors, not just the first few
> (@pre_vectors) and the last few (@post_vectors)?
> 
> What's the rule for how a driver sets "is_managed"?

Please see above, and I plan to enhance the rule in genirq/affinity code
if this approach may be accepted.

Thanks,
Ming
Thomas Gleixner Feb. 10, 2019, 4:30 p.m. UTC | #3
Ming,

On Fri, 25 Jan 2019, Ming Lei wrote:

> This patch introduces callback of .setup_affinity into 'struct
> irq_affinity', so that:

Please see Documentation/process/submitting-patches.rst. Search for 'This
patch' ....

> 
> 1) allow drivers to customize the affinity for managed IRQ, for
> example, now NVMe has special requirement for read queues & poll
> queues

That's nothing new and already handled today.

> 2) 6da4b3ab9a6e9 ("genirq/affinity: Add support for allocating interrupt sets")
> makes pci_alloc_irq_vectors_affinity() a bit difficult to use for
> allocating interrupt sets: 'max_vecs' is required to same with 'min_vecs'.

So it's a bit difficult, but you fail to explain why it's not sufficient.

> With this patch, driver can implement their own .setup_affinity to
> customize the affinity, then the above thing can be solved easily.

Well, I don't really understand what is solved easily and you are merily
describing the fact that the new callback allows drivers to customize
something. What's the rationale? If it's just the 'bit difficult' part,
then what is the reason for not making the core functionality easier to use
instead of moving stuff into driver space again?

NVME is not special and all this achieves is that all drivers writers will
claim that their device is special and needs its own affinity setter
routine. The whole point of having the generic code is to exactly avoid
that. If it has shortcomings, then they need to be addressed, but not
worked around with random driver callbacks.

Thanks,

	tglx
Ming Lei Feb. 11, 2019, 3:54 a.m. UTC | #4
Hello Thomas,

On Sun, Feb 10, 2019 at 05:30:41PM +0100, Thomas Gleixner wrote:
> Ming,
> 
> On Fri, 25 Jan 2019, Ming Lei wrote:
> 
> > This patch introduces callback of .setup_affinity into 'struct
> > irq_affinity', so that:
> 
> Please see Documentation/process/submitting-patches.rst. Search for 'This
> patch' ....

Sorry for that, because I am not a native English speaker and it looks a bit
difficult for me to understand the subtle difference.

> 
> > 
> > 1) allow drivers to customize the affinity for managed IRQ, for
> > example, now NVMe has special requirement for read queues & poll
> > queues
> 
> That's nothing new and already handled today.
> 
> > 2) 6da4b3ab9a6e9 ("genirq/affinity: Add support for allocating interrupt sets")
> > makes pci_alloc_irq_vectors_affinity() a bit difficult to use for
> > allocating interrupt sets: 'max_vecs' is required to same with 'min_vecs'.
> 
> So it's a bit difficult, but you fail to explain why it's not sufficient.

The introduced limit is that 'max_vecs' has to be same with 'min_vecs' for
pci_alloc_irq_vectors_affinity() wrt. NVMe's use case since commit
6da4b3ab9a6e9, then NVMe has to deal with irq vectors allocation failure in
the awkward way of retrying.

And the topic has been discussed in the following links:

https://marc.info/?l=linux-pci&m=154655595615575&w=2
https://marc.info/?l=linux-pci&m=154646930922174&w=2

Bjorn and Keith thought this usage/interface is a bit awkward because the passed
'min_vecs' should have avoided driver's retrying.

For NVMe, when irq vectors are run out of from pci_alloc_irq_vectors_affinity(),
the requested number has to be decreased and retry until it succeeds, then the
allocated irq vectors has to be re-distributed among the whole irq sets. Turns
out the re-distribution need driver's knowledge, that is why the callback is
introduced.

> 
> > With this patch, driver can implement their own .setup_affinity to
> > customize the affinity, then the above thing can be solved easily.
> 
> Well, I don't really understand what is solved easily and you are merily
> describing the fact that the new callback allows drivers to customize
> something. What's the rationale? If it's just the 'bit difficult' part,
> then what is the reason for not making the core functionality easier to use
> instead of moving stuff into driver space again?

Another solution mentioned in previous discussion is to split building & setting up
affinities from allocating irq vectors, but one big problem is that allocating
'irq_desc' needs the affinity mask for figuring out 'node', see alloc_descs().

> 
> NVME is not special and all this achieves is that all drivers writers will

I mean that NVMe is the only user of irq sets.

> claim that their device is special and needs its own affinity setter
> routine. The whole point of having the generic code is to exactly avoid
> that. If it has shortcomings, then they need to be addressed, but not
> worked around with random driver callbacks.

Understood.

Thanks,
Ming
Bjorn Helgaas Feb. 11, 2019, 2:39 p.m. UTC | #5
On Mon, Feb 11, 2019 at 11:54:00AM +0800, Ming Lei wrote:
> On Sun, Feb 10, 2019 at 05:30:41PM +0100, Thomas Gleixner wrote:
> > On Fri, 25 Jan 2019, Ming Lei wrote:
> > 
> > > This patch introduces callback of .setup_affinity into 'struct
> > > irq_affinity', so that:
> > 
> > Please see Documentation/process/submitting-patches.rst. Search for 'This
> > patch' ....
> 
> Sorry for that, because I am not a native English speaker and it looks a bit
> difficult for me to understand the subtle difference.

I think Thomas is saying that instead of "This patch introduces
callback ...", you could say "Introduce callback of ...".

The changelog is *part* of the patch, so the context is obvious and
there's no need to include the words "This patch".

I make the same changes to patches I receive.  In fact, I would go
even further and say "Add callback .setup_affinity() ..." because "add"
means the same as "introduce" but is shorter and simpler.

Bjorn
Thomas Gleixner Feb. 11, 2019, 10:38 p.m. UTC | #6
Ming,

On Mon, 11 Feb 2019, Bjorn Helgaas wrote:

> On Mon, Feb 11, 2019 at 11:54:00AM +0800, Ming Lei wrote:
> > On Sun, Feb 10, 2019 at 05:30:41PM +0100, Thomas Gleixner wrote:
> > > On Fri, 25 Jan 2019, Ming Lei wrote:
> > > 
> > > > This patch introduces callback of .setup_affinity into 'struct
> > > > irq_affinity', so that:
> > > 
> > > Please see Documentation/process/submitting-patches.rst. Search for 'This
> > > patch' ....
> > 
> > Sorry for that, because I am not a native English speaker and it looks a bit
> > difficult for me to understand the subtle difference.

Sorry I was a bit terse.

> I think Thomas is saying that instead of "This patch introduces
> callback ...", you could say "Introduce callback of ...".
> 
> The changelog is *part* of the patch, so the context is obvious and
> there's no need to include the words "This patch".
> 
> I make the same changes to patches I receive.  In fact, I would go
> even further and say "Add callback .setup_affinity() ..." because "add"
> means the same as "introduce" but is shorter and simpler.

Thanks for the explanation, Bjorn!

There is another point here. It's not only the 'This patch introduces ...'
part. It's also good practice to structure the changelog so it provides
context and reasoning first and then tells what the change is, e.g.:

   The current handling of multiple interrupt sets in the core interrupt
   affinity logic, requires the driver to do .......  This is necessary
   because ....

   This handling should be in the core code, but the core implementation
   has no way to recompute the interrupt sets for a given number of
   vectors.

   Add an optional callback to struct affd, which lets the driver recompute
   the interrupt set before the interrupt affinity spreading takes place.

The first paragraph provides context, i.e. the status quo, The second
paragraph provides reasoning what is missing and the last one tells what's
done to solve it.

That's pretty much the same for all changelogs, even if you fix a bug. Just
in the bug case, the second paragraph describes the details of the bug and
the possible consequences.

You really need to look at the changelog as a stand alone information
source. That's important when you look at a commit as an outsider or even
if you look at your own patch 6 month down the road when you already paged
out all the details.

Hope that clarifies it.

Thanks,

	tglx
Ming Lei Feb. 12, 2019, 11:17 a.m. UTC | #7
Hi Thomas,

On Mon, Feb 11, 2019 at 11:38:07PM +0100, Thomas Gleixner wrote:
> Ming,
> 
> On Mon, 11 Feb 2019, Bjorn Helgaas wrote:
> 
> > On Mon, Feb 11, 2019 at 11:54:00AM +0800, Ming Lei wrote:
> > > On Sun, Feb 10, 2019 at 05:30:41PM +0100, Thomas Gleixner wrote:
> > > > On Fri, 25 Jan 2019, Ming Lei wrote:
> > > > 
> > > > > This patch introduces callback of .setup_affinity into 'struct
> > > > > irq_affinity', so that:
> > > > 
> > > > Please see Documentation/process/submitting-patches.rst. Search for 'This
> > > > patch' ....
> > > 
> > > Sorry for that, because I am not a native English speaker and it looks a bit
> > > difficult for me to understand the subtle difference.
> 
> Sorry I was a bit terse.
> 
> > I think Thomas is saying that instead of "This patch introduces
> > callback ...", you could say "Introduce callback of ...".
> > 
> > The changelog is *part* of the patch, so the context is obvious and
> > there's no need to include the words "This patch".
> > 
> > I make the same changes to patches I receive.  In fact, I would go
> > even further and say "Add callback .setup_affinity() ..." because "add"
> > means the same as "introduce" but is shorter and simpler.
> 
> Thanks for the explanation, Bjorn!
> 
> There is another point here. It's not only the 'This patch introduces ...'
> part. It's also good practice to structure the changelog so it provides
> context and reasoning first and then tells what the change is, e.g.:
> 
>    The current handling of multiple interrupt sets in the core interrupt
>    affinity logic, requires the driver to do .......  This is necessary
>    because ....
> 
>    This handling should be in the core code, but the core implementation
>    has no way to recompute the interrupt sets for a given number of
>    vectors.
> 
>    Add an optional callback to struct affd, which lets the driver recompute
>    the interrupt set before the interrupt affinity spreading takes place.
> 
> The first paragraph provides context, i.e. the status quo, The second
> paragraph provides reasoning what is missing and the last one tells what's
> done to solve it.
> 
> That's pretty much the same for all changelogs, even if you fix a bug. Just
> in the bug case, the second paragraph describes the details of the bug and
> the possible consequences.
> 
> You really need to look at the changelog as a stand alone information
> source. That's important when you look at a commit as an outsider or even
> if you look at your own patch 6 month down the road when you already paged
> out all the details.
> 
> Hope that clarifies it.

Your description about how to write changelog is really helpful and useful
for me, thanks!

Maybe you can add it into Documentation/SubmittingPatches, so that lots
of people can benefit from the guide.


Thanks,
Ming
diff mbox series

Patch

diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index c672f34235e7..f6cea778cf50 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -242,30 +242,38 @@  struct irq_affinity_notify {
 };
 
 /**
+ * struct irq_affinity_desc - Interrupt affinity descriptor
+ * @mask:	cpumask to hold the affinity assignment
+ */
+struct irq_affinity_desc {
+	struct cpumask	mask;
+	unsigned int	is_managed : 1;
+};
+
+/**
  * struct irq_affinity - Description for automatic irq affinity assignements
  * @pre_vectors:	Don't apply affinity to @pre_vectors at beginning of
  *			the MSI(-X) vector space
  * @post_vectors:	Don't apply affinity to @post_vectors at end of
  *			the MSI(-X) vector space
+ * @setup_affinity:	Use driver's method to setup irq vectors affinity,
+ * 			and driver has to handle pre_vectors & post_vectors
+ * 			correctly, set 'is_managed' flag correct too
+ * @priv:		Private data of @setup_affinity
  * @nr_sets:		Length of passed in *sets array
  * @sets:		Number of affinitized sets
  */
 struct irq_affinity {
 	int	pre_vectors;
 	int	post_vectors;
+	int	(*setup_affinity)(const struct irq_affinity *,
+				  struct irq_affinity_desc *,
+				  unsigned int);
+	void    *priv;
 	int	nr_sets;
 	int	*sets;
 };
 
-/**
- * struct irq_affinity_desc - Interrupt affinity descriptor
- * @mask:	cpumask to hold the affinity assignment
- */
-struct irq_affinity_desc {
-	struct cpumask	mask;
-	unsigned int	is_managed : 1;
-};
-
 #if defined(CONFIG_SMP)
 
 extern cpumask_var_t irq_default_affinity;
diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
index 118b66d64a53..7b77cbdf739c 100644
--- a/kernel/irq/affinity.c
+++ b/kernel/irq/affinity.c
@@ -257,6 +257,12 @@  irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
 	if (!masks)
 		return NULL;
 
+	if (affd->setup_affinity) {
+		if (affd->setup_affinity(affd, masks, nvecs))
+			return NULL;
+		return masks;
+	}
+
 	/* Fill out vectors at the beginning that don't need affinity */
 	for (curvec = 0; curvec < affd->pre_vectors; curvec++)
 		cpumask_copy(&masks[curvec].mask, irq_default_affinity);