Message ID | 20190125095347.17950-3-ming.lei@redhat.com (mailing list archive) |
---|---|
State | New, archived |
Delegated to: | Bjorn Helgaas |
Series | genirq/affinity: introduce .setup_affinity to support allocating interrupt sets |
On Fri, Jan 25, 2019 at 05:53:44PM +0800, Ming Lei wrote:
> This patch introduces callback of .setup_affinity into 'struct
> irq_affinity', so that:
>
> 1) allow drivers to customize the affinity for managed IRQ, for
> example, now NVMe has special requirement for read queues & poll
> queues
>
> 2) 6da4b3ab9a6e9 ("genirq/affinity: Add support for allocating interrupt sets")
> makes pci_alloc_irq_vectors_affinity() a bit difficult to use for
> allocating interrupt sets: 'max_vecs' is required to same with 'min_vecs'.

s/is required to same with/is required to be equal to/

> With this patch, driver can implement their own .setup_affinity to
> customize the affinity, then the above thing can be solved easily.

s/driver/drivers/

> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  include/linux/interrupt.h | 26 +++++++++++++++++---------
>  kernel/irq/affinity.c     |  6 ++++++
>  2 files changed, 23 insertions(+), 9 deletions(-)
>
> diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
> index c672f34235e7..f6cea778cf50 100644
> --- a/include/linux/interrupt.h
> +++ b/include/linux/interrupt.h
> @@ -242,30 +242,38 @@ struct irq_affinity_notify {
>  };
>
>  /**
> + * struct irq_affinity_desc - Interrupt affinity descriptor
> + * @mask:	cpumask to hold the affinity assignment
> + */
> +struct irq_affinity_desc {
> +	struct cpumask	mask;
> +	unsigned int	is_managed : 1;
> +};

I was going to comment that "managed" already has a common usage
related to the devm_*() functions, but I don't think that's what you
mean here.  But then I noticed that you're only *moving* this code,
so you couldn't change it anyway.

But I still wonder what "managed" means here.

> +
> +/**
>   * struct irq_affinity - Description for automatic irq affinity assignements
>   * @pre_vectors:	Don't apply affinity to @pre_vectors at beginning of
>   *			the MSI(-X) vector space
>   * @post_vectors:	Don't apply affinity to @post_vectors at end of
>   *			the MSI(-X) vector space
> + * @setup_affinity:	Use driver's method to setup irq vectors affinity,
> + *			and driver has to handle pre_vectors & post_vectors
> + *			correctly, set 'is_managed' flag correct too

s/irq vectors/irq vector/
s/correct/correctly/

In general I don't think "correctly" is very useful in changelogs
and comments.  Usually it just means "how *I* think it should be
done", but it doesn't tell anybody else exactly *how* it should be
done.

What does it mean for a driver to handle pre_vectors & post_vectors
"correctly"?  The driver's .setup_affinity() method receives an array
of struct irq_affinity; maybe it means that method should set the
cpumask for each element as it desires.  For @pre_vectors and
@post_vectors, I suppose that means their cpumask would be
irq_default_affinity?

But I guess the .setup_affinity() method means the driver would have
complete flexibility for each vector, and it could use
irq_default_affinity for arbitrary vectors, not just the first few
(@pre_vectors) and the last few (@post_vectors)?

What's the rule for how a driver sets "is_managed"?

> + * @priv:		Private data of @setup_affinity
>   * @nr_sets:		Length of passed in *sets array
>   * @sets:		Number of affinitized sets
>   */
>  struct irq_affinity {
>  	int	pre_vectors;
>  	int	post_vectors;
> +	int	(*setup_affinity)(const struct irq_affinity *,
> +				  struct irq_affinity_desc *,
> +				  unsigned int);
> +	void	*priv;
>  	int	nr_sets;
>  	int	*sets;
>  };
>
> -/**
> - * struct irq_affinity_desc - Interrupt affinity descriptor
> - * @mask:	cpumask to hold the affinity assignment
> - */
> -struct irq_affinity_desc {
> -	struct cpumask	mask;
> -	unsigned int	is_managed : 1;
> -};
> -
>  #if defined(CONFIG_SMP)
>
>  extern cpumask_var_t irq_default_affinity;
> diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
> index 118b66d64a53..7b77cbdf739c 100644
> --- a/kernel/irq/affinity.c
> +++ b/kernel/irq/affinity.c
> @@ -257,6 +257,12 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
>  	if (!masks)
>  		return NULL;
>
> +	if (affd->setup_affinity) {
> +		if (affd->setup_affinity(affd, masks, nvecs))
> +			return NULL;
> +		return masks;
> +	}
> +
>  	/* Fill out vectors at the beginning that don't need affinity */
>  	for (curvec = 0; curvec < affd->pre_vectors; curvec++)
>  		cpumask_copy(&masks[curvec].mask, irq_default_affinity);
> --
> 2.9.5
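The questions above are easier to reason about with a concrete example. Below is a minimal sketch of what a driver's .setup_affinity() implementation could look like under the interface proposed in this patch; the foo_* name and the chunked spreading policy are illustrative assumptions, not code from the series. It follows one plausible answer to Bjorn's questions: pre/post vectors keep irq_default_affinity and stay unmanaged, while the remaining vectors receive disjoint chunks of cpu_possible_mask and are marked managed.

```c
#include <linux/cpumask.h>
#include <linux/interrupt.h>

/*
 * Hypothetical driver callback for the proposed .setup_affinity hook.
 * Pre/post vectors keep the default affinity and stay unmanaged; the
 * remaining vectors share disjoint chunks of cpu_possible_mask and
 * are marked managed.
 */
static int foo_setup_affinity(const struct irq_affinity *affd,
			      struct irq_affinity_desc *masks,
			      unsigned int nvecs)
{
	unsigned int first = affd->pre_vectors;
	unsigned int last = nvecs - affd->post_vectors;
	unsigned int cpus = num_possible_cpus();
	unsigned int nr_managed, i, idx = 0;
	int cpu;

	if (last <= first)
		return -EINVAL;
	nr_managed = last - first;

	/* Vectors excluded from spreading get the default affinity */
	for (i = 0; i < nvecs; i++) {
		cpumask_clear(&masks[i].mask);
		if (i < first || i >= last)
			cpumask_copy(&masks[i].mask, irq_default_affinity);
		masks[i].is_managed = (i >= first && i < last);
	}

	/*
	 * Chunk the possible CPUs over the managed vectors: the masks
	 * are disjoint and together cover every possible CPU.
	 */
	for_each_possible_cpu(cpu) {
		i = first + (idx * nr_managed) / cpus;
		cpumask_set_cpu(cpu, &masks[i].mask);
		idx++;
	}
	return 0;
}
```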
On Thu, Feb 07, 2019 at 04:21:30PM -0600, Bjorn Helgaas wrote:
> On Fri, Jan 25, 2019 at 05:53:44PM +0800, Ming Lei wrote:
> > This patch introduces callback of .setup_affinity into 'struct
> > irq_affinity', so that:
> >
> > 1) allow drivers to customize the affinity for managed IRQ, for
> > example, now NVMe has special requirement for read queues & poll
> > queues
> >
> > 2) 6da4b3ab9a6e9 ("genirq/affinity: Add support for allocating interrupt sets")
> > makes pci_alloc_irq_vectors_affinity() a bit difficult to use for
> > allocating interrupt sets: 'max_vecs' is required to same with 'min_vecs'.
>
> s/is required to same with/is required to be equal to/
>
> > With this patch, driver can implement their own .setup_affinity to
> > customize the affinity, then the above thing can be solved easily.
>
> s/driver/drivers/
>
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > ---
> >  include/linux/interrupt.h | 26 +++++++++++++++++---------
> >  kernel/irq/affinity.c     |  6 ++++++
> >  2 files changed, 23 insertions(+), 9 deletions(-)
> >
> > diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
> > index c672f34235e7..f6cea778cf50 100644
> > --- a/include/linux/interrupt.h
> > +++ b/include/linux/interrupt.h
> > @@ -242,30 +242,38 @@ struct irq_affinity_notify {
> >  };
> >
> >  /**
> > + * struct irq_affinity_desc - Interrupt affinity descriptor
> > + * @mask:	cpumask to hold the affinity assignment
> > + */
> > +struct irq_affinity_desc {
> > +	struct cpumask	mask;
> > +	unsigned int	is_managed : 1;
> > +};
>
> I was going to comment that "managed" already has a common usage
> related to the devm_*() functions, but I don't think that's what you
> mean here.  But then I noticed that you're only *moving* this code,
> so you couldn't change it anyway.
>
> But I still wonder what "managed" means here.

A 'managed' irq's affinity is managed by the kernel, and user space can't
change it any more via /proc/irq/...

> > +
> > +/**
> >   * struct irq_affinity - Description for automatic irq affinity assignements
> >   * @pre_vectors:	Don't apply affinity to @pre_vectors at beginning of
> >   *			the MSI(-X) vector space
> >   * @post_vectors:	Don't apply affinity to @post_vectors at end of
> >   *			the MSI(-X) vector space
> > + * @setup_affinity:	Use driver's method to setup irq vectors affinity,
> > + *			and driver has to handle pre_vectors & post_vectors
> > + *			correctly, set 'is_managed' flag correct too
>
> s/irq vectors/irq vector/
> s/correct/correctly/
>
> In general I don't think "correctly" is very useful in changelogs
> and comments.  Usually it just means "how *I* think it should be
> done", but it doesn't tell anybody else exactly *how* it should be
> done.

That is one nice question. So far there are at least two rules related
to correctness:

1) the 'is_managed' flag needs to be set

2) for all managed irq vectors in one interrupt set of one device, all
possible CPUs should be covered in the setup affinities, and meantime one
CPU can't be allocated to two irq vectors.

The interrupt set is unique to NVMe now: for example, all READ queues'
interrupts belong to one set, and the other queues belong to another set.
For other devices, we can think of there being only one set per device.

> What does it mean for a driver to handle pre_vectors & post_vectors
> "correctly"?  The driver's .setup_affinity() method receives an array
> of struct irq_affinity; maybe it means that method should set the
> cpumask for each element as it desires.  For @pre_vectors and
> @post_vectors, I suppose that means their cpumask would be
> irq_default_affinity?

So far the default affinity for pre_vectors & post_vectors is to use
irq_default_affinity, and this patch changes it to cpu_possible_mask;
this change won't be a big deal from the driver's view.

> But I guess the .setup_affinity() method means the driver would have
> complete flexibility for each vector, and it could use

Yeah, together with simplification in both driver and genirq/affinity code.

> irq_default_affinity for arbitrary vectors, not just the first few
> (@pre_vectors) and the last few (@post_vectors)?
>
> What's the rule for how a driver sets "is_managed"?

Please see above, and I plan to enhance the rule in genirq/affinity code
if this approach may be accepted.

Thanks,
Ming
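The two rules Ming states are prose only; as an editorial sketch (no such helper exists in the series), a per-set checker could make them concrete. Here foo_check_irq_set(), 'first', and 'nr' are hypothetical names bounding one interrupt set within the descriptor array:

```c
#include <linux/cpumask.h>
#include <linux/interrupt.h>
#include <linux/slab.h>

/*
 * Sketch of a validator for one interrupt set: every vector must be
 * marked managed, the per-vector masks must be pairwise disjoint, and
 * together they must cover every possible CPU.
 */
static bool foo_check_irq_set(const struct irq_affinity_desc *masks,
			      unsigned int first, unsigned int nr)
{
	cpumask_var_t seen;
	unsigned int i;
	bool ok = false;

	if (!zalloc_cpumask_var(&seen, GFP_KERNEL))
		return false;

	for (i = first; i < first + nr; i++) {
		/* Rule 1: each vector of the set is managed */
		if (!masks[i].is_managed)
			goto out;
		/* Rule 2a: no CPU appears in two vectors of the set */
		if (cpumask_intersects(seen, &masks[i].mask))
			goto out;
		cpumask_or(seen, seen, &masks[i].mask);
	}
	/* Rule 2b: the set as a whole covers all possible CPUs */
	ok = cpumask_equal(seen, cpu_possible_mask);
out:
	free_cpumask_var(seen);
	return ok;
}
```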
Ming,

On Fri, 25 Jan 2019, Ming Lei wrote:

> This patch introduces callback of .setup_affinity into 'struct
> irq_affinity', so that:

Please see Documentation/process/submitting-patches.rst. Search for 'This
patch' ....

> 1) allow drivers to customize the affinity for managed IRQ, for
> example, now NVMe has special requirement for read queues & poll
> queues

That's nothing new and already handled today.

> 2) 6da4b3ab9a6e9 ("genirq/affinity: Add support for allocating interrupt sets")
> makes pci_alloc_irq_vectors_affinity() a bit difficult to use for
> allocating interrupt sets: 'max_vecs' is required to same with 'min_vecs'.

So it's a bit difficult, but you fail to explain why it's not sufficient.

> With this patch, driver can implement their own .setup_affinity to
> customize the affinity, then the above thing can be solved easily.

Well, I don't really understand what is solved easily and you are merely
describing the fact that the new callback allows drivers to customize
something. What's the rationale? If it's just the 'bit difficult' part,
then what is the reason for not making the core functionality easier to use
instead of moving stuff into driver space again?

NVME is not special and all this achieves is that all driver writers will
claim that their device is special and needs its own affinity setter
routine. The whole point of having the generic code is to exactly avoid
that. If it has shortcomings, then they need to be addressed, but not
worked around with random driver callbacks.

Thanks,

	tglx
Hello Thomas,

On Sun, Feb 10, 2019 at 05:30:41PM +0100, Thomas Gleixner wrote:
> Ming,
>
> On Fri, 25 Jan 2019, Ming Lei wrote:
>
> > This patch introduces callback of .setup_affinity into 'struct
> > irq_affinity', so that:
>
> Please see Documentation/process/submitting-patches.rst. Search for 'This
> patch' ....

Sorry for that, because I am not a native English speaker and it looks a bit
difficult for me to understand the subtle difference.

> >
> > 1) allow drivers to customize the affinity for managed IRQ, for
> > example, now NVMe has special requirement for read queues & poll
> > queues
>
> That's nothing new and already handled today.
>
> > 2) 6da4b3ab9a6e9 ("genirq/affinity: Add support for allocating interrupt sets")
> > makes pci_alloc_irq_vectors_affinity() a bit difficult to use for
> > allocating interrupt sets: 'max_vecs' is required to same with 'min_vecs'.
>
> So it's a bit difficult, but you fail to explain why it's not sufficient.

The limit introduced by commit 6da4b3ab9a6e9 is that 'max_vecs' has to be
the same as 'min_vecs' for pci_alloc_irq_vectors_affinity() wrt. NVMe's
use case, so NVMe has to deal with irq vector allocation failure in the
awkward way of retrying.

And the topic has been discussed in the following links:

https://marc.info/?l=linux-pci&m=154655595615575&w=2
https://marc.info/?l=linux-pci&m=154646930922174&w=2

Bjorn and Keith thought this usage/interface is a bit awkward because the
passed 'min_vecs' should have avoided the driver's retrying.

For NVMe, when pci_alloc_irq_vectors_affinity() runs out of irq vectors,
the requested number has to be decreased and the allocation retried until
it succeeds, then the allocated irq vectors have to be re-distributed
among the whole irq sets. It turns out the re-distribution needs the
driver's knowledge, and that is why the callback is introduced.

> > With this patch, driver can implement their own .setup_affinity to
> > customize the affinity, then the above thing can be solved easily.
>
> Well, I don't really understand what is solved easily and you are merely
> describing the fact that the new callback allows drivers to customize
> something. What's the rationale? If it's just the 'bit difficult' part,
> then what is the reason for not making the core functionality easier to use
> instead of moving stuff into driver space again?

Another solution mentioned in the previous discussion is to split building
& setting up affinities from allocating irq vectors, but one big problem is
that allocating 'irq_desc' needs the affinity mask for figuring out 'node',
see alloc_descs().

>
> NVME is not special and all this achieves is that all driver writers will

I mean that NVMe is the only user of irq sets.

> claim that their device is special and needs its own affinity setter
> routine. The whole point of having the generic code is to exactly avoid
> that. If it has shortcomings, then they need to be addressed, but not
> worked around with random driver callbacks.

Understood.

Thanks,
Ming
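The retry dance Ming describes looks roughly like the sketch below, loosely modeled on the NVMe driver of that period but simplified; foo_calc_irq_sets() and the 50/50 split are hypothetical stand-ins for the driver's redistribution logic. Because interrupt sets force min_vecs == max_vecs, a shortage comes back as -ENOSPC rather than as a smaller allocation, so the driver must shrink the request, redistribute the sets, and try again:

```c
#include <linux/interrupt.h>
#include <linux/pci.h>

/* Hypothetical: split nr_io_queues vectors between the two irq sets */
static void foo_calc_irq_sets(int *irq_sets, int nr_io_queues)
{
	irq_sets[0] = (nr_io_queues + 1) / 2;		/* e.g. default queues */
	irq_sets[1] = nr_io_queues - irq_sets[0];	/* e.g. read queues */
}

static int foo_setup_irqs(struct pci_dev *pdev, int nr_io_queues)
{
	int irq_sets[2];
	struct irq_affinity affd = {
		.pre_vectors	= 1,	/* admin queue gets vector 0 */
		.nr_sets	= 2,
		.sets		= irq_sets,
	};
	int result;

	do {
		/* Redistribute the vectors among the sets before each try */
		foo_calc_irq_sets(irq_sets, nr_io_queues);

		/*
		 * With interrupt sets, min_vecs must equal max_vecs, so a
		 * shortage surfaces as -ENOSPC instead of a smaller count.
		 */
		result = pci_alloc_irq_vectors_affinity(pdev,
				nr_io_queues + 1, nr_io_queues + 1,
				PCI_IRQ_ALL_TYPES | PCI_IRQ_AFFINITY, &affd);
	} while (result == -ENOSPC && --nr_io_queues > 0);

	return result;
}
```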
On Mon, Feb 11, 2019 at 11:54:00AM +0800, Ming Lei wrote:
> On Sun, Feb 10, 2019 at 05:30:41PM +0100, Thomas Gleixner wrote:
> > On Fri, 25 Jan 2019, Ming Lei wrote:
> >
> > > This patch introduces callback of .setup_affinity into 'struct
> > > irq_affinity', so that:
> >
> > Please see Documentation/process/submitting-patches.rst. Search for 'This
> > patch' ....
>
> Sorry for that, because I am not a native English speaker and it looks a bit
> difficult for me to understand the subtle difference.

I think Thomas is saying that instead of "This patch introduces
callback ...", you could say "Introduce callback of ...".

The changelog is *part* of the patch, so the context is obvious and
there's no need to include the words "This patch".

I make the same changes to patches I receive. In fact, I would go
even further and say "Add callback .setup_affinity() ..." because "add"
means the same as "introduce" but is shorter and simpler.

Bjorn
Ming,

On Mon, 11 Feb 2019, Bjorn Helgaas wrote:
> On Mon, Feb 11, 2019 at 11:54:00AM +0800, Ming Lei wrote:
> > On Sun, Feb 10, 2019 at 05:30:41PM +0100, Thomas Gleixner wrote:
> > > On Fri, 25 Jan 2019, Ming Lei wrote:
> > > > This patch introduces callback of .setup_affinity into 'struct
> > > > irq_affinity', so that:
> > >
> > > Please see Documentation/process/submitting-patches.rst. Search for 'This
> > > patch' ....
> >
> > Sorry for that, because I am not a native English speaker and it looks a bit
> > difficult for me to understand the subtle difference.

Sorry I was a bit terse.

> I think Thomas is saying that instead of "This patch introduces
> callback ...", you could say "Introduce callback of ...".
>
> The changelog is *part* of the patch, so the context is obvious and
> there's no need to include the words "This patch".
>
> I make the same changes to patches I receive. In fact, I would go
> even further and say "Add callback .setup_affinity() ..." because "add"
> means the same as "introduce" but is shorter and simpler.

Thanks for the explanation, Bjorn!

There is another point here. It's not only the 'This patch introduces ...'
part. It's also good practice to structure the changelog so it provides
context and reasoning first and then tells what the change is, e.g.:

  The current handling of multiple interrupt sets in the core interrupt
  affinity logic requires the driver to do ....... This is necessary
  because ....

  This handling should be in the core code, but the core implementation
  has no way to recompute the interrupt sets for a given number of
  vectors.

  Add an optional callback to struct affd, which lets the driver recompute
  the interrupt set before the interrupt affinity spreading takes place.

The first paragraph provides context, i.e. the status quo. The second
paragraph provides the reasoning about what is missing, and the last one
tells what's done to solve it.

That's pretty much the same for all changelogs, even if you fix a bug. Just
in the bug case, the second paragraph describes the details of the bug and
the possible consequences.

You really need to look at the changelog as a stand-alone information
source. That's important when you look at a commit as an outsider or even
if you look at your own patch 6 months down the road when you already paged
out all the details.

Hope that clarifies it.

Thanks,

	tglx
Hi Thomas,

On Mon, Feb 11, 2019 at 11:38:07PM +0100, Thomas Gleixner wrote:
> Ming,
>
> On Mon, 11 Feb 2019, Bjorn Helgaas wrote:
>
> > On Mon, Feb 11, 2019 at 11:54:00AM +0800, Ming Lei wrote:
> > > On Sun, Feb 10, 2019 at 05:30:41PM +0100, Thomas Gleixner wrote:
> > > > On Fri, 25 Jan 2019, Ming Lei wrote:
> > > > > This patch introduces callback of .setup_affinity into 'struct
> > > > > irq_affinity', so that:
> > > >
> > > > Please see Documentation/process/submitting-patches.rst. Search for 'This
> > > > patch' ....
> > >
> > > Sorry for that, because I am not a native English speaker and it looks a bit
> > > difficult for me to understand the subtle difference.
>
> Sorry I was a bit terse.
>
> > I think Thomas is saying that instead of "This patch introduces
> > callback ...", you could say "Introduce callback of ...".
> >
> > The changelog is *part* of the patch, so the context is obvious and
> > there's no need to include the words "This patch".
> >
> > I make the same changes to patches I receive. In fact, I would go
> > even further and say "Add callback .setup_affinity() ..." because "add"
> > means the same as "introduce" but is shorter and simpler.
>
> Thanks for the explanation, Bjorn!
>
> There is another point here. It's not only the 'This patch introduces ...'
> part. It's also good practice to structure the changelog so it provides
> context and reasoning first and then tells what the change is, e.g.:
>
>   The current handling of multiple interrupt sets in the core interrupt
>   affinity logic requires the driver to do ....... This is necessary
>   because ....
>
>   This handling should be in the core code, but the core implementation
>   has no way to recompute the interrupt sets for a given number of
>   vectors.
>
>   Add an optional callback to struct affd, which lets the driver recompute
>   the interrupt set before the interrupt affinity spreading takes place.
>
> The first paragraph provides context, i.e. the status quo. The second
> paragraph provides the reasoning about what is missing, and the last one
> tells what's done to solve it.
>
> That's pretty much the same for all changelogs, even if you fix a bug. Just
> in the bug case, the second paragraph describes the details of the bug and
> the possible consequences.
>
> You really need to look at the changelog as a stand-alone information
> source. That's important when you look at a commit as an outsider or even
> if you look at your own patch 6 months down the road when you already paged
> out all the details.
>
> Hope that clarifies it.

Your description of how to write changelogs is really helpful and useful
for me, thanks!

Maybe you can add it into Documentation/SubmittingPatches, so that lots of
people can benefit from the guide.

Thanks,
Ming
diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index c672f34235e7..f6cea778cf50 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -242,30 +242,38 @@ struct irq_affinity_notify {
 };
 
 /**
+ * struct irq_affinity_desc - Interrupt affinity descriptor
+ * @mask:	cpumask to hold the affinity assignment
+ */
+struct irq_affinity_desc {
+	struct cpumask	mask;
+	unsigned int	is_managed : 1;
+};
+
+/**
  * struct irq_affinity - Description for automatic irq affinity assignements
  * @pre_vectors:	Don't apply affinity to @pre_vectors at beginning of
  *			the MSI(-X) vector space
  * @post_vectors:	Don't apply affinity to @post_vectors at end of
  *			the MSI(-X) vector space
+ * @setup_affinity:	Use driver's method to setup irq vectors affinity,
+ *			and driver has to handle pre_vectors & post_vectors
+ *			correctly, set 'is_managed' flag correct too
+ * @priv:		Private data of @setup_affinity
  * @nr_sets:		Length of passed in *sets array
  * @sets:		Number of affinitized sets
  */
 struct irq_affinity {
 	int	pre_vectors;
 	int	post_vectors;
+	int	(*setup_affinity)(const struct irq_affinity *,
+				  struct irq_affinity_desc *,
+				  unsigned int);
+	void	*priv;
 	int	nr_sets;
 	int	*sets;
 };
 
-/**
- * struct irq_affinity_desc - Interrupt affinity descriptor
- * @mask:	cpumask to hold the affinity assignment
- */
-struct irq_affinity_desc {
-	struct cpumask	mask;
-	unsigned int	is_managed : 1;
-};
-
 #if defined(CONFIG_SMP)
 
 extern cpumask_var_t irq_default_affinity;
diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
index 118b66d64a53..7b77cbdf739c 100644
--- a/kernel/irq/affinity.c
+++ b/kernel/irq/affinity.c
@@ -257,6 +257,12 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
 	if (!masks)
 		return NULL;
 
+	if (affd->setup_affinity) {
+		if (affd->setup_affinity(affd, masks, nvecs))
+			return NULL;
+		return masks;
+	}
+
 	/* Fill out vectors at the beginning that don't need affinity */
 	for (curvec = 0; curvec < affd->pre_vectors; curvec++)
 		cpumask_copy(&masks[curvec].mask, irq_default_affinity);
This patch introduces callback of .setup_affinity into 'struct
irq_affinity', so that:

1) allow drivers to customize the affinity for managed IRQ, for
example, now NVMe has special requirement for read queues & poll
queues

2) 6da4b3ab9a6e9 ("genirq/affinity: Add support for allocating interrupt sets")
makes pci_alloc_irq_vectors_affinity() a bit difficult to use for
allocating interrupt sets: 'max_vecs' is required to same with 'min_vecs'.

With this patch, driver can implement their own .setup_affinity to
customize the affinity, then the above thing can be solved easily.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 include/linux/interrupt.h | 26 +++++++++++++++++---------
 kernel/irq/affinity.c     |  6 ++++++
 2 files changed, 23 insertions(+), 9 deletions(-)
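For context, a hedged sketch of how a driver might have invoked this interface had the series been applied; foo_setup_affinity() is the kind of hypothetical callback sketched earlier in the thread, and foo_alloc_vectors() is likewise illustrative:

```c
#include <linux/interrupt.h>
#include <linux/pci.h>

/*
 * Hypothetical use of the proposed interface. Because .setup_affinity()
 * receives the final vector count, the driver can pass a real min/max
 * range again instead of being forced into min_vecs == max_vecs.
 */
static int foo_alloc_vectors(struct pci_dev *pdev, unsigned int max_vecs,
			     void *driver_data)
{
	struct irq_affinity affd = {
		.pre_vectors	= 1,	/* e.g. one admin vector */
		.setup_affinity	= foo_setup_affinity,
		.priv		= driver_data,
	};

	return pci_alloc_irq_vectors_affinity(pdev, 2, max_vecs,
					      PCI_IRQ_MSIX | PCI_IRQ_AFFINITY,
					      &affd);
}
```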