Message ID | 20181029163738.10172-12-axboe@kernel.dk (mailing list archive) |
---|---|
State | Superseded |
Series | blk-mq: Add support for multiple queue maps |

Jens,

On Mon, 29 Oct 2018, Jens Axboe wrote:

> A driver may have a need to allocate multiple sets of MSI/MSI-X
> interrupts, and have them appropriately affinitized. Add support for
> defining a number of sets in the irq_affinity structure, of varying
> sizes, and get each set affinitized correctly across the machine.
>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: linux-kernel@vger.kernel.org
> Reviewed-by: Hannes Reinecke <hare@suse.com>
> Signed-off-by: Jens Axboe <axboe@kernel.dk>

This looks good.

Vs. merge logistics: I'm expecting some other changes in that area as per
discussion with megasas (IIRC) folks. So I'd like to apply that myself
right after -rc1 and provide it to you as a single commit to pull from so
we can avoid collisions in next and the merge window.

Thanks,

	tglx

On 10/29/18 11:08 AM, Thomas Gleixner wrote:
> Jens,
>
> On Mon, 29 Oct 2018, Jens Axboe wrote:
>
>> A driver may have a need to allocate multiple sets of MSI/MSI-X
>> interrupts, and have them appropriately affinitized. Add support for
>> defining a number of sets in the irq_affinity structure, of varying
>> sizes, and get each set affinitized correctly across the machine.
>>
>> Cc: Thomas Gleixner <tglx@linutronix.de>
>> Cc: linux-kernel@vger.kernel.org
>> Reviewed-by: Hannes Reinecke <hare@suse.com>
>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>
> This looks good.
>
> Vs. merge logistics: I'm expecting some other changes in that area as per
> discussion with megasas (IIRC) folks. So I'd like to apply that myself
> right after -rc1 and provide it to you as a single commit to pull from so
> we can avoid collisions in next and the merge window.

That sounds fine, thanks Thomas!

On Mon, Oct 29, 2018 at 10:37:35AM -0600, Jens Axboe wrote:
> A driver may have a need to allocate multiple sets of MSI/MSI-X
> interrupts, and have them appropriately affinitized. Add support for
> defining a number of sets in the irq_affinity structure, of varying
> sizes, and get each set affinitized correctly across the machine.
>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: linux-kernel@vger.kernel.org
> Reviewed-by: Hannes Reinecke <hare@suse.com>
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> ---
>  include/linux/interrupt.h |  4 ++++
>  kernel/irq/affinity.c     | 40 ++++++++++++++++++++++++++++++---------
>  2 files changed, 35 insertions(+), 9 deletions(-)
>
> diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
> index 1d6711c28271..ca397ff40836 100644
> --- a/include/linux/interrupt.h
> +++ b/include/linux/interrupt.h
> @@ -247,10 +247,14 @@ struct irq_affinity_notify {
>   *			the MSI(-X) vector space
>   * @post_vectors:	Don't apply affinity to @post_vectors at end of
>   *			the MSI(-X) vector space
> + * @nr_sets:		Length of passed in *sets array
> + * @sets:		Number of affinitized sets
>   */
>  struct irq_affinity {
>  	int	pre_vectors;
>  	int	post_vectors;
> +	int	nr_sets;
> +	int	*sets;
>  };
>
>  #if defined(CONFIG_SMP)
> diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
> index f4f29b9d90ee..2046a0f0f0f1 100644
> --- a/kernel/irq/affinity.c
> +++ b/kernel/irq/affinity.c
> @@ -180,6 +180,7 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
>  	int curvec, usedvecs;
>  	cpumask_var_t nmsk, npresmsk, *node_to_cpumask;
>  	struct cpumask *masks = NULL;
> +	int i, nr_sets;
>
>  	/*
>  	 * If there aren't any vectors left after applying the pre/post
> @@ -210,10 +211,23 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
>  	get_online_cpus();
>  	build_node_to_cpumask(node_to_cpumask);
>
> -	/* Spread on present CPUs starting from affd->pre_vectors */
> -	usedvecs = irq_build_affinity_masks(affd, curvec, affvecs,
> -					    node_to_cpumask, cpu_present_mask,
> -					    nmsk, masks);
> +	/*
> +	 * Spread on present CPUs starting from affd->pre_vectors. If we
> +	 * have multiple sets, build each sets affinity mask separately.
> +	 */
> +	nr_sets = affd->nr_sets;
> +	if (!nr_sets)
> +		nr_sets = 1;
> +
> +	for (i = 0, usedvecs = 0; i < nr_sets; i++) {
> +		int this_vecs = affd->sets ? affd->sets[i] : affvecs;
> +		int nr;
> +
> +		nr = irq_build_affinity_masks(affd, curvec, this_vecs,
> +					      node_to_cpumask, cpu_present_mask,
> +					      nmsk, masks + usedvecs);
> +		usedvecs += nr;
> +	}
>
>  	/*
>  	 * Spread on non present CPUs starting from the next vector to be
> @@ -258,13 +272,21 @@ int irq_calc_affinity_vectors(int minvec, int maxvec, const struct irq_affinity
>  {
>  	int resv = affd->pre_vectors + affd->post_vectors;
>  	int vecs = maxvec - resv;
> -	int ret;
> +	int set_vecs;
>
>  	if (resv > minvec)
>  		return 0;
>
> -	get_online_cpus();
> -	ret = min_t(int, cpumask_weight(cpu_possible_mask), vecs) + resv;
> -	put_online_cpus();
> -	return ret;
> +	if (affd->nr_sets) {
> +		int i;
> +
> +		for (i = 0, set_vecs = 0; i < affd->nr_sets; i++)
> +			set_vecs += affd->sets[i];
> +	} else {
> +		get_online_cpus();
> +		set_vecs = cpumask_weight(cpu_possible_mask);
> +		put_online_cpus();
> +	}
> +
> +	return resv + min(set_vecs, vecs);
>  }
> --
> 2.17.1
>

Looks fine:

Reviewed-by: Ming Lei <ming.lei@redhat.com>

On Mon, Oct 29, 2018 at 10:37:35AM -0600, Jens Axboe wrote:
> diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
> index f4f29b9d90ee..2046a0f0f0f1 100644
> --- a/kernel/irq/affinity.c
> +++ b/kernel/irq/affinity.c
> @@ -180,6 +180,7 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
>  	int curvec, usedvecs;
>  	cpumask_var_t nmsk, npresmsk, *node_to_cpumask;
>  	struct cpumask *masks = NULL;
> +	int i, nr_sets;
>
>  	/*
>  	 * If there aren't any vectors left after applying the pre/post
> @@ -210,10 +211,23 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
>  	get_online_cpus();
>  	build_node_to_cpumask(node_to_cpumask);
>
> -	/* Spread on present CPUs starting from affd->pre_vectors */
> -	usedvecs = irq_build_affinity_masks(affd, curvec, affvecs,
> -					    node_to_cpumask, cpu_present_mask,
> -					    nmsk, masks);
> +	/*
> +	 * Spread on present CPUs starting from affd->pre_vectors. If we
> +	 * have multiple sets, build each sets affinity mask separately.
> +	 */
> +	nr_sets = affd->nr_sets;
> +	if (!nr_sets)
> +		nr_sets = 1;
> +
> +	for (i = 0, usedvecs = 0; i < nr_sets; i++) {
> +		int this_vecs = affd->sets ? affd->sets[i] : affvecs;
> +		int nr;
> +
> +		nr = irq_build_affinity_masks(affd, curvec, this_vecs,
> +					      node_to_cpumask, cpu_present_mask,
> +					      nmsk, masks + usedvecs);
> +		usedvecs += nr;
> +	}

While the code below returns the appropriate number of possible vectors
when a set requested too many, the above code is still using the value
from the set, which may exceed 'nvecs' used to kcalloc 'masks', so
'masks + usedvecs' may go out of bounds.

>  	/*
>  	 * Spread on non present CPUs starting from the next vector to be
> @@ -258,13 +272,21 @@ int irq_calc_affinity_vectors(int minvec, int maxvec, const struct irq_affinity
>  {
>  	int resv = affd->pre_vectors + affd->post_vectors;
>  	int vecs = maxvec - resv;
> -	int ret;
> +	int set_vecs;
>
>  	if (resv > minvec)
>  		return 0;
>
> -	get_online_cpus();
> -	ret = min_t(int, cpumask_weight(cpu_possible_mask), vecs) + resv;
> -	put_online_cpus();
> -	return ret;
> +	if (affd->nr_sets) {
> +		int i;
> +
> +		for (i = 0, set_vecs = 0; i < affd->nr_sets; i++)
> +			set_vecs += affd->sets[i];
> +	} else {
> +		get_online_cpus();
> +		set_vecs = cpumask_weight(cpu_possible_mask);
> +		put_online_cpus();
> +	}
> +
> +	return resv + min(set_vecs, vecs);
>  }

On 10/30/18 8:26 AM, Keith Busch wrote:
> On Mon, Oct 29, 2018 at 10:37:35AM -0600, Jens Axboe wrote:
>> diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
>> index f4f29b9d90ee..2046a0f0f0f1 100644
>> --- a/kernel/irq/affinity.c
>> +++ b/kernel/irq/affinity.c
>> @@ -180,6 +180,7 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
>>  	int curvec, usedvecs;
>>  	cpumask_var_t nmsk, npresmsk, *node_to_cpumask;
>>  	struct cpumask *masks = NULL;
>> +	int i, nr_sets;
>>
>>  	/*
>>  	 * If there aren't any vectors left after applying the pre/post
>> @@ -210,10 +211,23 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
>>  	get_online_cpus();
>>  	build_node_to_cpumask(node_to_cpumask);
>>
>> -	/* Spread on present CPUs starting from affd->pre_vectors */
>> -	usedvecs = irq_build_affinity_masks(affd, curvec, affvecs,
>> -					    node_to_cpumask, cpu_present_mask,
>> -					    nmsk, masks);
>> +	/*
>> +	 * Spread on present CPUs starting from affd->pre_vectors. If we
>> +	 * have multiple sets, build each sets affinity mask separately.
>> +	 */
>> +	nr_sets = affd->nr_sets;
>> +	if (!nr_sets)
>> +		nr_sets = 1;
>> +
>> +	for (i = 0, usedvecs = 0; i < nr_sets; i++) {
>> +		int this_vecs = affd->sets ? affd->sets[i] : affvecs;
>> +		int nr;
>> +
>> +		nr = irq_build_affinity_masks(affd, curvec, this_vecs,
>> +					      node_to_cpumask, cpu_present_mask,
>> +					      nmsk, masks + usedvecs);
>> +		usedvecs += nr;
>> +	}
>
>
> While the code below returns the appropriate number of possible vectors
> when a set requested too many, the above code is still using the value
> from the set, which may exceed 'nvecs' used to kcalloc 'masks', so
> 'masks + usedvecs' may go out of bounds.

How so? nvecs must be the max number of vecs, the sum of the sets can't
exceed that value.

On Tue, Oct 30, 2018 at 08:36:35AM -0600, Jens Axboe wrote: > On 10/30/18 8:26 AM, Keith Busch wrote: > > On Mon, Oct 29, 2018 at 10:37:35AM -0600, Jens Axboe wrote: > >> diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c > >> index f4f29b9d90ee..2046a0f0f0f1 100644 > >> --- a/kernel/irq/affinity.c > >> +++ b/kernel/irq/affinity.c > >> @@ -180,6 +180,7 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd) > >> int curvec, usedvecs; > >> cpumask_var_t nmsk, npresmsk, *node_to_cpumask; > >> struct cpumask *masks = NULL; > >> + int i, nr_sets; > >> > >> /* > >> * If there aren't any vectors left after applying the pre/post > >> @@ -210,10 +211,23 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd) > >> get_online_cpus(); > >> build_node_to_cpumask(node_to_cpumask); > >> > >> - /* Spread on present CPUs starting from affd->pre_vectors */ > >> - usedvecs = irq_build_affinity_masks(affd, curvec, affvecs, > >> - node_to_cpumask, cpu_present_mask, > >> - nmsk, masks); > >> + /* > >> + * Spread on present CPUs starting from affd->pre_vectors. If we > >> + * have multiple sets, build each sets affinity mask separately. > >> + */ > >> + nr_sets = affd->nr_sets; > >> + if (!nr_sets) > >> + nr_sets = 1; > >> + > >> + for (i = 0, usedvecs = 0; i < nr_sets; i++) { > >> + int this_vecs = affd->sets ? affd->sets[i] : affvecs; > >> + int nr; > >> + > >> + nr = irq_build_affinity_masks(affd, curvec, this_vecs, > >> + node_to_cpumask, cpu_present_mask, > >> + nmsk, masks + usedvecs); > >> + usedvecs += nr; > >> + } > > > > > > While the code below returns the appropriate number of possible vectors > > when a set requested too many, the above code is still using the value > > from the set, which may exceed 'nvecs' used to kcalloc 'masks', so > > 'masks + usedvecs' may go out of bounds. > > How so? nvecs must the max number of vecs, the sum of the sets can't > exceed that value. 
'nvecs' is what irq_calc_affinity_vectors() returns, which is the min of either the requested max or the sum of the set, and the sum of the set isn't guaranteed to be the smaller value.
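
To make the arithmetic in this reply concrete, here is a minimal user-space sketch. `struct affinity_desc`, `set_sum()` and `calc_affinity_vectors()` are simplified stand-ins written for this note, not the kernel's `struct irq_affinity` or `irq_calc_affinity_vectors()`. Only the `resv + min(set_vecs, vecs)` arithmetic is taken from the quoted patch, and the branch that counts possible CPUs when no sets are given is left out.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for struct irq_affinity (hypothetical, user space). */
struct affinity_desc {
	int pre_vectors;
	int post_vectors;
	int nr_sets;
	const int *sets;
};

/* Sum of all set sizes, as the patch computes set_vecs. */
static int set_sum(const struct affinity_desc *affd)
{
	int i, sum = 0;

	for (i = 0; i < affd->nr_sets; i++)
		sum += affd->sets[i];
	return sum;
}

/* Mirrors the patch's arithmetic for the nr_sets != 0 case only. */
static int calc_affinity_vectors(int minvec, int maxvec,
				 const struct affinity_desc *affd)
{
	int resv = affd->pre_vectors + affd->post_vectors;
	int vecs = maxvec - resv;
	int set_vecs;

	assert(affd->nr_sets > 0 && affd->sets != NULL);
	if (resv > minvec)
		return 0;

	set_vecs = set_sum(affd);
	/* The smaller of the two wins; the set sizes themselves are untouched. */
	return resv + (set_vecs < vecs ? set_vecs : vecs);
}
```

With one reserved vector, sets of 4 and 4, and a maximum of 6, this returns 6, while the per-set loop would still try to build 1 + 8 = 9 masks. That gap is the overrun being described.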
On 10/30/18 8:45 AM, Keith Busch wrote: > On Tue, Oct 30, 2018 at 08:36:35AM -0600, Jens Axboe wrote: >> On 10/30/18 8:26 AM, Keith Busch wrote: >>> On Mon, Oct 29, 2018 at 10:37:35AM -0600, Jens Axboe wrote: >>>> diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c >>>> index f4f29b9d90ee..2046a0f0f0f1 100644 >>>> --- a/kernel/irq/affinity.c >>>> +++ b/kernel/irq/affinity.c >>>> @@ -180,6 +180,7 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd) >>>> int curvec, usedvecs; >>>> cpumask_var_t nmsk, npresmsk, *node_to_cpumask; >>>> struct cpumask *masks = NULL; >>>> + int i, nr_sets; >>>> >>>> /* >>>> * If there aren't any vectors left after applying the pre/post >>>> @@ -210,10 +211,23 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd) >>>> get_online_cpus(); >>>> build_node_to_cpumask(node_to_cpumask); >>>> >>>> - /* Spread on present CPUs starting from affd->pre_vectors */ >>>> - usedvecs = irq_build_affinity_masks(affd, curvec, affvecs, >>>> - node_to_cpumask, cpu_present_mask, >>>> - nmsk, masks); >>>> + /* >>>> + * Spread on present CPUs starting from affd->pre_vectors. If we >>>> + * have multiple sets, build each sets affinity mask separately. >>>> + */ >>>> + nr_sets = affd->nr_sets; >>>> + if (!nr_sets) >>>> + nr_sets = 1; >>>> + >>>> + for (i = 0, usedvecs = 0; i < nr_sets; i++) { >>>> + int this_vecs = affd->sets ? affd->sets[i] : affvecs; >>>> + int nr; >>>> + >>>> + nr = irq_build_affinity_masks(affd, curvec, this_vecs, >>>> + node_to_cpumask, cpu_present_mask, >>>> + nmsk, masks + usedvecs); >>>> + usedvecs += nr; >>>> + } >>> >>> >>> While the code below returns the appropriate number of possible vectors >>> when a set requested too many, the above code is still using the value >>> from the set, which may exceed 'nvecs' used to kcalloc 'masks', so >>> 'masks + usedvecs' may go out of bounds. >> >> How so? nvecs must the max number of vecs, the sum of the sets can't >> exceed that value. 
>
> 'nvecs' is what irq_calc_affinity_vectors() returns, which is the min
> of either the requested max or the sum of the set, and the sum of the set
> isn't guaranteed to be the smaller value.

The sum of the set can't exceed the nvecs passed in, the nvecs passed in
should be less than or equal to nvecs. Granted this isn't enforced,
and perhaps that should be the case.

On Tue, Oct 30, 2018 at 08:53:37AM -0600, Jens Axboe wrote:
> The sum of the set can't exceed the nvecs passed in, the nvecs passed in
> should be less than or equal to nvecs. Granted this isn't enforced,
> and perhaps that should be the case.

That should at least initially be true for a proper functioning
driver. It's not enforced as you mentioned, but that's only related to
the issue I'm referring to.

The problem is pci_alloc_irq_vectors_affinity() takes a range, min_vecs
and max_vecs, but a range of allowable vector allocations doesn't make
sense when using sets.

On 10/30/18 9:08 AM, Keith Busch wrote:
> On Tue, Oct 30, 2018 at 08:53:37AM -0600, Jens Axboe wrote:
>> The sum of the set can't exceed the nvecs passed in, the nvecs passed in
>> should be less than or equal to nvecs. Granted this isn't enforced,
>> and perhaps that should be the case.
>
> That should at least initially be true for a proper functioning
> driver. It's not enforced as you mentioned, but that's only related to
> the issue I'm referring to.
>
> The problem is pci_alloc_irq_vectors_affinity() takes a range, min_vecs
> and max_vecs, but a range of allowable vector allocations doesn't make
> sense when using sets.

I feel like we're going in circles here, not sure what you feel the
issue is now? The range is fine, whoever uses sets will need to adjust
their sets based on what pci_alloc_irq_vectors_affinity() returns,
if it didn't return the passed in desired max.

On Tue, Oct 30, 2018 at 09:18:05AM -0600, Jens Axboe wrote:
> On 10/30/18 9:08 AM, Keith Busch wrote:
> > On Tue, Oct 30, 2018 at 08:53:37AM -0600, Jens Axboe wrote:
> >> The sum of the set can't exceed the nvecs passed in, the nvecs passed in
> >> should be less than or equal to nvecs. Granted this isn't enforced,
> >> and perhaps that should be the case.
> >
> > That should at least initially be true for a proper functioning
> > driver. It's not enforced as you mentioned, but that's only related to
> > the issue I'm referring to.
> >
> > The problem is pci_alloc_irq_vectors_affinity() takes a range, min_vecs
> > and max_vecs, but a range of allowable vector allocations doesn't make
> > sense when using sets.
>
> I feel like we're going in circles here, not sure what you feel the
> issue is now? The range is fine, whoever uses sets will need to adjust
> their sets based on what pci_alloc_irq_vectors_affinity() returns,
> if it didn't return the passed in desired max.

Sorry, let me try again.

pci_alloc_irq_vectors_affinity() starts at the provided max_vecs. If
that doesn't work, it will iterate down to min_vecs without returning to
the caller. The caller doesn't have a chance to adjust its sets between
iterations when you provide a range.

The 'masks' overrun problem happens if the caller provides min_vecs
as a smaller value than the sum of the set (plus any reserved).

If it's up to the caller to ensure that doesn't happen, then min and
max must both be the same value, and that value must also be the same as
the set sum + reserved vectors. The range just becomes redundant since
it is already bounded by the set.

Using the nvme example, it would need something like this to prevent the
'masks' overrun:

---
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index a8747b956e43..625eff570eaa 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2120,7 +2120,7 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
 	 * setting up the full range we need.
 	 */
 	pci_free_irq_vectors(pdev);
-	result = pci_alloc_irq_vectors_affinity(pdev, 1, nr_io_queues,
+	result = pci_alloc_irq_vectors_affinity(pdev, nr_io_queues, nr_io_queues,
 			PCI_IRQ_ALL_TYPES | PCI_IRQ_AFFINITY, &affd);
 	if (result <= 0)
 		return -EIO;
--

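
The failure mode described in this message can be modelled in a few lines of user-space C. This is a toy sketch under stated assumptions: `alloc_range()`, `masks_overrun()` and the `hw_limit` parameter are invented for illustration, and the real MSI/MSI-X code does not necessarily step down one vector at a time. The model only captures that the caller is not consulted between attempts, so its set sizes stay fixed while the vector count shrinks.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy model (not kernel code): try every count from maxvec down to minvec
 * and return the first one the "hardware" can satisfy, or -1 if none fits.
 * hw_limit is a hypothetical cap on what the device can provide.
 */
static int alloc_range(int minvec, int maxvec, int hw_limit)
{
	int nvecs;

	assert(minvec >= 1 && minvec <= maxvec);
	for (nvecs = maxvec; nvecs >= minvec; nvecs--) {
		if (nvecs <= hw_limit)
			return nvecs;
	}
	return -1;
}

/*
 * True if building one mask per set entry, after the reserved vectors,
 * would write past an array that holds only nvecs masks.
 */
static bool masks_overrun(int nvecs, int reserved, const int *sets, int nr_sets)
{
	int i, needed = reserved;

	for (i = 0; i < nr_sets; i++)
		needed += sets[i];
	return needed > nvecs;
}
```

Asking for the range 1..9 on a device capped at 6 vectors yields 6, yet sets of 4 and 4 plus one reserved vector still need 9 masks. Asking for exactly 9 fails outright instead, which hands the decision back to the caller.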
On 10/30/18 10:02 AM, Keith Busch wrote:
> On Tue, Oct 30, 2018 at 09:18:05AM -0600, Jens Axboe wrote:
>> On 10/30/18 9:08 AM, Keith Busch wrote:
>>> On Tue, Oct 30, 2018 at 08:53:37AM -0600, Jens Axboe wrote:
>>>> The sum of the set can't exceed the nvecs passed in, the nvecs passed in
>>>> should be less than or equal to nvecs. Granted this isn't enforced,
>>>> and perhaps that should be the case.
>>>
>>> That should at least initially be true for a proper functioning
>>> driver. It's not enforced as you mentioned, but that's only related to
>>> the issue I'm referring to.
>>>
>>> The problem is pci_alloc_irq_vectors_affinity() takes a range, min_vecs
>>> and max_vecs, but a range of allowable vector allocations doesn't make
>>> sense when using sets.
>>
>> I feel like we're going in circles here, not sure what you feel the
>> issue is now? The range is fine, whoever uses sets will need to adjust
>> their sets based on what pci_alloc_irq_vectors_affinity() returns,
>> if it didn't return the passed in desired max.
>
> Sorry, let me try again.
>
> pci_alloc_irq_vectors_affinity() starts at the provided max_vecs. If
> that doesn't work, it will iterate down to min_vecs without returning to
> the caller. The caller doesn't have a chance to adjust its sets between
> iterations when you provide a range.
>
> The 'masks' overrun problem happens if the caller provides min_vecs
> as a smaller value than the sum of the set (plus any reserved).
>
> If it's up to the caller to ensure that doesn't happen, then min and
> max must both be the same value, and that value must also be the same as
> the set sum + reserved vectors. The range just becomes redundant since
> it is already bounded by the set.
>
> Using the nvme example, it would need something like this to prevent the
> 'masks' overrun:

OK, now I hear what you are saying. And you are right, the caller needs
to provide minvec == maxvec for sets, and then have a loop around that
to adjust as needed.

I'll make that change in nvme.

On 10/30/18 10:42 AM, Jens Axboe wrote:
> On 10/30/18 10:02 AM, Keith Busch wrote:
>> On Tue, Oct 30, 2018 at 09:18:05AM -0600, Jens Axboe wrote:
>>> On 10/30/18 9:08 AM, Keith Busch wrote:
>>>> On Tue, Oct 30, 2018 at 08:53:37AM -0600, Jens Axboe wrote:
>>>>> The sum of the set can't exceed the nvecs passed in, the nvecs passed in
>>>>> should be less than or equal to nvecs. Granted this isn't enforced,
>>>>> and perhaps that should be the case.
>>>>
>>>> That should at least initially be true for a proper functioning
>>>> driver. It's not enforced as you mentioned, but that's only related to
>>>> the issue I'm referring to.
>>>>
>>>> The problem is pci_alloc_irq_vectors_affinity() takes a range, min_vecs
>>>> and max_vecs, but a range of allowable vector allocations doesn't make
>>>> sense when using sets.
>>>
>>> I feel like we're going in circles here, not sure what you feel the
>>> issue is now? The range is fine, whoever uses sets will need to adjust
>>> their sets based on what pci_alloc_irq_vectors_affinity() returns,
>>> if it didn't return the passed in desired max.
>>
>> Sorry, let me try again.
>>
>> pci_alloc_irq_vectors_affinity() starts at the provided max_vecs. If
>> that doesn't work, it will iterate down to min_vecs without returning to
>> the caller. The caller doesn't have a chance to adjust its sets between
>> iterations when you provide a range.
>>
>> The 'masks' overrun problem happens if the caller provides min_vecs
>> as a smaller value than the sum of the set (plus any reserved).
>>
>> If it's up to the caller to ensure that doesn't happen, then min and
>> max must both be the same value, and that value must also be the same as
>> the set sum + reserved vectors. The range just becomes redundant since
>> it is already bounded by the set.
>>
>> Using the nvme example, it would need something like this to prevent the
>> 'masks' overrun:
>
> OK, now I hear what you are saying. And you are right, the caller needs
> to provide minvec == maxvec for sets, and then have a loop around that
> to adjust as needed.
>
> I'll make that change in nvme.

Pretty trivial, below. This also keeps the queue mapping calculations
more clean, as we don't have to do one after we're done allocating
IRQs.


commit e8a35d023a192e34540c60f779fe755970b8eeb2
Author: Jens Axboe <axboe@kernel.dk>
Date:   Tue Oct 30 11:06:29 2018 -0600

    nvme: utilize two queue maps, one for reads and one for writes

    NVMe does round-robin between queues by default, which means that
    sharing a queue map for both reads and writes can be problematic
    in terms of read servicing. It's much easier to flood the queue
    with writes and reduce the read servicing.

    Implement two queue maps, one for reads and one for writes. The
    write queue count is configurable through the 'write_queues'
    parameter.

    By default, we retain the previous behavior of having a single
    queue set, shared between reads and writes. Setting 'write_queues'
    to a non-zero value will create two queue sets, one for reads and
    one for writes, the latter using the configurable number of
    queues (hardware queue counts permitting).

    Reviewed-by: Hannes Reinecke <hare@suse.com>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index e5d783cb6937..17170686105f 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -74,11 +74,29 @@ static int io_queue_depth = 1024;
 module_param_cb(io_queue_depth, &io_queue_depth_ops, &io_queue_depth, 0644);
 MODULE_PARM_DESC(io_queue_depth, "set io queue depth, should >= 2");
 
+static int queue_count_set(const char *val, const struct kernel_param *kp);
+static const struct kernel_param_ops queue_count_ops = {
+	.set = queue_count_set,
+	.get = param_get_int,
+};
+
+static int write_queues;
+module_param_cb(write_queues, &queue_count_ops, &write_queues, 0644);
+MODULE_PARM_DESC(write_queues,
+	"Number of queues to use for writes. If not set, reads and writes "
+	"will share a queue set.");
+
 struct nvme_dev;
 struct nvme_queue;
 
 static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown);
 
+enum {
+	NVMEQ_TYPE_READ,
+	NVMEQ_TYPE_WRITE,
+	NVMEQ_TYPE_NR,
+};
+
 /*
  * Represents an NVM Express device. Each nvme_dev is a PCI function.
  */
@@ -92,6 +110,7 @@ struct nvme_dev {
 	struct dma_pool *prp_small_pool;
 	unsigned online_queues;
 	unsigned max_qid;
+	unsigned io_queues[NVMEQ_TYPE_NR];
 	unsigned int num_vecs;
 	int q_depth;
 	u32 db_stride;
@@ -134,6 +153,17 @@ static int io_queue_depth_set(const char *val, const struct kernel_param *kp)
 	return param_set_int(val, kp);
 }
 
+static int queue_count_set(const char *val, const struct kernel_param *kp)
+{
+	int n = 0, ret;
+
+	ret = kstrtoint(val, 10, &n);
+	if (n > num_possible_cpus())
+		n = num_possible_cpus();
+
+	return param_set_int(val, kp);
+}
+
 static inline unsigned int sq_idx(unsigned int qid, u32 stride)
 {
 	return qid * 2 * stride;
@@ -218,9 +248,20 @@ static inline void _nvme_check_size(void)
 	BUILD_BUG_ON(sizeof(struct nvme_dbbuf) != 64);
 }
 
+static unsigned int max_io_queues(void)
+{
+	return num_possible_cpus() + write_queues;
+}
+
+static unsigned int max_queue_count(void)
+{
+	/* IO queues + admin queue */
+	return 1 + max_io_queues();
+}
+
 static inline unsigned int nvme_dbbuf_size(u32 stride)
 {
-	return ((num_possible_cpus() + 1) * 8 * stride);
+	return (max_queue_count() * 8 * stride);
 }
 
 static int nvme_dbbuf_dma_alloc(struct nvme_dev *dev)
@@ -431,12 +472,41 @@ static int nvme_init_request(struct blk_mq_tag_set *set, struct request *req,
 	return 0;
 }
 
+static int queue_irq_offset(struct nvme_dev *dev)
+{
+	/* if we have more than 1 vec, admin queue offsets us 1 */
+	if (dev->num_vecs > 1)
+		return 1;
+
+	return 0;
+}
+
 static int nvme_pci_map_queues(struct blk_mq_tag_set *set)
 {
 	struct nvme_dev *dev = set->driver_data;
+	int i, qoff, offset;
+
+	offset = queue_irq_offset(dev);
+	for (i = 0, qoff = 0; i < set->nr_maps; i++) {
+		struct blk_mq_queue_map *map = &set->map[i];
 
-	return blk_mq_pci_map_queues(&set->map[0], to_pci_dev(dev->dev),
-			dev->num_vecs > 1 ? 1 /* admin queue */ : 0);
+		map->nr_queues = dev->io_queues[i];
+		if (!map->nr_queues) {
+			BUG_ON(i == NVMEQ_TYPE_READ);
+
+			/* shared set, resuse read set parameters */
+			map->nr_queues = dev->io_queues[NVMEQ_TYPE_READ];
+			qoff = 0;
+			offset = queue_irq_offset(dev);
+		}
+
+		map->queue_offset = qoff;
+		blk_mq_pci_map_queues(map, to_pci_dev(dev->dev), offset);
+		qoff += map->nr_queues;
+		offset += map->nr_queues;
+	}
+
+	return 0;
 }
 
 /**
@@ -849,6 +919,14 @@ static blk_status_t nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
 	return ret;
 }
 
+static int nvme_flags_to_type(struct request_queue *q, unsigned int flags)
+{
+	if ((flags & REQ_OP_MASK) == REQ_OP_READ)
+		return NVMEQ_TYPE_READ;
+
+	return NVMEQ_TYPE_WRITE;
+}
+
 static void nvme_pci_complete_rq(struct request *req)
 {
 	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
@@ -1476,6 +1554,7 @@ static const struct blk_mq_ops nvme_mq_admin_ops = {
 
 static const struct blk_mq_ops nvme_mq_ops = {
 	.queue_rq		= nvme_queue_rq,
+	.flags_to_type		= nvme_flags_to_type,
 	.complete		= nvme_pci_complete_rq,
 	.init_hctx		= nvme_init_hctx,
 	.init_request		= nvme_init_request,
@@ -1888,18 +1967,53 @@ static int nvme_setup_host_mem(struct nvme_dev *dev)
 	return ret;
 }
 
+static void nvme_calc_io_queues(struct nvme_dev *dev, unsigned int nr_io_queues)
+{
+	unsigned int this_w_queues = write_queues;
+
+	/*
+	 * Setup read/write queue split
+	 */
+	if (nr_io_queues == 1) {
+		dev->io_queues[NVMEQ_TYPE_READ] = 1;
+		dev->io_queues[NVMEQ_TYPE_WRITE] = 0;
+		return;
+	}
+
+	/*
+	 * If 'write_queues' is set, ensure it leaves room for at least
+	 * one read queue
+	 */
+	if (this_w_queues >= nr_io_queues)
+		this_w_queues = nr_io_queues - 1;
+
+	/*
+	 * If 'write_queues' is set to zero, reads and writes will share
+	 * a queue set.
+	 */
+	if (!this_w_queues) {
+		dev->io_queues[NVMEQ_TYPE_WRITE] = 0;
+		dev->io_queues[NVMEQ_TYPE_READ] = nr_io_queues;
+	} else {
+		dev->io_queues[NVMEQ_TYPE_WRITE] = this_w_queues;
+		dev->io_queues[NVMEQ_TYPE_READ] = nr_io_queues - this_w_queues;
+	}
+}
+
 static int nvme_setup_io_queues(struct nvme_dev *dev)
 {
 	struct nvme_queue *adminq = &dev->queues[0];
 	struct pci_dev *pdev = to_pci_dev(dev->dev);
 	int result, nr_io_queues;
 	unsigned long size;
-
+	int irq_sets[2];
 	struct irq_affinity affd = {
-		.pre_vectors = 1
+		.pre_vectors = 1,
+		.nr_sets = ARRAY_SIZE(irq_sets),
+		.sets = irq_sets,
 	};
 
-	nr_io_queues = num_possible_cpus();
+	nr_io_queues = max_io_queues();
 	result = nvme_set_queue_count(&dev->ctrl, &nr_io_queues);
 	if (result < 0)
 		return result;
@@ -1934,13 +2048,48 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
 	 * setting up the full range we need.
 	 */
 	pci_free_irq_vectors(pdev);
-	result = pci_alloc_irq_vectors_affinity(pdev, 1, nr_io_queues + 1,
-			PCI_IRQ_ALL_TYPES | PCI_IRQ_AFFINITY, &affd);
-	if (result <= 0)
-		return -EIO;
+
+	/*
+	 * For irq sets, we have to ask for minvec == maxvec. This passes
+	 * any reduction back to us, so we can adjust our queue counts and
+	 * IRQ vector needs.
+	 */
+	do {
+		nvme_calc_io_queues(dev, nr_io_queues);
+		irq_sets[0] = dev->io_queues[NVMEQ_TYPE_READ];
+		irq_sets[1] = dev->io_queues[NVMEQ_TYPE_WRITE];
+		if (!irq_sets[1])
+			affd.nr_sets = 1;
+
+		/*
+		 * Need IRQs for read+write queues, and one for the admin queue
+		 */
+		nr_io_queues = irq_sets[0] + irq_sets[1] + 1;
+
+		result = pci_alloc_irq_vectors_affinity(pdev, nr_io_queues,
+				nr_io_queues,
+				PCI_IRQ_ALL_TYPES | PCI_IRQ_AFFINITY, &affd);
+
+		/*
+		 * Need to reduce our vec counts
+		 */
+		if (result == -ENOSPC) {
+			nr_io_queues--;
+			if (!nr_io_queues)
+				return result;
+			continue;
+		} else if (result <= 0)
+			return -EIO;
+		break;
+	} while (1);
+
 	dev->num_vecs = result;
 	dev->max_qid = max(result - 1, 1);
 
+	dev_info(dev->ctrl.device, "%d/%d/%d read/write queues\n",
+					dev->io_queues[NVMEQ_TYPE_READ],
+					dev->io_queues[NVMEQ_TYPE_WRITE]);
+
 	/*
 	 * Should investigate if there's a performance win from allocating
 	 * more queues than interrupt vectors; it might allow the submission
@@ -2042,6 +2191,7 @@ static int nvme_dev_add(struct nvme_dev *dev)
 	if (!dev->ctrl.tagset) {
 		dev->tagset.ops = &nvme_mq_ops;
 		dev->tagset.nr_hw_queues = dev->online_queues - 1;
+		dev->tagset.nr_maps = NVMEQ_TYPE_NR;
 		dev->tagset.timeout = NVME_IO_TIMEOUT;
 		dev->tagset.numa_node = dev_to_node(dev->dev);
 		dev->tagset.queue_depth =
@@ -2489,8 +2639,8 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	if (!dev)
 		return -ENOMEM;
 
-	dev->queues = kcalloc_node(num_possible_cpus() + 1,
-			sizeof(struct nvme_queue), GFP_KERNEL, node);
+	dev->queues = kcalloc_node(max_queue_count(), sizeof(struct nvme_queue),
+					GFP_KERNEL, node);
 	if (!dev->queues)
 		goto free;

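
The read/write split performed by `nvme_calc_io_queues()` in the patch above can be exercised on its own. The following is a user-space restatement, not the driver code: `struct rw_split` and `split_io_queues()` are names made up for this sketch, and the module parameter is passed in as the `want_write` argument instead of being read from a global.

```c
#include <assert.h>

/* Hypothetical result type; the driver stores these in dev->io_queues[]. */
struct rw_split {
	unsigned int read_queues;
	unsigned int write_queues;
};

/*
 * Split nr_io_queues into a read set and a write set.
 * want_write plays the role of the 'write_queues' module parameter.
 */
static struct rw_split split_io_queues(unsigned int nr_io_queues,
				       unsigned int want_write)
{
	struct rw_split s;

	assert(nr_io_queues >= 1);

	/* A single queue is always a shared read set. */
	if (nr_io_queues == 1) {
		s.read_queues = 1;
		s.write_queues = 0;
		return s;
	}

	/* Leave room for at least one read queue. */
	if (want_write >= nr_io_queues)
		want_write = nr_io_queues - 1;

	/* want_write == 0 means reads and writes share one set. */
	s.write_queues = want_write;
	s.read_queues = nr_io_queues - want_write;
	return s;
}
```

For example, 8 queues with `want_write` of 2 gives 6 read and 2 write queues, and asking for 16 write queues out of 4 is clamped to 1 read and 3 write. The two resulting counts are what the driver copies into `irq_sets[0]` and `irq_sets[1]` before each allocation attempt.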
On Tue, Oct 30, 2018 at 11:09:04AM -0600, Jens Axboe wrote:
> Pretty trivial, below. This also keeps the queue mapping calculations
> more clean, as we don't have to do one after we're done allocating
> IRQs.

Yep, this addresses my concern. It's less efficient than PCI since PCI
can usually jump straight to a valid vector count in a single iteration
where this only subtracts by 1. I really can't be bothered to care for
optimizing that, so this works for me! :)

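
The cost of stepping down by one can be counted with a small model. This is a sketch under assumptions: `alloc_exact()` is a hypothetical allocator that either grants exactly the requested count or returns `-ENOSPC`, and `alloc_with_retry()` is a simplified version of the driver's loop that leaves out the per-iteration read/write recalculation.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical allocator: succeeds only if the exact count fits hw_limit. */
static int alloc_exact(int nvecs, int hw_limit)
{
	return nvecs <= hw_limit ? nvecs : -ENOSPC;
}

/*
 * Ask for exactly `want` vectors, dropping by one after each -ENOSPC.
 * Stores the number of attempts in *attempts and returns the count
 * obtained, or a negative errno if nothing could be allocated.
 */
static int alloc_with_retry(int want, int hw_limit, int *attempts)
{
	int ret;

	assert(attempts);
	*attempts = 0;
	while (want > 0) {
		(*attempts)++;
		ret = alloc_exact(want, hw_limit);
		if (ret != -ENOSPC)
			return ret;
		want--;
	}
	return -ENOSPC;
}
```

Starting at 17 with a device limit of 9 takes 9 attempts in this model, where a single jump to the supported count would take one. Since this runs once per device at probe time, the extra attempts are cheap.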
Jens,

On Tue, 30 Oct 2018, Jens Axboe wrote:
> On 10/30/18 10:02 AM, Keith Busch wrote:
> > pci_alloc_irq_vectors_affinity() starts at the provided max_vecs. If
> > that doesn't work, it will iterate down to min_vecs without returning to
> > the caller. The caller doesn't have a chance to adjust its sets between
> > iterations when you provide a range.
> >
> > The 'masks' overrun problem happens if the caller provides min_vecs
> > as a smaller value than the sum of the set (plus any reserved).
> >
> > If it's up to the caller to ensure that doesn't happen, then min and
> > max must both be the same value, and that value must also be the same as
> > the set sum + reserved vectors. The range just becomes redundant since
> > it is already bounded by the set.
> >
> > Using the nvme example, it would need something like this to prevent the
> > 'masks' overrun:
>
> OK, now I hear what you are saying. And you are right, the caller needs
> to provide minvec == maxvec for sets, and then have a loop around that
> to adjust as needed.

But then we should enforce it in the core code, right?

Thanks,

	tglx

On 10/30/18 11:22 AM, Keith Busch wrote:
> On Tue, Oct 30, 2018 at 11:09:04AM -0600, Jens Axboe wrote:
>> Pretty trivial, below. This also keeps the queue mapping calculations
>> more clean, as we don't have to do one after we're done allocating
>> IRQs.
>
> Yep, this addresses my concern. It's less efficient than PCI since PCI
> can usually jump straight to a valid vector count in a single iteration
> where this only subtracts by 1. I really can't be bothered to care for
> optimizing that, so this works for me! :)

It definitely is less efficient than just getting the count that we
can support, but it's at probe time so I could not really be bothered
either.

Can I add your reviewed-by?

On 10/30/18 11:25 AM, Thomas Gleixner wrote: > Jens, > > On Tue, 30 Oct 2018, Jens Axboe wrote: >> On 10/30/18 10:02 AM, Keith Busch wrote: >>> pci_alloc_irq_vectors_affinity() starts at the provided max_vecs. If >>> that doesn't work, it will iterate down to min_vecs without returning to >>> the caller. The caller doesn't have a chance to adjust its sets between >>> iterations when you provide a range. >>> >>> The 'masks' overrun problem happens if the caller provides min_vecs >>> as a smaller value than the sum of the set (plus any reserved). >>> >>> If it's up to the caller to ensure that doesn't happen, then min and >>> max must both be the same value, and that value must also be the same as >>> the set sum + reserved vectors. The range just becomes redundant since >>> it is already bounded by the set. >>> >>> Using the nvme example, it would need something like this to prevent the >>> 'masks' overrun: >> >> OK, now I hear what you are saying. And you are right, the callers needs >> to provide minvec == maxvec for sets, and then have a loop around that >> to adjust as needed. > > But then we should enforce it in the core code, right? Yes, I was going to ask you if you want a followup patch for that, or an updated version of the original?
On Tue, Oct 30, 2018 at 11:33:51AM -0600, Jens Axboe wrote: > On 10/30/18 11:22 AM, Keith Busch wrote: > > On Tue, Oct 30, 2018 at 11:09:04AM -0600, Jens Axboe wrote: > >> Pretty trivial, below. This also keeps the queue mapping calculations > >> more clean, as we don't have to do one after we're done allocating > >> IRQs. > > > > Yep, this addresses my concern. It less efficient than PCI since PCI > > can usually jump straight to a valid vector count in a single iteration > > where this only subtracts by 1. I really can't be bothered to care for > > optimizing that, so this works for me! :) > > It definitely is less efficient than just getting the count that we > can support, but it's at probe time so I could not really be bothered > either. > > Can I add your reviewed-by? Yes, please. Reviewed-by: Keith Busch <keith.busch@intel.com> > -- > Jens Axboe
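For readers following along, the retry pattern discussed in this sub-thread (ask for exactly sum-of-sets plus reserved vectors, and drop one queue on failure) can be sketched in plain userspace C. This is only an illustration under stated assumptions: `try_alloc_exact()`, `alloc_with_sets()`, the two-way set split, and the `device_limit` parameter are hypothetical stand-ins, not the nvme driver code or any kernel API.

```c
#include <assert.h>

/*
 * Hypothetical stand-in for an allocation call made with
 * min_vecs == max_vecs: it succeeds only if the simulated device can
 * supply exactly nvecs vectors. Not a kernel API.
 */
static int try_alloc_exact(int nvecs, int device_limit)
{
	return nvecs <= device_limit ? nvecs : -1;
}

/*
 * Split nr_io_queues into two sets (an assumed even-ish split), request
 * exactly sets[0] + sets[1] + resv vectors, and retry with one queue
 * fewer each time the request fails. Returns the number of vectors
 * obtained, or -1 if nothing fits.
 */
static int alloc_with_sets(int nr_io_queues, int resv, int device_limit,
			   int sets[2])
{
	assert(resv >= 0 && device_limit >= 0);

	while (nr_io_queues > 0) {
		int total;

		sets[0] = nr_io_queues / 2;		/* first set */
		sets[1] = nr_io_queues - sets[0];	/* second set */
		total = sets[0] + sets[1] + resv;

		/* The sets are recomputed before every attempt. */
		if (try_alloc_exact(total, device_limit) == total)
			return total;

		nr_io_queues--;	/* subtract by one, as noted above */
	}

	return -1;
}
```

The point of the sketch is that the caller, not the allocator, walks down the vector count, so the set sizes always match the count being requested. As noted in the thread, this takes more iterations than jumping straight to a supported count, but it only runs at probe time.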
On 10/30/18 11:34 AM, Jens Axboe wrote: > On 10/30/18 11:25 AM, Thomas Gleixner wrote: >> Jens, >> >> On Tue, 30 Oct 2018, Jens Axboe wrote: >>> On 10/30/18 10:02 AM, Keith Busch wrote: >>>> pci_alloc_irq_vectors_affinity() starts at the provided max_vecs. If >>>> that doesn't work, it will iterate down to min_vecs without returning to >>>> the caller. The caller doesn't have a chance to adjust its sets between >>>> iterations when you provide a range. >>>> >>>> The 'masks' overrun problem happens if the caller provides min_vecs >>>> as a smaller value than the sum of the set (plus any reserved). >>>> >>>> If it's up to the caller to ensure that doesn't happen, then min and >>>> max must both be the same value, and that value must also be the same as >>>> the set sum + reserved vectors. The range just becomes redundant since >>>> it is already bounded by the set. >>>> >>>> Using the nvme example, it would need something like this to prevent the >>>> 'masks' overrun: >>> >>> OK, now I hear what you are saying. And you are right, the callers needs >>> to provide minvec == maxvec for sets, and then have a loop around that >>> to adjust as needed. >> >> But then we should enforce it in the core code, right? > > Yes, I was going to ask you if you want a followup patch for that, or > an updated version of the original? Here's an incremental, I'm going to fold this into the original unless I hear otherwise.

diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index af24ed50a245..e6c6e10b9ceb 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -1036,6 +1036,13 @@ static int __pci_enable_msi_range(struct pci_dev *dev, int minvec, int maxvec,
 	if (maxvec < minvec)
 		return -ERANGE;
 
+	/*
+	 * If the caller is passing in sets, we can't support a range of
+	 * vectors. The caller needs to handle that.
+	 */
+	if (affd->nr_sets && minvec != maxvec)
+		return -EINVAL;
+
 	if (WARN_ON_ONCE(dev->msi_enabled))
 		return -EINVAL;
 
@@ -1087,6 +1094,13 @@ static int __pci_enable_msix_range(struct pci_dev *dev,
 	if (maxvec < minvec)
 		return -ERANGE;
 
+	/*
+	 * If the caller is passing in sets, we can't support a range of
+	 * supported vectors. The caller needs to handle that.
+	 */
+	if (affd->nr_sets && minvec != maxvec)
+		return -EINVAL;
+
 	if (WARN_ON_ONCE(dev->msix_enabled))
 		return -EINVAL;
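The rule enforced by this incremental (a range of vectors is rejected whenever sets are passed in) can be tried out in isolation with a small userspace sketch. The struct and function names here (`struct affd_check`, `check_vector_range()`) are hypothetical mirrors for illustration only. The NULL test on the descriptor is an assumption added by this sketch; the diff as posted dereferences `affd` directly.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical userspace mirror of struct irq_affinity with the new fields. */
struct affd_check {
	int pre_vectors;
	int post_vectors;
	int nr_sets;		/* length of the sets array */
	const int *sets;	/* size of each set */
};

/*
 * Validate a [minvec, maxvec] request the way the incremental does:
 * an inverted range is -ERANGE, and a real range combined with sets
 * is -EINVAL because the core cannot resize the caller's sets.
 */
static int check_vector_range(int minvec, int maxvec,
			      const struct affd_check *affd)
{
	if (maxvec < minvec)
		return -ERANGE;

	/* Assumption of this sketch: tolerate a missing descriptor. */
	if (affd && affd->nr_sets && minvec != maxvec)
		return -EINVAL;

	return 0;
}
```

With two sets of sizes 2 and 3 and one reserved vector, a request for exactly 6 vectors passes, while a request for the range 2..6 is refused and the caller has to shrink its sets and try again.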
On Tue, 30 Oct 2018, Jens Axboe wrote: > On 10/30/18 11:25 AM, Thomas Gleixner wrote: > > Jens, > > > > On Tue, 30 Oct 2018, Jens Axboe wrote: > >> On 10/30/18 10:02 AM, Keith Busch wrote: > >>> pci_alloc_irq_vectors_affinity() starts at the provided max_vecs. If > >>> that doesn't work, it will iterate down to min_vecs without returning to > >>> the caller. The caller doesn't have a chance to adjust its sets between > >>> iterations when you provide a range. > >>> > >>> The 'masks' overrun problem happens if the caller provides min_vecs > >>> as a smaller value than the sum of the set (plus any reserved). > >>> > >>> If it's up to the caller to ensure that doesn't happen, then min and > >>> max must both be the same value, and that value must also be the same as > >>> the set sum + reserved vectors. The range just becomes redundant since > >>> it is already bounded by the set. > >>> > >>> Using the nvme example, it would need something like this to prevent the > >>> 'masks' overrun: > >> > >> OK, now I hear what you are saying. And you are right, the callers needs > >> to provide minvec == maxvec for sets, and then have a loop around that > >> to adjust as needed. > > > > But then we should enforce it in the core code, right? > > Yes, I was going to ask you if you want a followup patch for that, or > an updated version of the original? Updated combo patch would be nice :) Thanks lazytglx
On 10/30/18 11:46 AM, Thomas Gleixner wrote: > On Tue, 30 Oct 2018, Jens Axboe wrote: >> On 10/30/18 11:25 AM, Thomas Gleixner wrote: >>> Jens, >>> >>> On Tue, 30 Oct 2018, Jens Axboe wrote: >>>> On 10/30/18 10:02 AM, Keith Busch wrote: >>>>> pci_alloc_irq_vectors_affinity() starts at the provided max_vecs. If >>>>> that doesn't work, it will iterate down to min_vecs without returning to >>>>> the caller. The caller doesn't have a chance to adjust its sets between >>>>> iterations when you provide a range. >>>>> >>>>> The 'masks' overrun problem happens if the caller provides min_vecs >>>>> as a smaller value than the sum of the set (plus any reserved). >>>>> >>>>> If it's up to the caller to ensure that doesn't happen, then min and >>>>> max must both be the same value, and that value must also be the same as >>>>> the set sum + reserved vectors. The range just becomes redundant since >>>>> it is already bounded by the set. >>>>> >>>>> Using the nvme example, it would need something like this to prevent the >>>>> 'masks' overrun: >>>> >>>> OK, now I hear what you are saying. And you are right, the callers needs >>>> to provide minvec == maxvec for sets, and then have a loop around that >>>> to adjust as needed. >>> >>> But then we should enforce it in the core code, right? >> >> Yes, I was going to ask you if you want a followup patch for that, or >> an updated version of the original? > > Updated combo patch would be nice :) I'll re-post the series with the updated combo some time later today. > lazytglx I understand :-)
diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index 1d6711c28271..ca397ff40836 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -247,10 +247,14 @@ struct irq_affinity_notify {
  *			the MSI(-X) vector space
  * @post_vectors:	Don't apply affinity to @post_vectors at end of
  *			the MSI(-X) vector space
+ * @nr_sets:		Length of passed in *sets array
+ * @sets:		Number of affinitized sets
  */
 struct irq_affinity {
 	int pre_vectors;
 	int post_vectors;
+	int nr_sets;
+	int *sets;
 };
 
 #if defined(CONFIG_SMP)
diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
index f4f29b9d90ee..2046a0f0f0f1 100644
--- a/kernel/irq/affinity.c
+++ b/kernel/irq/affinity.c
@@ -180,6 +180,7 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
 	int curvec, usedvecs;
 	cpumask_var_t nmsk, npresmsk, *node_to_cpumask;
 	struct cpumask *masks = NULL;
+	int i, nr_sets;
 
 	/*
 	 * If there aren't any vectors left after applying the pre/post
@@ -210,10 +211,23 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
 	get_online_cpus();
 	build_node_to_cpumask(node_to_cpumask);
 
-	/* Spread on present CPUs starting from affd->pre_vectors */
-	usedvecs = irq_build_affinity_masks(affd, curvec, affvecs,
-					    node_to_cpumask, cpu_present_mask,
-					    nmsk, masks);
+	/*
+	 * Spread on present CPUs starting from affd->pre_vectors. If we
+	 * have multiple sets, build each sets affinity mask separately.
+	 */
+	nr_sets = affd->nr_sets;
+	if (!nr_sets)
+		nr_sets = 1;
+
+	for (i = 0, usedvecs = 0; i < nr_sets; i++) {
+		int this_vecs = affd->sets ? affd->sets[i] : affvecs;
+		int nr;
+
+		nr = irq_build_affinity_masks(affd, curvec, this_vecs,
+					      node_to_cpumask, cpu_present_mask,
+					      nmsk, masks + usedvecs);
+		usedvecs += nr;
+	}
 
 	/*
 	 * Spread on non present CPUs starting from the next vector to be
@@ -258,13 +272,21 @@ int irq_calc_affinity_vectors(int minvec, int maxvec, const struct irq_affinity
 {
 	int resv = affd->pre_vectors + affd->post_vectors;
 	int vecs = maxvec - resv;
-	int ret;
+	int set_vecs;
 
 	if (resv > minvec)
 		return 0;
 
-	get_online_cpus();
-	ret = min_t(int, cpumask_weight(cpu_possible_mask), vecs) + resv;
-	put_online_cpus();
-	return ret;
+	if (affd->nr_sets) {
+		int i;
+
+		for (i = 0, set_vecs = 0; i < affd->nr_sets; i++)
+			set_vecs += affd->sets[i];
+	} else {
+		get_online_cpus();
+		set_vecs = cpumask_weight(cpu_possible_mask);
+		put_online_cpus();
+	}
+
+	return resv + min(set_vecs, vecs);
 }
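To make the arithmetic in the reworked irq_calc_affinity_vectors() easier to check by hand, here is a userspace sketch of the same calculation. It is an illustration under assumptions, not the kernel function: `struct affd_calc` and `calc_affinity_vectors()` are hypothetical names, and the `nr_possible_cpus` parameter stands in for `cpumask_weight(cpu_possible_mask)`, which is not available outside the kernel.

```c
/* Hypothetical userspace mirror of struct irq_affinity with the new fields. */
struct affd_calc {
	int pre_vectors;
	int post_vectors;
	int nr_sets;		/* length of the sets array */
	const int *sets;	/* size of each set */
};

/*
 * Return how many vectors to ask for: the reserved (pre + post) vectors
 * plus either the sum of all set sizes or, without sets, the number of
 * possible CPUs, capped at what maxvec leaves after the reserved ones.
 */
static int calc_affinity_vectors(int minvec, int maxvec,
				 const struct affd_calc *affd,
				 int nr_possible_cpus)
{
	int resv = affd->pre_vectors + affd->post_vectors;
	int vecs = maxvec - resv;
	int set_vecs = 0;
	int i;

	/* Not even the reserved vectors fit into the minimum. */
	if (resv > minvec)
		return 0;

	if (affd->nr_sets) {
		for (i = 0; i < affd->nr_sets; i++)
			set_vecs += affd->sets[i];
	} else {
		set_vecs = nr_possible_cpus;
	}

	return resv + (set_vecs < vecs ? set_vecs : vecs);
}
```

For example, with one reserved vector and sets of 2 and 3, a request with minvec == maxvec == 6 yields 6, which is why the sum of the sets plus the reserved vectors is the value the caller is expected to pass for both bounds.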