Message ID | 1529912859-10475-3-git-send-email-kernelfans@gmail.com (mailing list archive) |
---|---|
State | New, archived |
Delegated to: | Bjorn Helgaas |
On Mon, Jun 25, 2018 at 03:47:39PM +0800, Pingfan Liu wrote:
> commit 52cdbdd49853 ("driver core: correct device's shutdown order")
> introduces supplier<-consumer order in devices_kset. The commit tries
> to cleverly maintain both parent<-child and supplier<-consumer order by
> reordering a device when probing. This method makes things simple and
> clean, but unfortunately, breaks parent<-child order in some case,
> which is described in next patch in this series.

There is no "next patch in this series" :(

> Here this patch tries to resolve supplier<-consumer by only reordering a
> device when it has suppliers, and takes care of the following scenario:
> [consumer, children] [ ... potential ... ] supplier
>                      ^                   ^
> After moving the consumer and its children after the supplier, the
> potentail section may contain consumers whose supplier is inside
> children, and this poses the requirement to dry out all consumpers in
> the section recursively.
>
> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Cc: Grygorii Strashko <grygorii.strashko@ti.com>
> Cc: Christoph Hellwig <hch@infradead.org>
> Cc: Bjorn Helgaas <helgaas@kernel.org>
> Cc: Dave Young <dyoung@redhat.com>
> Cc: linux-pci@vger.kernel.org
> Cc: linuxppc-dev@lists.ozlabs.org
> Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
> ---
> note: there is lock issue in this patch, should be fixed in next version

Please send patches that you know are correct, why would I want to
review this if you know it is not correct?

And if the original commit is causing problems for you, why not just
revert that instead of adding this much-increased complexity?

>
> ---
>  drivers/base/core.c | 132 ++++++++++++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 129 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/base/core.c b/drivers/base/core.c
> index 66f06ff..db30e86 100644
> --- a/drivers/base/core.c
> +++ b/drivers/base/core.c
> @@ -123,12 +123,138 @@ static int device_is_dependent(struct device *dev, void *target)
>  	return ret;
>  }
>
> -/* a temporary place holder to mark out the root cause of the bug.
> - * The proposal algorithm will come in next patch
> +struct pos_info {
> +	struct device *pos;
> +	struct device *tail;
> +};
> +
> +/* caller takes the devices_kset->list_lock */
> +static int descendants_reorder_after_pos(struct device *dev,
> +		void *data)

Why are you wrapping lines that do not need to be wrapped?

What does this function do?

> +{
> +	struct device *pos;
> +	struct pos_info *p = data;
> +
> +	pos = p->pos;
> +	pr_debug("devices_kset: Moving %s after %s\n",
> +			dev_name(dev), dev_name(pos));

You have a device, use it for debugging, i.e. dev_dbg().

> +	device_for_each_child(dev, p, descendants_reorder_after_pos);

Recursive?

> +	/* children at the tail */
> +	list_move(&dev->kobj.entry, &pos->kobj.entry);
> +	/* record the right boundary of the section */
> +	if (p->tail == NULL)
> +		p->tail = dev;
> +	return 0;
> +}

I really do not understand what the above code is supposed to be doing :(

> +
> +/* iterate over an open section */
> +#define list_opensect_for_each_reverse(cur, left, right) \
> +	for (cur = right->prev; cur == left; cur = cur->prev)
> +
> +static bool is_consumer(struct device *query, struct device *supplier)
> +{
> +	struct device_link *link;
> +	/* todo, lock protection */

Always run checkpatch.pl on patches so you do not get grumpy maintainers
telling you to run checkpatch.pl :(

> +	list_for_each_entry(link, &supplier->links.consumers, s_node)
> +		if (link->consumer == query)
> +			return true;
> +	return false;
> +}
> +
> +/* recursively move the potential consumers in open section (left, right)
> + * after the barrier

What barrier?

I'm stopping here as I have no idea what is going on, and this needs a
lot more work at the basic level of "it handles locking correctly"...

If you are working on this for power9, I'm guessing you work for IBM?
If so, please run this through your internal patch review process before
sending it out again...

thanks,

greg k-h
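The `list_opensect_for_each_reverse()` macro quoted above is meant to visit the entries strictly between `left` and `right`, walking backwards from `right`. As posted, its loop condition is `cur == left`. For a non-empty section, therefore, the body never runs. For an empty section, the body runs on `left` itself, which lies outside the section.

Below is a minimal userspace sketch of the intended walk. It assumes `!=` is the intended condition. `struct node`, `opensect_for_each_reverse()`, `ring_init()` and `collect_open_section()` are hypothetical stand-ins for `struct list_head` and the patch's macro, not kernel APIs.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's circular doubly linked list_head. */
struct node {
	struct node *prev, *next;
	int id;
};

/*
 * Visit every node strictly between left and right, starting next to right
 * and moving towards left. The "!=" stops the walk on reaching left.
 */
#define opensect_for_each_reverse(cur, left, right) \
	for ((cur) = (right)->prev; (cur) != (left); (cur) = (cur)->prev)

/* Link n[0..count-1] into a ring: n[0] <-> n[1] <-> ... <-> n[count-1] <-> n[0]. */
static void ring_init(struct node *n, int count)
{
	for (int i = 0; i < count; i++) {
		n[i].id = i;
		n[i].next = &n[(i + 1) % count];
		n[i].prev = &n[(i + count - 1) % count];
	}
}

/* Record the ids met in the open section (left, right); return how many. */
static int collect_open_section(struct node *left, struct node *right,
				int *out, int max)
{
	struct node *cur;
	int k = 0;

	opensect_for_each_reverse(cur, left, right) {
		if (k < max)
			out[k] = cur->id;
		k++;
	}
	return k;
}
```

With six nodes, the section (n[1], n[4]) yields ids 3 then 2. The adjacent pair (n[2], n[3]) yields nothing.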
On Mon, Jun 25, 2018 at 6:45 PM Greg Kroah-Hartman
<gregkh@linuxfoundation.org> wrote:
>
> On Mon, Jun 25, 2018 at 03:47:39PM +0800, Pingfan Liu wrote:
> > commit 52cdbdd49853 ("driver core: correct device's shutdown order")
> > introduces supplier<-consumer order in devices_kset. The commit tries
> > to cleverly maintain both parent<-child and supplier<-consumer order by
> > reordering a device when probing. This method makes things simple and
> > clean, but unfortunately, breaks parent<-child order in some case,
> > which is described in next patch in this series.
>
> There is no "next patch in this series" :(
>
Oh, re-arrange the patches, and forget the comment in log

> > Here this patch tries to resolve supplier<-consumer by only reordering a
> > device when it has suppliers, and takes care of the following scenario:
> > [consumer, children] [ ... potential ... ] supplier
> >                      ^                   ^
> > After moving the consumer and its children after the supplier, the
> > potentail section may contain consumers whose supplier is inside
> > children, and this poses the requirement to dry out all consumpers in
> > the section recursively.
> >
> > Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> > Cc: Grygorii Strashko <grygorii.strashko@ti.com>
> > Cc: Christoph Hellwig <hch@infradead.org>
> > Cc: Bjorn Helgaas <helgaas@kernel.org>
> > Cc: Dave Young <dyoung@redhat.com>
> > Cc: linux-pci@vger.kernel.org
> > Cc: linuxppc-dev@lists.ozlabs.org
> > Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
> > ---
> > note: there is lock issue in this patch, should be fixed in next version
>
> Please send patches that you know are correct, why would I want to
> review this if you know it is not correct?
>
> And if the original commit is causing problems for you, why not just
> revert that instead of adding this much-increased complexity?
>
Revert the original commit, then it will expose the error order
"consumer <- supplier" again.
This patch tries to resolve the error and fix the following scenario:
step0: before the consumer device's probing, (note child_a is a supplier of consumer_a, etc)
[ consumer-X, child_a, ...., child_z] [.... consumer_a, ..., consumer_z, ....] supplier-X
                                      ^^^ affected range during moving^^^
step1: When probing, moving consumer-X after supplier-X
[ child_a, ...., child_z] [.... consumer_a, ..., consumer_z, ....] supplier-X, consumer-X
But it breaks "parent <-child" seq now, and should be fixed like:
step2:
[.... consumer_a, ..., consumer_z, ....] supplier-X [ consumer-X, child_a, ...., child_z]   <--- descendants_reorder_after_pos() does it.
Again, the seq "consumer_a <- child_a" breaks the "supplier<-consumer" order, should be fixed like:
step3:
[.... consumer_z, .....] supplier-X [ consumer-X, child_a, consumer_a ...., child_z]   <--- __device_reorder_consumer() does it.
 ^^ affected range^^
The moving of consumer_a brings us to face the same scenario of step1,
hence we need an external recursion.
Each round of step3, __device_reorder_consumer() resolves its "local
affected range", which is a fraction of the "whole affected range".
Hence finally, we have all potential consumers in affected range resolved.
(Maybe I can split patch at step2 and step3 to ease the review for the next version)

Since __device_reorder_consumer() has already hold devices_kset's spin
lock, and need to get srcu lock on devices->links.consumers. This needs a
breakage of spin lock, and will incur much effort. If the above
algorithm is fine, I can do it.

> >
> > ---
> >  drivers/base/core.c | 132 ++++++++++++++++++++++++++++++++++++++++++++++++++--
> >  1 file changed, 129 insertions(+), 3 deletions(-)
> >
> > diff --git a/drivers/base/core.c b/drivers/base/core.c
> > index 66f06ff..db30e86 100644
> > --- a/drivers/base/core.c
> > +++ b/drivers/base/core.c
> > @@ -123,12 +123,138 @@ static int device_is_dependent(struct device *dev, void *target)
> >  	return ret;
> >  }
> >
> > -/* a temporary place holder to mark out the root cause of the bug.
> > - * The proposal algorithm will come in next patch
> > +struct pos_info {
> > +	struct device *pos;
> > +	struct device *tail;
> > +};
> > +
> > +/* caller takes the devices_kset->list_lock */
> > +static int descendants_reorder_after_pos(struct device *dev,
> > +		void *data)
>
> Why are you wrapping lines that do not need to be wrapped?
>
OK, will fix.

> What does this function do?
>
As the name implies, reordering dev and its children after a position.
When moving a consumer after a supplier, we break down the order of
"parent <-child" order of consumer and its children in devices_kset.
Hence we should move the children too.
The param "data" contains the position info, and its name is not
illuminated :(, since the func proto is required by
device_for_each_child(), may be better to name it as postion_info

> > +{
> > +	struct device *pos;
> > +	struct pos_info *p = data;
> > +
> > +	pos = p->pos;
> > +	pr_debug("devices_kset: Moving %s after %s\n",
> > +			dev_name(dev), dev_name(pos));
>
> You have a device, use it for debugging, i.e. dev_dbg().
>
But here we have two devices.

> > +	device_for_each_child(dev, p, descendants_reorder_after_pos);
>
> Recursive?
>
Yes, in order to move all children of the consumer.

> > +	/* children at the tail */
> > +	list_move(&dev->kobj.entry, &pos->kobj.entry);
> > +	/* record the right boundary of the section */
> > +	if (p->tail == NULL)
> > +		p->tail = dev;
> > +	return 0;
> > +}
>
> I really do not understand what the above code is supposed to be doing :(
>
The moved consumer's children may be suppliers of devices,
[.... consumer_a, ..., consumer_z, ....] supplier-X [ consumer-X, child_a, ............, child_z]
 ^^^ potential consumers ^^^^^^                                   ^^potential suppliers^^
Now, consumer_a and its supplier child_a violate the order "supplier<-consumer".
To pick out such violation, we need to check the potential suppliers
against potential consumers. And p->tail helps to record the new moved
position of child_z.

> > +
> > +/* iterate over an open section */
> > +#define list_opensect_for_each_reverse(cur, left, right) \
> > +	for (cur = right->prev; cur == left; cur = cur->prev)
> > +
> > +static bool is_consumer(struct device *query, struct device *supplier)
> > +{
> > +	struct device_link *link;
> > +	/* todo, lock protection */
>
> Always run checkpatch.pl on patches so you do not get grumpy maintainers
> telling you to run checkpatch.pl :(
>
Yes, I had run it, and only got a warning:
WARNING: Avoid crashing the kernel - try using WARN_ON & recovery code rather than BUG() or BUG_ON()
#167: FILE: drivers/base/core.c:245:
+	BUG_ON(!ret);

total: 0 errors, 1 warnings, 141 lines checked

> > +	list_for_each_entry(link, &supplier->links.consumers, s_node)
> > +		if (link->consumer == query)
> > +			return true;
> > +	return false;
> > +}
> > +
> > +/* recursively move the potential consumers in open section (left, right)
> > + * after the barrier
>
> What barrier?
>
A position that moved devices can not cross before.

> I'm stopping here as I have no idea what is going on, and this needs a
> lot more work at the basic level of "it handles locking correctly"...
>
> If you are working on this for power9, I'm guessing you work for IBM?

No. I just hit this bug.

> If so, please run this through your internal patch review process before
> sending it out again...
>
I will try my best to find some guys to review. But is the assumption
of step0 and the following algorithm worth to try?

Thanks and regards,
Pingfan
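Steps 0 to 2 of the explanation above can be modeled in userspace to show what `descendants_reorder_after_pos()` does to the list order. This sketch rests on several assumptions:

- Devices are small integers with a hypothetical `parent[]` table, not `struct device`.
- The list is an array rather than `devices_kset->list`.
- `move_after()` removes an entry and reinserts it immediately after `pos`. This is what `list_move(&dev->kobj.entry, &pos->kobj.entry)` does to a `list_head`.

As in the patch, each child subtree is moved before its parent. The parent therefore lands directly after `pos`, ahead of its descendants. `reorder_after_pos()`, `move_after()` and `parents_first()` are illustrative names only.

```c
#include <assert.h>

/* Hypothetical userspace model of step0 -> step2; not driver-core code. */
#define NDEV 5

enum { CONSUMER_X, CHILD_A, CHILD_Z, CONSUMER_A, SUPPLIER_X };

/* parent[d] is the parent of device d, or -1 if it has none. */
static const int parent[NDEV] = {
	[CONSUMER_X] = -1, [CHILD_A] = CONSUMER_X, [CHILD_Z] = CONSUMER_X,
	[CONSUMER_A] = -1, [SUPPLIER_X] = -1,
};

static int index_of(const int *order, int dev)
{
	for (int i = 0; i < NDEV; i++)
		if (order[i] == dev)
			return i;
	return -1;
}

/* Remove dev from order[] and reinsert it right after pos (pos != dev). */
static void move_after(int *order, int dev, int pos)
{
	int i, j;

	/* unlink dev: shift the entries after it one slot to the left */
	for (i = index_of(order, dev); i < NDEV - 1; i++)
		order[i] = order[i + 1];
	/* find pos among the remaining NDEV - 1 entries */
	for (j = 0; order[j] != pos; j++)
		;
	/* open a gap right after pos and drop dev into it */
	for (i = NDEV - 1; i > j + 1; i--)
		order[i] = order[i - 1];
	order[j + 1] = dev;
}

/*
 * Model of descendants_reorder_after_pos(): move each child subtree after
 * pos first, then dev itself. *tail records the first device moved, which
 * ends up as the right boundary of the moved section.
 */
static void reorder_after_pos(int *order, int dev, int pos, int *tail)
{
	for (int c = 0; c < NDEV; c++)
		if (parent[c] == dev)
			reorder_after_pos(order, c, pos, tail);
	move_after(order, dev, pos);
	if (*tail < 0)
		*tail = dev;
}

/* Return 1 if every device appears after its parent. */
static int parents_first(const int *order)
{
	for (int d = 0; d < NDEV; d++)
		if (parent[d] >= 0 &&
		    index_of(order, parent[d]) > index_of(order, d))
			return 0;
	return 1;
}
```

Starting from `consumer-X child_a child_z consumer_a supplier-X`, moving consumer-X after supplier-X gives `consumer_a supplier-X consumer-X child_z child_a`, with `child_a` recorded as the tail. Siblings end up in reverse order, which the parent<-child rule tolerates. consumer_a still sits before child_a; that is the step 3 situation.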
On Tue, Jun 26, 2018 at 11:29:48AM +0800, Pingfan Liu wrote:
> On Mon, Jun 25, 2018 at 6:45 PM Greg Kroah-Hartman
> <gregkh@linuxfoundation.org> wrote:
> >
> > On Mon, Jun 25, 2018 at 03:47:39PM +0800, Pingfan Liu wrote:
> > > commit 52cdbdd49853 ("driver core: correct device's shutdown order")
> > > introduces supplier<-consumer order in devices_kset. The commit tries
> > > to cleverly maintain both parent<-child and supplier<-consumer order by
> > > reordering a device when probing. This method makes things simple and
> > > clean, but unfortunately, breaks parent<-child order in some case,
> > > which is described in next patch in this series.
> >
> > There is no "next patch in this series" :(
> >
> Oh, re-arrange the patches, and forget the comment in log
>
> > > Here this patch tries to resolve supplier<-consumer by only reordering a
> > > device when it has suppliers, and takes care of the following scenario:
> > > [consumer, children] [ ... potential ... ] supplier
> > >                      ^                   ^
> > > After moving the consumer and its children after the supplier, the
> > > potentail section may contain consumers whose supplier is inside
> > > children, and this poses the requirement to dry out all consumpers in
> > > the section recursively.
> > >
> > > Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> > > Cc: Grygorii Strashko <grygorii.strashko@ti.com>
> > > Cc: Christoph Hellwig <hch@infradead.org>
> > > Cc: Bjorn Helgaas <helgaas@kernel.org>
> > > Cc: Dave Young <dyoung@redhat.com>
> > > Cc: linux-pci@vger.kernel.org
> > > Cc: linuxppc-dev@lists.ozlabs.org
> > > Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
> > > ---
> > > note: there is lock issue in this patch, should be fixed in next version
> >
> > Please send patches that you know are correct, why would I want to
> > review this if you know it is not correct?
> >
> > And if the original commit is causing problems for you, why not just
> > revert that instead of adding this much-increased complexity?
> >
> Revert the original commit, then it will expose the error order
> "consumer <- supplier" again.
> This patch tries to resolve the error and fix the following scenario:
> step0: before the consumer device's probing, (note child_a is a
> supplier of consumer_a, etc)
> [ consumer-X, child_a, ...., child_z] [.... consumer_a, ...,
> consumer_z, ....] supplier-X
> ^^^
> affected range during moving^^^
> step1: When probing, moving consumer-X after supplier-X
> [ child_a, ...., child_z] [.... consumer_a, ..., consumer_z,
> ....] supplier-X, consumer-X
> But it breaks "parent <-child" seq now, and should be fixed like:
> step2:
> [.... consumer_a, ..., consumer_z, ....] supplier-X [
> consumer-X, child_a, ...., child_z] <---
> descendants_reorder_after_pos() does it.
> Again, the seq "consumer_a <- child_a" breaks the "supplier<-consumer"
> order, should be fixed like:
> step3:
> [.... consumer_z, .....] supplier-X [ consumer-X, child_a,
> consumer_a ...., child_z] <--- __device_reorder_consumer() does it.
> ^^ affected range^^
> The moving of consumer_a brings us to face the same scenario of step1,
> hence we need an external recursion.

Something really got messed up here, and this all does not make any
sense :(

Can you try again?

Also, please cc: Rafael on all of this, as he wrote all of this
consumer/supplier logic and I am not that familiar with it at all.

thanks,

greg k-h
On Tue, Jun 26, 2018 at 7:54 PM Greg Kroah-Hartman
<gregkh@linuxfoundation.org> wrote:
>
> On Tue, Jun 26, 2018 at 11:29:48AM +0800, Pingfan Liu wrote:
> > On Mon, Jun 25, 2018 at 6:45 PM Greg Kroah-Hartman
> > <gregkh@linuxfoundation.org> wrote:
> > >
> > > On Mon, Jun 25, 2018 at 03:47:39PM +0800, Pingfan Liu wrote:
> > > > commit 52cdbdd49853 ("driver core: correct device's shutdown order")
> > > > introduces supplier<-consumer order in devices_kset. The commit tries
> > > > to cleverly maintain both parent<-child and supplier<-consumer order by
> > > > reordering a device when probing. This method makes things simple and
> > > > clean, but unfortunately, breaks parent<-child order in some case,
> > > > which is described in next patch in this series.
> > >
> > > There is no "next patch in this series" :(
> > >
> > Oh, re-arrange the patches, and forget the comment in log
> >
> > > > Here this patch tries to resolve supplier<-consumer by only reordering a
> > > > device when it has suppliers, and takes care of the following scenario:
> > > > [consumer, children] [ ... potential ... ] supplier
> > > >                      ^                   ^
> > > > After moving the consumer and its children after the supplier, the
> > > > potentail section may contain consumers whose supplier is inside
> > > > children, and this poses the requirement to dry out all consumpers in
> > > > the section recursively.
> > > >
> > > > Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> > > > Cc: Grygorii Strashko <grygorii.strashko@ti.com>
> > > > Cc: Christoph Hellwig <hch@infradead.org>
> > > > Cc: Bjorn Helgaas <helgaas@kernel.org>
> > > > Cc: Dave Young <dyoung@redhat.com>
> > > > Cc: linux-pci@vger.kernel.org
> > > > Cc: linuxppc-dev@lists.ozlabs.org
> > > > Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
> > > > ---
> > > > note: there is lock issue in this patch, should be fixed in next version
> > >
> > > Please send patches that you know are correct, why would I want to
> > > review this if you know it is not correct?
> > >
> > > And if the original commit is causing problems for you, why not just
> > > revert that instead of adding this much-increased complexity?
> > >
> > Revert the original commit, then it will expose the error order
> > "consumer <- supplier" again.
> > This patch tries to resolve the error and fix the following scenario:
> > step0: before the consumer device's probing, (note child_a is a
> > supplier of consumer_a, etc)
> > [ consumer-X, child_a, ...., child_z] [.... consumer_a, ...,
> > consumer_z, ....] supplier-X
> > ^^^
> > affected range during moving^^^
> > step1: When probing, moving consumer-X after supplier-X
> > [ child_a, ...., child_z] [.... consumer_a, ..., consumer_z,
> > ....] supplier-X, consumer-X
> > But it breaks "parent <-child" seq now, and should be fixed like:
> > step2:
> > [.... consumer_a, ..., consumer_z, ....] supplier-X [
> > consumer-X, child_a, ...., child_z] <---
> > descendants_reorder_after_pos() does it.
> > Again, the seq "consumer_a <- child_a" breaks the "supplier<-consumer"
> > order, should be fixed like:
> > step3:
> > [.... consumer_z, .....] supplier-X [ consumer-X, child_a,
> > consumer_a ...., child_z] <--- __device_reorder_consumer() does it.
> > ^^ affected range^^
> > The moving of consumer_a brings us to face the same scenario of step1,
> > hence we need an external recursion.
>
> Something really got messed up here, and this all does not make any
> sense :(
>
> Can you try again?
>
> Also, please cc: Rafael on all of this, as he wrote all of this
> consumer/supplier logic and I am not that familiar with it at all.
>
Cc Rafael J. Wysocki for the context.
I will send out V3 soon.

Regards,
Pingfan
diff --git a/drivers/base/core.c b/drivers/base/core.c
index 66f06ff..db30e86 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -123,12 +123,138 @@ static int device_is_dependent(struct device *dev, void *target)
 	return ret;
 }
 
-/* a temporary place holder to mark out the root cause of the bug.
- * The proposal algorithm will come in next patch
+struct pos_info {
+	struct device *pos;
+	struct device *tail;
+};
+
+/* caller takes the devices_kset->list_lock */
+static int descendants_reorder_after_pos(struct device *dev,
+		void *data)
+{
+	struct device *pos;
+	struct pos_info *p = data;
+
+	pos = p->pos;
+	pr_debug("devices_kset: Moving %s after %s\n",
+			dev_name(dev), dev_name(pos));
+	device_for_each_child(dev, p, descendants_reorder_after_pos);
+	/* children at the tail */
+	list_move(&dev->kobj.entry, &pos->kobj.entry);
+	/* record the right boundary of the section */
+	if (p->tail == NULL)
+		p->tail = dev;
+	return 0;
+}
+
+/* iterate over an open section */
+#define list_opensect_for_each_reverse(cur, left, right) \
+	for (cur = right->prev; cur == left; cur = cur->prev)
+
+static bool is_consumer(struct device *query, struct device *supplier)
+{
+	struct device_link *link;
+	/* todo, lock protection */
+	list_for_each_entry(link, &supplier->links.consumers, s_node)
+		if (link->consumer == query)
+			return true;
+	return false;
+}
+
+/* recursively move the potential consumers in open section (left, right)
+ * after the barrier
+ */
+static int __device_reorder_consumer(struct device *consumer,
+	struct list_head *left, struct list_head *right,
+	struct pos_info *p)
+{
+	struct list_head *iter;
+	struct device *c_dev, *s_dev, *tail_dev;
+
+	descendants_reorder_after_pos(consumer, p);
+	tail_dev = p->tail;
+	/* (left, right) may contain consumers, hence checking if any moved
+	 * child serving as supplier. The reversing order help us to meet
+	 * the last supplier of a consumer.
+	 */
+	list_opensect_for_each_reverse(iter, left, right) {
+		struct list_head *l_iter, *moved_left, *moved_right;
+
+		moved_left = (&consumer->kobj.entry)->prev;
+		moved_right = tail_dev->kobj.entry.next;
+		/* the moved section may contain potential suppliers */
+		list_opensect_for_each_reverse(l_iter, moved_left,
+			moved_right) {
+			s_dev = list_entry(l_iter, struct device, kobj.entry);
+			c_dev = list_entry(iter, struct device, kobj.entry);
+			/* to fix: this poses extra effort for locking */
+			if (is_consumer(c_dev, s_dev)) {
+				p->tail = NULL;
+				/* to fix: lock issue */
+				p->pos = s_dev;
+				/* reorder after the last supplier */
+				__device_reorder_consumer(c_dev,
+					l_iter, right, p);
+			}
+		}
+	}
+	return 0;
+}
+
+static int find_last_supplier(struct device *dev, struct device *supplier)
+{
+	struct device_link *link;
+
+	list_for_each_entry_reverse(link, &dev->links.suppliers, c_node) {
+		if (link->supplier == supplier)
+			return 1;
+	}
+	if (dev == supplier)
+		return -1;
+	return 0;
+}
+
+/* When reodering, take care of the range of (old_pos(dev), new_pos(dev)),
+ * there may be requirement to recursively move item.
  */
 int device_reorder_consumer(struct device *dev)
 {
-	devices_kset_move_last(dev);
+	struct list_head *iter, *left, *right;
+	struct device *cur_dev;
+	struct pos_info info;
+	int ret, idx;
+
+	idx = device_links_read_lock();
+	if (list_empty(&dev->links.suppliers)) {
+		device_links_read_unlock(idx);
+		return 0;
+	}
+	spin_lock(&devices_kset->list_lock);
+	list_for_each_prev(iter, &devices_kset->list) {
+		cur_dev = list_entry(iter, struct device, kobj.entry);
+		ret = find_last_supplier(dev, cur_dev);
+		switch (ret) {
+		case -1:
+			goto unlock;
+		case 1:
+			break;
+		case 0:
+			continue;
+		}
+	}
+	BUG_ON(!ret);
+
+	/* record the affected open section */
+	left = dev->kobj.entry.prev;
+	right = iter;
+	info.pos = list_entry(iter, struct device, kobj.entry);
+	info.tail = NULL;
+	/* dry out the consumers in (left,right) */
+	__device_reorder_consumer(dev, left, right, &info);
+
+unlock:
+	spin_unlock(&devices_kset->list_lock);
+	device_links_read_unlock(idx);
 	return 0;
 }
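One detail of the `device_reorder_consumer()` loop above may deserve a second look. In C, a `break` inside a `switch` leaves only the `switch`, not the enclosing `list_for_each_prev()` loop. As written, `case 1` therefore does not stop the backwards scan at the last supplier. The scan keeps going until it reaches `dev` itself, and `case -1` then takes `goto unlock`.

The userspace sketch below contrasts that shape with a loop that exits explicitly. A plain array of `find_last_supplier()`-style result codes stands in for the list walk. `scan_switch_break()` and `scan_until_supplier()` are hypothetical names, not the patch's code.

```c
#include <assert.h>

/*
 * Hypothetical model: scan[] holds the codes find_last_supplier() would
 * return, in the order a backwards walk meets the list entries:
 *   1 = supplier of dev, -1 = dev itself, 0 = anything else.
 */

/* Posted shape: "break" exits only the switch, so the walk keeps going. */
static int scan_switch_break(const int *scan, int n)
{
	int i, ret = 0;

	for (i = 0; i < n; i++) {
		ret = scan[i];
		switch (ret) {
		case -1:
			return -1;	/* reached dev: models "goto unlock" */
		case 1:
			break;		/* leaves the switch, not the for loop */
		case 0:
			continue;
		}
	}
	return i;		/* n if the walk was never stopped */
}

/* Variant that stops at the first supplier met, by leaving the loop. */
static int scan_until_supplier(const int *scan, int n)
{
	int i;

	for (i = 0; i < n; i++) {
		if (scan[i] == -1)
			return -1;	/* dev met first: already after its suppliers */
		if (scan[i] == 1)
			break;		/* here "break" ends the for loop */
	}
	return i;		/* index of the last supplier, or n */
}
```

Take a walk that meets a supplier before `dev`: `{ 0, 1, 0, -1, 0 }`. The first function still ends at `dev` and returns -1. The second stops at index 1.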
commit 52cdbdd49853 ("driver core: correct device's shutdown order")
introduces supplier<-consumer order in devices_kset. The commit tries
to cleverly maintain both parent<-child and supplier<-consumer order by
reordering a device when probing. This method makes things simple and
clean, but unfortunately, breaks parent<-child order in some cases,
which is described in next patch in this series.

Here this patch tries to resolve supplier<-consumer by only reordering a
device when it has suppliers, and takes care of the following scenario:
[consumer, children] [ ... potential ... ] supplier
                     ^                   ^
After moving the consumer and its children after the supplier, the
potential section may contain consumers whose supplier is inside
children, and this poses the requirement to dry out all consumers in
the section recursively.

Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Grygorii Strashko <grygorii.strashko@ti.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Bjorn Helgaas <helgaas@kernel.org>
Cc: Dave Young <dyoung@redhat.com>
Cc: linux-pci@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
---
note: there is lock issue in this patch, should be fixed in next version
---
 drivers/base/core.c | 132 ++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 129 insertions(+), 3 deletions(-)
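The commit message wants devices_kset to satisfy two ordering rules, parent<-child and supplier<-consumer, with the left element first. Both can be stated as a single check over a list order.

This is a hypothetical userspace sketch, not driver-core code. Devices are small integers, and `parent[]` and `struct link` stand in for `dev->parent` and `struct device_link`. `order_is_valid()` is an illustrative name.

```c
#include <assert.h>

#define MAX_DEV 8	/* device ids must be 0..n-1 with n <= MAX_DEV */

/* Hypothetical supplier/consumer pair, standing in for struct device_link. */
struct link {
	int supplier;
	int consumer;
};

/*
 * Return 1 if order[] puts every parent before its children and every
 * supplier before its consumers, 0 otherwise. parent[d] < 0 means no parent.
 */
static int order_is_valid(const int *order, int n, const int *parent,
			  const struct link *links, int nlinks)
{
	int pos[MAX_DEV];

	for (int i = 0; i < n; i++)
		pos[order[i]] = i;	/* position of each device in the list */
	for (int d = 0; d < n; d++)
		if (parent[d] >= 0 && pos[parent[d]] > pos[d])
			return 0;	/* parent<-child violated */
	for (int k = 0; k < nlinks; k++)
		if (pos[links[k].supplier] > pos[links[k].consumer])
			return 0;	/* supplier<-consumer violated */
	return 1;
}
```

The test uses one child and one consumer: device 0 is consumer-X, 1 is child_a, 2 is consumer_a and 3 is supplier-X. The step0 order from the thread fails, because supplier-X comes after consumer-X. The step2 order also fails, because child_a comes after consumer_a. The step3 order `supplier-X consumer-X child_a consumer_a` passes.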