Message ID | 1516955496-17236-1-git-send-email-cpandya@codeaurora.org (mailing list archive)
---|---
State | Not Applicable, archived
Delegated to: | Andy Gross
On 2018-01-26 09:31, Chintan Pandya wrote:
> Implement, device-phandle relation in hash-table so
> that look up can be faster, irrespective of where my
> device is defined in the DT.
>
> There are ~6.7k calls to of_find_node_by_phandle() and
> total improvement observed during boot is 400ms.

I'm probably missing something obvious, but: Aren't phandles in practice
small consecutive integers assigned by dtc? If so, why not just have a
smallish static array mapping the small phandle values directly to
device node, instead of adding a pointer to every struct device_node? Or
one could determine the size of the array dynamically (largest seen
phandle value, capping at something sensible, e.g. 1024).

In either case, one would still need to keep the code doing the
whole-tree traversal for handling large phandle values, but I think the
above should make lookup O(1) in most cases.

Alternatively, one could just count the number of nodes with a phandle,
allocate an array of that many pointers (so the memory use is certainly
no more than if adding a pointer to each device_node), and sort it by
phandle, so one can do lookup using a binary search.

Rasmus
--
To unsubscribe from this list: send the line "unsubscribe linux-arm-msm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
> I'm probably missing something obvious, but: Aren't phandles in practice
> small consecutive integers assigned by dtc? If so, why not just have a
> smallish static array mapping the small phandle values directly to
> device node, instead of adding a pointer to every struct device_node? Or
> one could determine the size of the array dynamically (largest seen
> phandle value, capping at something sensible, e.g. 1024).

Hadn't noticed this earlier! If the following is known or true, we can avoid
using a hash-table and save the per-device_node hlist_node.

1. How do we know the max phandle value without traversing the tree once?
   In my case, the max is 1263.
2. Although I haven't observed a collision, is every device_node
   associated with a unique phandle value?

> In either case, one would still need to keep the code doing the
> whole-tree traversal for handling large phandle values, but I think the
> above should make lookup O(1) in most cases.

I would refrain from doing this because it would make this API inconsistent
in terms of time taken for different nodes. I see that people do change
where their device is placed in the DT, and that changes the time taken in
the of_* APIs for them while also affecting others.

> Alternatively, one could just count the number of nodes with a phandle,
> allocate an array of that many pointers (so the memory use is certainly
> no more than if adding a pointer to each device_node), and sort it by
> phandle, so one can do lookup using a binary search.
>
> Rasmus

This is certainly doable if the current approach is not welcomed due to the
addition of an hlist_node in device_node.

Chintan Pandya
On Fri, Jan 26, 2018 at 9:14 AM, Chintan Pandya <cpandya@codeaurora.org> wrote:
>
>> I'm probably missing something obvious, but: Aren't phandles in practice
>> small consecutive integers assigned by dtc? If so, why not just have a
>> smallish static array mapping the small phandle values directly to
>> device node, instead of adding a pointer to every struct device_node? Or
>> one could determine the size of the array dynamically (largest seen
>> phandle value, capping at something sensible, e.g. 1024).

I do have some concerns that this is a bit fragile and dependent on dtc's
implementation. However, I guess we already kind of are with overlay
phandles.

> Haven't noticed this earlier! If the following is known or true, we can
> avoid using a hash-table and save the per-device_node hlist_node.
>
> 1. How to know max phandle value without traversing tree once? In my case,
>    max is 1263.

We already have to know it for overlays. Plus unflattening has to handle
phandles specially already, so it would be easy to do there if we aren't
already. Then the question is what to do with overlays. For now, we can
probably assume too few phandles to matter.

> 2. Although I haven't observed collision, is it like every device_node
>    is associated with unique phandle value?

Yes, phandles must be unique.

>> In either case, one would still need to keep the code doing the
>> whole-tree traversal for handling large phandle values, but I think the
>> above should make lookup O(1) in most cases.
>
> I would refrain doing this because that will make this API inconsistent
> in terms of time taken by different nodes. I see that people do change
> their device placing in DT and that changes time taken in of_* APIs for
> them but affecting others.

Who cares? It's the total time that matters. It's obviously not a problem
until you have 1000s of lookups, as no one cared until recently (though you
aren't the first with a hash table lookup).
>> Alternatively, one could just count the number of nodes with a phandle,
>> allocate an array of that many pointers (so the memory use is certainly
>> no more than if adding a pointer to each device_node), and sort it by
>> phandle, so one can do lookup using a binary search.
>>
>> Rasmus
>
> This is certainly doable if the current approach is not welcomed due to
> the addition of an hlist_node in device_node.

I certainly prefer an out-of-band approach as that's easier to turn off if
we want to save memory. Still, I'd like to see some data on a cache based
approach and reasons why that won't work.

Rob
Hi Chintan,
Thank you for the patch! Yet something to improve:
[auto build test ERROR on robh/for-next]
[also build test ERROR on v4.15-rc9 next-20180126]
[cannot apply to glikely/devicetree/next]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
url: https://github.com/0day-ci/linux/commits/Chintan-Pandya/of-use-hash-based-search-in-of_find_node_by_phandle/20180127-043814
base: https://git.kernel.org/pub/scm/linux/kernel/git/robh/linux.git for-next
config: sparc64-defconfig (attached as .config)
compiler: sparc64-linux-gnu-gcc (Debian 7.2.0-11) 7.2.0
reproduce:
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
make.cross ARCH=sparc64
All errors (new ones prefixed by >>):
drivers/of/base.o: In function `of_find_node_by_phandle':
>> base.c:(.text+0x5ec): undefined reference to `dt_hash_spinlock'
base.c:(.text+0x5f4): undefined reference to `dt_hash_spinlock'
base.c:(.text+0x5fc): undefined reference to `dt_hash_table'
base.c:(.text+0x604): undefined reference to `dt_hash_table'
base.c:(.text+0x61c): undefined reference to `dt_hash_spinlock'
base.c:(.text+0x628): undefined reference to `dt_hash_spinlock'
base.c:(.text+0x64c): undefined reference to `dt_hash_spinlock'
---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation
Hi Chintan,
Thank you for the patch! Yet something to improve:
[auto build test ERROR on robh/for-next]
[also build test ERROR on v4.15-rc9 next-20180126]
[cannot apply to glikely/devicetree/next]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
url: https://github.com/0day-ci/linux/commits/Chintan-Pandya/of-use-hash-based-search-in-of_find_node_by_phandle/20180127-043814
base: https://git.kernel.org/pub/scm/linux/kernel/git/robh/linux.git for-next
config: sparc-defconfig (attached as .config)
compiler: sparc-linux-gcc (GCC) 7.2.0
reproduce:
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
make.cross ARCH=sparc
All errors (new ones prefixed by >>):
drivers/of/base.o: In function `of_find_node_by_phandle':
>> base.c:(.text+0x1ac): undefined reference to `dt_hash_table'
base.c:(.text+0x1b4): undefined reference to `dt_hash_table'
---
Hi Chintan,

On 01/26/18 00:31, Chintan Pandya wrote:
> of_find_node_by_phandle() takes a lot of time (1ms per
> call) to find right node when your intended device is
> too deeper in the fdt. Reason is, we search for each
> device serially in the fdt. See this,
>
> struct device_node *__of_find_all_nodes(struct device_node *prev)
> {
>     struct device_node *np;
>
>     if (!prev) {
>         np = of_root;
>     } else if (prev->child) {
>         np = prev->child;
>     } else {
>         /* Walk back up looking for a sibling, or the end of the structure */
>         np = prev;
>         while (np->parent && !np->sibling)
>             np = np->parent;
>         np = np->sibling; /* Might be null at the end of the tree */
>     }
>     return np;
> }
>
> #define for_each_of_allnodes_from(from, dn) \
>     for (dn = __of_find_all_nodes(from); dn; dn = __of_find_all_nodes(dn))
> #define for_each_of_allnodes(dn) for_each_of_allnodes_from(NULL, dn)
>
> Implement, device-phandle relation in hash-table so
> that look up can be faster, irrespective of where my
> device is defined in the DT.
>
> There are ~6.7k calls to of_find_node_by_phandle() and
> total improvement observed during boot is 400ms.
>
> Signed-off-by: Chintan Pandya <cpandya@codeaurora.org>
> ---
>  drivers/of/base.c | 8 ++++++--

< snip >

I asked some questions in the version 1 thread and did not get answers.
I am copying the 3 questions here.

(1)

>>>
>> Please give me a pointer to the code that is doing
>> this search.
>>
>> -Frank

> You can refer include/linux/of.h
>
> #define for_each_of_allnodes_from(from, dn) \
>     for (dn = __of_find_all_nodes(from); dn; dn = __of_find_all_nodes(dn))
> #define for_each_of_allnodes(dn) for_each_of_allnodes_from(NULL, dn)
>
> where __of_find_all_nodes() does
>
> struct device_node *__of_find_all_nodes(struct device_node *prev)
> {
>     struct device_node *np;
>
>     if (!prev) {
>         np = of_root;
>     } else if (prev->child) {
>         np = prev->child;
>     } else {
>         /* Walk back up looking for a sibling, or the end of the structure */
>         np = prev;
>         while (np->parent && !np->sibling)
>             np = np->parent;
>         np = np->sibling; /* Might be null at the end of the tree */
>     }
>     return np;
> }

Let me restate my question. Can you point me to the driver code that is
invoking the search?

(2)

And also the .dts devicetree source file that you are seeing large
overhead with.

(3) -- this one is less important, but if the info is easily available
to you

Sorry about dribbling out questions instead of all at once....

What is the hardware you are testing this on? Processor? Cache size?
Memory size? Processor frequency? Any other attribute of the system
that will help me understand the boot performance you are seeing?

Thanks,

Frank
> (1)
>
> Can you point me to the driver code that is invoking
> the search?

There are many locations. A few of them:

https://source.codeaurora.org/quic/la/kernel/msm-4.9/tree/drivers/of/irq.c?h=msm-4.9#n214
https://source.codeaurora.org/quic/la/kernel/msm-4.9/tree/drivers/irqchip/irq-gic-v3.c?h=msm-4.9#n1107
https://source.codeaurora.org/quic/la/kernel/msm-4.9/tree/drivers/clk/msm/msm-clock-controller.c?h=msm-4.9#n492

> (2)
>
> And also the .dts devicetree source file that you are seeing
> large overhead with.

The SDM670 DTS tree starts here:

https://source.codeaurora.org/quic/la/kernel/msm-4.9/tree/arch/arm64/boot/dts/qcom/sdm670.dtsi?h=msm-4.9

> (3) -- this one is less important, but if the info is easily
> available to you
>
> Sorry about dribbling out questions instead of all at once....
>
> What is the hardware you are testing this on?

SDM670

> Processor?

Kryo-300 Silver

> Cache size?

From DT,
L1 32KB (per CPU)
L2 128KB (per CPU)
L3 1MB (total)

> Memory size?

6GB

> Processor frequency?

Max 1.7GHz for core 0. Not sure about boot time frequency.

> Any other attribute of the system that will help me understand
> the boot performance you are seeing?

I'm not able to profile of_find_node_by_phandle specifically as timers
are not up by then, so I'm just observing overall boot time for
comparison.

My recent results were taken with debug_defconfig, which has a lot of
performance-slowing code, so the gap between the base build and the
build with the test patches would be larger than on the actual
production build.

Thanks,
Chintan
On 01/30/18 00:04, Chintan Pandya wrote:
>> (1)
>>
>> Can you point me to the driver code that is invoking
>> the search?
>
> There are many locations. Few of them being,
> https://source.codeaurora.org/quic/la/kernel/msm-4.9/tree/drivers/of/irq.c?h=msm-4.9#n214
> https://source.codeaurora.org/quic/la/kernel/msm-4.9/tree/drivers/irqchip/irq-gic-v3.c?h=msm-4.9#n1107
> https://source.codeaurora.org/quic/la/kernel/msm-4.9/tree/drivers/clk/msm/msm-clock-controller.c?h=msm-4.9#n492
>
>> (2)
>>
>> And also the .dts devicetree source file that you are seeing
>> large overhead with.
>
> SDM670 DTS tree starts here.
> https://source.codeaurora.org/quic/la/kernel/msm-4.9/tree/arch/arm64/boot/dts/qcom/sdm670.dtsi?h=msm-4.9

Thanks, I'm starting to get a picture of what you are facing.

The file you point at is a .dtsi, not a .dts. There are quite a few
sdm670*.dts files in that tree. Which one corresponds to the system you
are working on? I picked a random one (sdm670-cdp.dts), but I need to
be looking at the .dts that you are using to be able to ask reasonable
questions instead of poking around in the dark.

Also, when I clone that git tree there is not a valid HEAD. I picked
what looked like a reasonable commit to checkout, but I would prefer to
be looking at the same exact source that you are working with. Can you
give me the commit id of the branch that you are working on?

-Frank

>>
>> (3) -- this one is less important, but if the info is easily
>> available to you
>>
>> Sorry about dribbling out questions instead of all at once....
>>
>> What is the hardware you are testing this on?
>
> SDM670
>
>> Processor?
>
> Kryo-300 Silver
>
>> Cache size?
>
> From DT,
> L1 32KB (per CPU)
> L2 128KB (per CPU)
> L3 1MB (total)
>
>> Memory size?
>
> 6GB
>
>> Processor frequency?
>
> Max 1.7GHz for core 0. Not sure about boot time frequency.
>
>> Any other attribute of the system that will help me understand
>> the boot performance you are seeing?
> I'm not able to profile of_find_node_by_phandle specifically as timers
> are not up by then. So, just observing overall boot time for comparison.
>
> My recent results were taken on debug_defconfig which has many
> performance slowing code. So, gap between base-build and w/ the test
> patches would be more than the actual production build.
>
> Thanks,
> Chintan
diff --git a/drivers/of/base.c b/drivers/of/base.c
index 26618ba..bfbfa99 100644
--- a/drivers/of/base.c
+++ b/drivers/of/base.c
@@ -1005,10 +1005,14 @@ struct device_node *of_find_node_by_phandle(phandle handle)
 	if (!handle)
 		return NULL;
 
-	raw_spin_lock_irqsave(&devtree_lock, flags);
-	for_each_of_allnodes(np)
+	spin_lock(&dt_hash_spinlock);
+	hash_for_each_possible(dt_hash_table, np, hash, handle)
 		if (np->phandle == handle)
 			break;
+
+	spin_unlock(&dt_hash_spinlock);
+
+	raw_spin_lock_irqsave(&devtree_lock, flags);
 	of_node_get(np);
 	raw_spin_unlock_irqrestore(&devtree_lock, flags);
 	return np;
diff --git a/drivers/of/fdt.c b/drivers/of/fdt.c
index 4675e5a..62a9a4c 100644
--- a/drivers/of/fdt.c
+++ b/drivers/of/fdt.c
@@ -33,6 +33,10 @@
 
 #include "of_private.h"
 
+static bool dt_hash_needs_init = true;
+DECLARE_HASHTABLE(dt_hash_table, DT_HASH_BITS);
+DEFINE_SPINLOCK(dt_hash_spinlock);
+
 /*
  * of_fdt_limit_memory - limit the number of regions in the /memory node
  * @limit: maximum entries
@@ -242,6 +246,20 @@ static void populate_properties(const void *blob,
 		pprev = &pp->next;
 	}
 
+	/*
+	 * In 'dryrun = true' cases, np is some non-NULL junk. So, protect
+	 * against those cases.
+	 */
+	if (!dryrun && np->phandle) {
+		spin_lock(&dt_hash_spinlock);
+		if (dt_hash_needs_init) {
+			dt_hash_needs_init = false;
+			hash_init(dt_hash_table);
+		}
+		hash_add(dt_hash_table, &np->hash, np->phandle);
+		spin_unlock(&dt_hash_spinlock);
+	}
+
 	/* With version 0x10 we may not have the name property,
 	 * recreate it here from the unit name if absent
 	 */
diff --git a/include/linux/of.h b/include/linux/of.h
index d3dea1d..2e3ba84 100644
--- a/include/linux/of.h
+++ b/include/linux/of.h
@@ -25,6 +25,7 @@
 #include <linux/notifier.h>
 #include <linux/property.h>
 #include <linux/list.h>
+#include <linux/hashtable.h>
 
 #include <asm/byteorder.h>
 #include <asm/errno.h>
@@ -69,6 +70,7 @@ struct device_node {
 #endif
 	unsigned long _flags;
 	void	*data;
+	struct hlist_node hash;
 #if defined(CONFIG_SPARC)
 	const char *path_component_name;
 	unsigned int unique_id;
@@ -76,6 +78,10 @@ struct device_node {
 #endif
 };
 
+#define DT_HASH_BITS 6
+extern DECLARE_HASHTABLE(dt_hash_table, DT_HASH_BITS);
+extern spinlock_t dt_hash_spinlock;
+
 #define MAX_PHANDLE_ARGS 16
 struct of_phandle_args {
 	struct device_node *np;
of_find_node_by_phandle() takes a lot of time (1ms per
call) to find the right node when your intended device is
deep in the fdt. The reason is that we search for each
device serially in the fdt. See this,

struct device_node *__of_find_all_nodes(struct device_node *prev)
{
	struct device_node *np;

	if (!prev) {
		np = of_root;
	} else if (prev->child) {
		np = prev->child;
	} else {
		/* Walk back up looking for a sibling, or the end of the structure */
		np = prev;
		while (np->parent && !np->sibling)
			np = np->parent;
		np = np->sibling; /* Might be null at the end of the tree */
	}
	return np;
}

#define for_each_of_allnodes_from(from, dn) \
	for (dn = __of_find_all_nodes(from); dn; dn = __of_find_all_nodes(dn))
#define for_each_of_allnodes(dn) for_each_of_allnodes_from(NULL, dn)

Implement the device-phandle relation in a hash-table so
that lookup can be faster, irrespective of where the
device is defined in the DT.

There are ~6.7k calls to of_find_node_by_phandle() and the
total improvement observed during boot is 400ms.

Signed-off-by: Chintan Pandya <cpandya@codeaurora.org>
---
 drivers/of/base.c  |  8 ++++++--
 drivers/of/fdt.c   | 18 ++++++++++++++++++
 include/linux/of.h |  6 ++++++
 3 files changed, 30 insertions(+), 2 deletions(-)