
[v2] of: use hash based search in of_find_node_by_phandle

Message ID 1516955496-17236-1-git-send-email-cpandya@codeaurora.org (mailing list archive)
State Not Applicable, archived
Delegated to: Andy Gross
Headers show

Commit Message

Chintan Pandya Jan. 26, 2018, 8:31 a.m. UTC
of_find_node_by_phandle() takes a long time (~1 ms per
call) to find the right node when the intended device is
deep in the fdt. The reason is that we search for each
device serially in the fdt. See this,

struct device_node *__of_find_all_nodes(struct device_node *prev)
{
        struct device_node *np;
        if (!prev) {
                np = of_root;
        } else if (prev->child) {
                np = prev->child;
        } else {
                /* Walk back up looking for a sibling, or the end of the structure */
                np = prev;
                while (np->parent && !np->sibling)
                        np = np->parent;
                np = np->sibling; /* Might be null at the end of the tree */
        }
        return np;
}

#define for_each_of_allnodes_from(from, dn) \
        for (dn = __of_find_all_nodes(from); dn; dn = __of_find_all_nodes(dn))
#define for_each_of_allnodes(dn) for_each_of_allnodes_from(NULL, dn)

Implement the device-phandle relation in a hash table so
that lookup can be faster, irrespective of where the
device is defined in the DT.

There are ~6.7k calls to of_find_node_by_phandle(), and
the total improvement observed during boot is 400 ms.

Signed-off-by: Chintan Pandya <cpandya@codeaurora.org>
---
 drivers/of/base.c  |  8 ++++++--
 drivers/of/fdt.c   | 18 ++++++++++++++++++
 include/linux/of.h |  6 ++++++
 3 files changed, 30 insertions(+), 2 deletions(-)

Comments

Rasmus Villemoes Jan. 26, 2018, 10:57 a.m. UTC | #1
On 2018-01-26 09:31, Chintan Pandya wrote:
> Implement the device-phandle relation in a hash table so
> that lookup can be faster, irrespective of where the
> device is defined in the DT.
> 
> There are ~6.7k calls to of_find_node_by_phandle(), and
> the total improvement observed during boot is 400 ms.

I'm probably missing something obvious, but: Aren't phandles in practice
small consecutive integers assigned by dtc? If so, why not just have a
smallish static array mapping the small phandle values directly to
device node, instead of adding a pointer to every struct device_node? Or
one could determine the size of the array dynamically (largest seen
phandle value, capping at something sensible, e.g. 1024).

In either case, one would still need to keep the code doing the
whole-tree traversal for handling large phandle values, but I think the
above should make lookup O(1) in most cases.
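
[Editor's note: the direct-map idea above could be sketched roughly as follows. This is a userspace sketch with a stand-in struct and invented names; the size cap and the fallback contract are assumptions, not code from the patch.]

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the kernel's struct device_node; only what the sketch needs. */
struct device_node {
	uint32_t phandle;
};

/* Hypothetical cap on directly mapped phandle values. */
#define PHANDLE_MAP_SIZE 1024

static struct device_node *phandle_map[PHANDLE_MAP_SIZE];

/* Called once per node while unflattening the fdt. */
static void map_insert(struct device_node *np)
{
	if (np->phandle && np->phandle < PHANDLE_MAP_SIZE)
		phandle_map[np->phandle] = np;
}

/*
 * O(1) lookup for small phandles; a NULL return means the caller must
 * fall back to the whole-tree traversal (for_each_of_allnodes).
 */
static struct device_node *map_lookup(uint32_t handle)
{
	if (!handle || handle >= PHANDLE_MAP_SIZE)
		return NULL;
	return phandle_map[handle];
}
```

Since dtc assigns phandles as small consecutive integers, most lookups would hit the array; only out-of-range values would take the slow path.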

Alternatively, one could just count the number of nodes with a phandle,
allocate an array of that many pointers (so the memory use is certainly
no more than if adding a pointer to each device_node), and sort it by
phandle, so one can do lookup using a binary search.
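
[Editor's note: the sorted-array alternative might look like this. Again a userspace sketch; the globals and their names are made up for illustration.]

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the kernel's struct device_node. */
struct device_node {
	uint32_t phandle;
};

/*
 * Built once at unflatten time: pointers to all nodes that carry a
 * phandle, sorted by phandle value (hypothetical globals).
 */
static struct device_node **phandle_sorted;
static size_t phandle_count;

static struct device_node *phandle_bsearch(uint32_t handle)
{
	size_t lo = 0, hi = phandle_count;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;
		uint32_t ph = phandle_sorted[mid]->phandle;

		if (ph == handle)
			return phandle_sorted[mid];
		if (ph < handle)
			lo = mid + 1;
		else
			hi = mid;
	}
	return NULL;	/* not present: fall back to the tree walk */
}
```

Memory use is one pointer per phandle-carrying node, and lookup is O(log n) regardless of how the phandle values are distributed.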

Rasmus
--
To unsubscribe from this list: send the line "unsubscribe linux-arm-msm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Chintan Pandya Jan. 26, 2018, 3:14 p.m. UTC | #2
> I'm probably missing something obvious, but: Aren't phandles in practice
> small consecutive integers assigned by dtc? If so, why not just have a
> smallish static array mapping the small phandle values directly to
> device node, instead of adding a pointer to every struct device_node? Or
> one could determine the size of the array dynamically (largest seen
> phandle value, capping at something sensible, e.g. 1024).
Hadn't noticed this earlier! If the following is known or true, we can
avoid using a hash table and save the per-device_node hlist_node.

     1. How do we know the max phandle value without traversing the tree
        once? In my case, the max is 1263.
     2. Although I haven't observed a collision, is every device_node
        associated with a unique phandle value?
> In either case, one would still need to keep the code doing the
> whole-tree traversal for handling large phandle values, but I think the
> above should make lookup O(1) in most cases.
I would refrain from doing this because it would make this API
inconsistent in the time taken for different nodes. I see that people do
change their device placement in the DT, and that changes the time taken
in the of_* APIs for them while also affecting others.

> Alternatively, one could just count the number of nodes with a phandle,
> allocate an array of that many pointers (so the memory use is certainly
> no more than if adding a pointer to each device_node), and sort it by
> phandle, so one can do lookup using a binary search.
>
> Rasmus
This is certainly doable if the current approach is not welcomed due to
the addition of an hlist_node in device_node.

Chintan Pandya
Rob Herring Jan. 26, 2018, 3:34 p.m. UTC | #3
On Fri, Jan 26, 2018 at 9:14 AM, Chintan Pandya <cpandya@codeaurora.org> wrote:
>
>> I'm probably missing something obvious, but: Aren't phandles in practice
>> small consecutive integers assigned by dtc? If so, why not just have a
>> smallish static array mapping the small phandle values directly to
>> device node, instead of adding a pointer to every struct device_node? Or
>> one could determine the size of the array dynamically (largest seen
>> phandle value, capping at something sensible, e.g. 1024).

I do have some concerns that this is a bit fragile and dependent on
dtc's implementation. However, I guess we already kind of are with
overlay phandles.

> Hadn't noticed this earlier! If the following is known or true, we can
> avoid using a hash table and save the per-device_node hlist_node.
>
>     1. How do we know the max phandle value without traversing the tree
>        once? In my case, the max is 1263.

We already have to know it for overlays. Plus unflattening has to
handle phandles specially already, so it would be easy to do there if
we aren't already.

Then there's the question of what to do with overlays. For now, we can
probably assume there are too few phandles to matter.

>     2. Although I haven't observed a collision, is every device_node
>        associated with a unique phandle value?

Yes, phandles must be unique.

>> In either case, one would still need to keep the code doing the
>> whole-tree traversal for handling large phandle values, but I think the
>> above should make lookup O(1) in most cases.
>
> I would refrain from doing this because it would make this API
> inconsistent in the time taken for different nodes. I see that people do
> change their device placement in the DT, and that changes the time taken
> in the of_* APIs for them while also affecting others.

Who cares. It's the total time that matters. It's obviously not a
problem until you have 1000s of lookups as no one cared until recently
(though you aren't the first with a hash table lookup).

>> Alternatively, one could just count the number of nodes with a phandle,
>> allocate an array of that many pointers (so the memory use is certainly
>> no more than if adding a pointer to each device_node), and sort it by
>> phandle, so one can do lookup using a binary search.
>>
>> Rasmus
>
> This is certainly doable if the current approach is not welcomed due to
> the addition of an hlist_node in device_node.

I certainly prefer an out-of-band approach, as that's easier to turn off
if we want to save memory.

Still, I'd like to see some data on a cache-based approach and reasons
why that won't work.

Rob
kernel test robot Jan. 26, 2018, 11:11 p.m. UTC | #4
Hi Chintan,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on robh/for-next]
[also build test ERROR on v4.15-rc9 next-20180126]
[cannot apply to glikely/devicetree/next]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Chintan-Pandya/of-use-hash-based-search-in-of_find_node_by_phandle/20180127-043814
base:   https://git.kernel.org/pub/scm/linux/kernel/git/robh/linux.git for-next
config: sparc64-defconfig (attached as .config)
compiler: sparc64-linux-gnu-gcc (Debian 7.2.0-11) 7.2.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=sparc64 

All errors (new ones prefixed by >>):

   drivers/of/base.o: In function `of_find_node_by_phandle':
>> base.c:(.text+0x5ec): undefined reference to `dt_hash_spinlock'
   base.c:(.text+0x5f4): undefined reference to `dt_hash_spinlock'
   base.c:(.text+0x5fc): undefined reference to `dt_hash_table'
   base.c:(.text+0x604): undefined reference to `dt_hash_table'
   base.c:(.text+0x61c): undefined reference to `dt_hash_spinlock'
   base.c:(.text+0x628): undefined reference to `dt_hash_spinlock'
   base.c:(.text+0x64c): undefined reference to `dt_hash_spinlock'

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation
kernel test robot Jan. 26, 2018, 11:26 p.m. UTC | #5
Hi Chintan,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on robh/for-next]
[also build test ERROR on v4.15-rc9 next-20180126]
[cannot apply to glikely/devicetree/next]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Chintan-Pandya/of-use-hash-based-search-in-of_find_node_by_phandle/20180127-043814
base:   https://git.kernel.org/pub/scm/linux/kernel/git/robh/linux.git for-next
config: sparc-defconfig (attached as .config)
compiler: sparc-linux-gcc (GCC) 7.2.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=sparc 

All errors (new ones prefixed by >>):

   drivers/of/base.o: In function `of_find_node_by_phandle':
>> base.c:(.text+0x1ac): undefined reference to `dt_hash_table'
   base.c:(.text+0x1b4): undefined reference to `dt_hash_table'

Frank Rowand Jan. 29, 2018, 11:23 p.m. UTC | #6
Hi Chintan,

On 01/26/18 00:31, Chintan Pandya wrote:
> of_find_node_by_phandle() takes a long time (~1 ms per
> call) to find the right node when the intended device is
> deep in the fdt. The reason is that we search for each
> device serially in the fdt. See this,
> 
> struct device_node *__of_find_all_nodes(struct device_node *prev)
> {
>         struct device_node *np;
>         if (!prev) {
>                 np = of_root;
>         } else if (prev->child) {
>                 np = prev->child;
>         } else {
>                 /* Walk back up looking for a sibling, or the end of the structure */
>                 np = prev;
>                 while (np->parent && !np->sibling)
>                         np = np->parent;
>                 np = np->sibling; /* Might be null at the end of the tree */
>         }
>         return np;
> }
> 
> #define for_each_of_allnodes_from(from, dn) \
>         for (dn = __of_find_all_nodes(from); dn; dn = __of_find_all_nodes(dn))
> #define for_each_of_allnodes(dn) for_each_of_allnodes_from(NULL, dn)
> 
> Implement the device-phandle relation in a hash table so
> that lookup can be faster, irrespective of where the
> device is defined in the DT.
> 
> There are ~6.7k calls to of_find_node_by_phandle(), and
> the total improvement observed during boot is 400 ms.
> 
> Signed-off-by: Chintan Pandya <cpandya@codeaurora.org>
> ---
>  drivers/of/base.c  |  8 ++++++--

< snip >
 

I asked some questions in the version 1 thread and did not get
answers.  I am copying the 3 questions here.

(1)

>>>
>> Please give me a pointer to the code that is doing
>> this search.
>>
>> -Frank
> You can refer include/linux/of.h
>
> #define for_each_of_allnodes_from(from, dn) \
>         for (dn = __of_find_all_nodes(from); dn; dn = __of_find_all_nodes(dn))
> #define for_each_of_allnodes(dn) for_each_of_allnodes_from(NULL, dn)
>
> where __of_find_all_nodes() does
>
> struct device_node *__of_find_all_nodes(struct device_node *prev)
> {
>         struct device_node *np;
>         if (!prev) {
>                 np = of_root;
>         } else if (prev->child) {
>                 np = prev->child;
>         } else {
>                 /* Walk back up looking for a sibling, or the end of the structure */
>                 np = prev;
>                 while (np->parent && !np->sibling)
>                         np = np->parent;
>                 np = np->sibling; /* Might be null at the end of the tree */
>         }
>         return np;
> }
>

Let me restate my question.

Can you point me to the driver code that is invoking
the search?


(2)

And also the .dts devicetree source file that you are seeing
large overhead with.


(3) -- this one is less important, but if the info is easily
       available to you

Sorry about dribbling out questions instead of all at once....

What is the hardware you are testing this on?
Processor?
Cache size?
Memory size?
Processor frequency?
Any other attribute of the system that will help me understand
the boot performance you are seeing?


Thanks,

Frank
Chintan Pandya Jan. 30, 2018, 8:04 a.m. UTC | #7
> (1)
>
> Can you point me to the driver code that is invoking
> the search?
There are many locations. A few of them:
https://source.codeaurora.org/quic/la/kernel/msm-4.9/tree/drivers/of/irq.c?h=msm-4.9#n214
https://source.codeaurora.org/quic/la/kernel/msm-4.9/tree/drivers/irqchip/irq-gic-v3.c?h=msm-4.9#n1107
https://source.codeaurora.org/quic/la/kernel/msm-4.9/tree/drivers/clk/msm/msm-clock-controller.c?h=msm-4.9#n492
>
> (2)
>
> And also the .dts devicetree source file that you are seeing
> large overhead with.
SDM670 DTS tree starts here.
https://source.codeaurora.org/quic/la/kernel/msm-4.9/tree/arch/arm64/boot/dts/qcom/sdm670.dtsi?h=msm-4.9
>
>
> (3) -- this one is less important, but if the info is easily
>         available to you
>
> Sorry about dribbling out questions instead of all at once....
>
> What is the hardware you are testing this on?
SDM670
> Processor?
Kryo-300 Silver
> Cache size?
 From DT,
L1 32KB (per CPU)
L2 128KB (per CPU)
L3 1MB (total)
> Memory size?
6GB
> Processor frequency?
Max 1.7GHz for core 0. Not sure about boot time frequency.
> Any other attribute of the system that will help me understand
> the boot performance you are seeing?
I'm not able to profile of_find_node_by_phandle() specifically, as timers
are not up by then. So I'm just observing overall boot time for comparison.

My recent results were taken with debug_defconfig, which has a lot of
performance-slowing code. So the gap between the base build and the build
with the test patches would be larger than on an actual production build.

Thanks,
Chintan
Frank Rowand Jan. 30, 2018, 6:59 p.m. UTC | #8
On 01/30/18 00:04, Chintan Pandya wrote:
>> (1)
>>
>> Can you point me to the driver code that is invoking
>> the search?
> There are many locations. Few of them being,
> https://source.codeaurora.org/quic/la/kernel/msm-4.9/tree/drivers/of/irq.c?h=msm-4.9#n214
> https://source.codeaurora.org/quic/la/kernel/msm-4.9/tree/drivers/irqchip/irq-gic-v3.c?h=msm-4.9#n1107
> https://source.codeaurora.org/quic/la/kernel/msm-4.9/tree/drivers/clk/msm/msm-clock-controller.c?h=msm-4.9#n492
>>
>> (2)
>>
>> And also the .dts devicetree source file that you are seeing
>> large overhead with.
> SDM670 DTS tree starts here.
> https://source.codeaurora.org/quic/la/kernel/msm-4.9/tree/arch/arm64/boot/dts/qcom/sdm670.dtsi?h=msm-4.9

Thanks, I'm starting to get a picture of what you are facing.

The file you point at is a .dtsi, not a .dts.  There are quite a
few sdm670*.dts files in that tree.  Which one corresponds to
the system you are working on?

I picked a random one (sdm670-cdp.dts), but I need to be looking
at the .dts that you are using to be able to ask reasonable
questions instead of poking around in the dark.

Also, when I clone that git tree there is not a valid HEAD.  I
picked what looked like a reasonable commit to checkout, but I
would prefer to be looking at the same exact source that you
are working with.  Can you give me the commit id  of the branch
that you are working on?

-Frank


>>
>>
>> (3) -- this one is less important, but if the info is easily
>>         available to you
>>
>> Sorry about dribbling out questions instead of all at once....
>>
>> What is the hardware you are testing this on?
> SDM670
>> Processor?
> Kryo-300 Silver
>> Cache size?
> From DT,
> L1 32KB (per CPU)
> L2 128KB (per CPU)
> L3 1MB (total)
>> Memory size?
> 6GB
>> Processor frequency?
> Max 1.7GHz for core 0. Not sure about boot time frequency.
>> Any other attribute of the system that will help me understand
>> the boot performance you are seeing?
> I'm not able to profile of_find_node_by_phandle() specifically, as timers
> are not up by then. So I'm just observing overall boot time for comparison.
> 
> My recent results were taken with debug_defconfig, which has a lot of
> performance-slowing code. So the gap between the base build and the build
> with the test patches would be larger than on an actual production build.
> 
> Thanks,
> Chintan
> 


Patch

diff --git a/drivers/of/base.c b/drivers/of/base.c
index 26618ba..bfbfa99 100644
--- a/drivers/of/base.c
+++ b/drivers/of/base.c
@@ -1005,10 +1005,14 @@  struct device_node *of_find_node_by_phandle(phandle handle)
 	if (!handle)
 		return NULL;
 
-	raw_spin_lock_irqsave(&devtree_lock, flags);
-	for_each_of_allnodes(np)
+	spin_lock(&dt_hash_spinlock);
+	hash_for_each_possible(dt_hash_table, np, hash, handle)
 		if (np->phandle == handle)
 			break;
+
+	spin_unlock(&dt_hash_spinlock);
+
+	raw_spin_lock_irqsave(&devtree_lock, flags);
 	of_node_get(np);
 	raw_spin_unlock_irqrestore(&devtree_lock, flags);
 	return np;
diff --git a/drivers/of/fdt.c b/drivers/of/fdt.c
index 4675e5a..62a9a4c 100644
--- a/drivers/of/fdt.c
+++ b/drivers/of/fdt.c
@@ -33,6 +33,10 @@ 
 
 #include "of_private.h"
 
+static bool dt_hash_needs_init = true;
+DECLARE_HASHTABLE(dt_hash_table, DT_HASH_BITS);
+DEFINE_SPINLOCK(dt_hash_spinlock);
+
 /*
  * of_fdt_limit_memory - limit the number of regions in the /memory node
  * @limit: maximum entries
@@ -242,6 +246,20 @@  static void populate_properties(const void *blob,
 		pprev      = &pp->next;
 	}
 
+	/*
+	 * In 'dryrun = true' cases, np is some non-NULL junk. So, protect
+	 * against those cases.
+	 */
+	if (!dryrun && np->phandle) {
+		spin_lock(&dt_hash_spinlock);
+		if (dt_hash_needs_init) {
+			dt_hash_needs_init = false;
+			hash_init(dt_hash_table);
+		}
+		hash_add(dt_hash_table, &np->hash, np->phandle);
+		spin_unlock(&dt_hash_spinlock);
+	}
+
 	/* With version 0x10 we may not have the name property,
 	 * recreate it here from the unit name if absent
 	 */
diff --git a/include/linux/of.h b/include/linux/of.h
index d3dea1d..2e3ba84 100644
--- a/include/linux/of.h
+++ b/include/linux/of.h
@@ -25,6 +25,7 @@ 
 #include <linux/notifier.h>
 #include <linux/property.h>
 #include <linux/list.h>
+#include <linux/hashtable.h>
 
 #include <asm/byteorder.h>
 #include <asm/errno.h>
@@ -69,6 +70,7 @@  struct device_node {
 #endif
 	unsigned long _flags;
 	void	*data;
+	struct hlist_node hash;
 #if defined(CONFIG_SPARC)
 	const char *path_component_name;
 	unsigned int unique_id;
@@ -76,6 +78,10 @@  struct device_node {
 #endif
 };
 
+#define DT_HASH_BITS 6
+extern DECLARE_HASHTABLE(dt_hash_table, DT_HASH_BITS);
+extern spinlock_t dt_hash_spinlock;
+
 #define MAX_PHANDLE_ARGS 16
 struct of_phandle_args {
 	struct device_node *np;