Message ID | 20221017064950.2038-1-thunder.leizhen@huawei.com (mailing list archive) |
---|---|
Headers | show |
Series | kallsyms: Optimizes the performance of lookup symbols | expand |
On Mon, Oct 17, 2022 at 02:49:39PM +0800, Zhen Lei wrote: > Currently, to search for a symbol, we need to expand the symbols in > 'kallsyms_names' one by one, and then use the expanded string for > comparison. This is very slow. > > In fact, we can first compress the name being looked up and then use > it for comparison when traversing 'kallsyms_names'. > > This patch series optimizes the performance of function kallsyms_lookup_name(), > and function klp_find_object_symbol() in the livepatch module. Based on the > test results, the performance overhead is reduced to 5%. That is, the > performance of these functions is improved by 20 times. Stupid question, is a hash table in order? Luis
On 2022/10/19 20:01, Luis Chamberlain wrote: > On Mon, Oct 17, 2022 at 02:49:39PM +0800, Zhen Lei wrote: >> Currently, to search for a symbol, we need to expand the symbols in >> 'kallsyms_names' one by one, and then use the expanded string for >> comparison. This is very slow. >> >> In fact, we can first compress the name being looked up and then use >> it for comparison when traversing 'kallsyms_names'. >> >> This patch series optimizes the performance of function kallsyms_lookup_name(), >> and function klp_find_object_symbol() in the livepatch module. Based on the >> test results, the performance overhead is reduced to 5%. That is, the >> performance of these functions is improved by 20 times. > > Stupid question, is a hash table in order? No hash table. All symbols are arranged in ascending order of address. For example: cat /proc/kallsyms The addresses of all symbols are stored in kallsyms_addresses[], and names of all symbols are stored in kallsyms_names[]. The elements in these two arrays are in a one-to-one relationship. For any symbol, it has the same index in both arrays. Therefore, when we look up a symbolic name based on an address, we use a binary lookup. However, when we look up an address based on a symbol name, we can only traverse array kallsyms_names[] in sequence. I think the reason why hash is not used is to save memory. > > Luis > . >
On Wed, Oct 19, 2022 at 10:11:58PM +0800, Leizhen (ThunderTown) wrote: > > > On 2022/10/19 20:01, Luis Chamberlain wrote: > > On Mon, Oct 17, 2022 at 02:49:39PM +0800, Zhen Lei wrote: > >> Currently, to search for a symbol, we need to expand the symbols in > >> 'kallsyms_names' one by one, and then use the expanded string for > >> comparison. This is very slow. > >> > >> In fact, we can first compress the name being looked up and then use > >> it for comparison when traversing 'kallsyms_names'. > >> > >> This patch series optimizes the performance of function kallsyms_lookup_name(), > >> and function klp_find_object_symbol() in the livepatch module. Based on the > >> test results, the performance overhead is reduced to 5%. That is, the > >> performance of these functions is improved by 20 times. > > > > Stupid question, is a hash table in order? > > No hash table. > > All symbols are arranged in ascending order of address. For example: cat /proc/kallsyms > > The addresses of all symbols are stored in kallsyms_addresses[], and names of all symbols > are stored in kallsyms_names[]. The elements in these two arrays are in a one-to-one > relationship. For any symbol, it has the same index in both arrays. > > Therefore, when we look up a symbolic name based on an address, we use a binary lookup. > However, when we look up an address based on a symbol name, we can only traverse array > kallsyms_names[] in sequence. I think the reason why hash is not used is to save memory. This answers how we don't use a hash table, the question was *should* we use one? Luis
On 2022/10/26 1:53, Luis Chamberlain wrote: > On Wed, Oct 19, 2022 at 10:11:58PM +0800, Leizhen (ThunderTown) wrote: >> >> >> On 2022/10/19 20:01, Luis Chamberlain wrote: >>> On Mon, Oct 17, 2022 at 02:49:39PM +0800, Zhen Lei wrote: >>>> Currently, to search for a symbol, we need to expand the symbols in >>>> 'kallsyms_names' one by one, and then use the expanded string for >>>> comparison. This is very slow. >>>> >>>> In fact, we can first compress the name being looked up and then use >>>> it for comparison when traversing 'kallsyms_names'. >>>> >>>> This patch series optimizes the performance of function kallsyms_lookup_name(), >>>> and function klp_find_object_symbol() in the livepatch module. Based on the >>>> test results, the performance overhead is reduced to 5%. That is, the >>>> performance of these functions is improved by 20 times. >>> >>> Stupid question, is a hash table in order? >> >> No hash table. >> >> All symbols are arranged in ascending order of address. For example: cat /proc/kallsyms >> >> The addresses of all symbols are stored in kallsyms_addresses[], and names of all symbols >> are stored in kallsyms_names[]. The elements in these two arrays are in a one-to-one >> relationship. For any symbol, it has the same index in both arrays. >> >> Therefore, when we look up a symbolic name based on an address, we use a binary lookup. >> However, when we look up an address based on a symbol name, we can only traverse array >> kallsyms_names[] in sequence. I think the reason why hash is not used is to save memory. > > This answers how we don't use a hash table, the question was *should* we > use one? I'm not the original author, and I can only answer now based on my understanding. Maybe the original author didn't think of the hash method, or he has weighed it out. Hash is a good solution if only performance is required and memory overhead is not considered. Using hash will increase the memory size by up to "4 * kallsyms_num_syms + 4 * ARRAY_SIZE(hashtable)" bytes, kallsyms_num_syms is about 1-2 million. Because I don't know what hash algorithm will be used, the cost of generating the hash value corresponding to the symbol name is unknown now. But I think it's gonna be small. But it definitely needs a simpler algorithm, the tool needs to implement the same hash algorithm. If the hash is not very uniform or ARRAY_SIZE(hashtable) is small, then my current approach still makes sense. So maybe hash can be deferred to the next phase of improvement. > > Luis > . >
On Wed, Oct 26, 2022 at 02:44:36PM +0800, Leizhen (ThunderTown) wrote: > On 2022/10/26 1:53, Luis Chamberlain wrote: > > This answers how we don't use a hash table, the question was *should* we > > use one? > > I'm not the original author, and I can only answer now based on my understanding. Maybe > the original author didn't think of the hash method, or he has weighed it out. > > Hash is a good solution if only performance is required and memory overhead is not > considered. Using hash will increase the memory size by up to "4 * kallsyms_num_syms + > 4 * ARRAY_SIZE(hashtable)" bytes, kallsyms_num_syms is about 1-2 million. > > Because I don't know what hash algorithm will be used, the cost of generating the > hash value corresponding to the symbol name is unknown now. But I think it's gonna > be small. But it definitely needs a simpler algorithm, the tool needs to implement > the same hash algorithm. For instance, you can look at evaluating if alloc_large_system_hash() would help. Luis
On 2022/10/27 3:03, Luis Chamberlain wrote: > On Wed, Oct 26, 2022 at 02:44:36PM +0800, Leizhen (ThunderTown) wrote: >> On 2022/10/26 1:53, Luis Chamberlain wrote: >>> This answers how we don't use a hash table, the question was *should* we >>> use one? >> >> I'm not the original author, and I can only answer now based on my understanding. Maybe >> the original author didn't think of the hash method, or he has weighed it out. >> >> Hash is a good solution if only performance is required and memory overhead is not >> considered. Using hash will increase the memory size by up to "4 * kallsyms_num_syms + >> 4 * ARRAY_SIZE(hashtable)" bytes, kallsyms_num_syms is about 1-2 million. >> >> Because I don't know what hash algorithm will be used, the cost of generating the >> hash value corresponding to the symbol name is unknown now. But I think it's gonna >> be small. But it definitely needs a simpler algorithm, the tool needs to implement >> the same hash algorithm. > > For instance, you can look at evaluating if alloc_large_system_hash() would help. OK, I found the right hash function. In this way, the tool does not need to consider the byte order. include/linux/stringhash.h /* * Version 1: one byte at a time. Example of use: * * unsigned long hash = init_name_hash; * while (*p) * hash = partial_name_hash(tolower(*p++), hash); * hash = end_name_hash(hash); > > Luis > . >
On 2022/10/27 11:26, Leizhen (ThunderTown) wrote: > > > On 2022/10/27 3:03, Luis Chamberlain wrote: >> On Wed, Oct 26, 2022 at 02:44:36PM +0800, Leizhen (ThunderTown) wrote: >>> On 2022/10/26 1:53, Luis Chamberlain wrote: >>>> This answers how we don't use a hash table, the question was *should* we >>>> use one? >>> >>> I'm not the original author, and I can only answer now based on my understanding. Maybe >>> the original author didn't think of the hash method, or he has weighed it out. >>> >>> Hash is a good solution if only performance is required and memory overhead is not >>> considered. Using hash will increase the memory size by up to "4 * kallsyms_num_syms + >>> 4 * ARRAY_SIZE(hashtable)" bytes, kallsyms_num_syms is about 1-2 million. Sorry, 1-2 million ==> 0.1~0.2 million >>> >>> Because I don't know what hash algorithm will be used, the cost of generating the >>> hash value corresponding to the symbol name is unknown now. But I think it's gonna >>> be small. But it definitely needs a simpler algorithm, the tool needs to implement >>> the same hash algorithm. >> >> For instance, you can look at evaluating if alloc_large_system_hash() would help. > > OK, I found the right hash function. In this way, the tool does not need to consider > the byte order. https://en.wikipedia.org/wiki/Jenkins_hash_function Let's go with jenkins_one_at_a_time_hash(), which looks simpler and doesn't even have to think about sizeof(long). It seems to be closest to our current needs. uint32_t jenkins_one_at_a_time_hash(const uint8_t* key, size_t length) { size_t i = 0; uint32_t hash = 0; while (i != length) { hash += key[i++]; hash += hash << 10; hash ^= hash >> 6; } hash += hash << 3; hash ^= hash >> 11; hash += hash << 15; return hash; } > > include/linux/stringhash.h > > /* > * Version 1: one byte at a time. Example of use: > * > * unsigned long hash = init_name_hash; > * while (*p) > * hash = partial_name_hash(tolower(*p++), hash); > * hash = end_name_hash(hash); > > >> >> Luis >> . >> >
On 2022/10/27 14:27, Leizhen (ThunderTown) wrote: > > > On 2022/10/27 11:26, Leizhen (ThunderTown) wrote: >> >> >> On 2022/10/27 3:03, Luis Chamberlain wrote: >>> On Wed, Oct 26, 2022 at 02:44:36PM +0800, Leizhen (ThunderTown) wrote: >>>> On 2022/10/26 1:53, Luis Chamberlain wrote: >>>>> This answers how we don't use a hash table, the question was *should* we >>>>> use one? >>>> >>>> I'm not the original author, and I can only answer now based on my understanding. Maybe >>>> the original author didn't think of the hash method, or he has weighed it out. >>>> >>>> Hash is a good solution if only performance is required and memory overhead is not >>>> considered. Using hash will increase the memory size by up to "4 * kallsyms_num_syms + >>>> 4 * ARRAY_SIZE(hashtable)" bytes, kallsyms_num_syms is about 1-2 million. > > Sorry, 1-2 million ==> 0.1~0.2 million > >>>> >>>> Because I don't know what hash algorithm will be used, the cost of generating the >>>> hash value corresponding to the symbol name is unknown now. But I think it's gonna >>>> be small. But it definitely needs a simpler algorithm, the tool needs to implement >>>> the same hash algorithm. >>> >>> For instance, you can look at evaluating if alloc_large_system_hash() would help. >> The following three hash algorithms are compared. The kernel is compiled by defconfig on arm64. |---------------------------------------------------------------------------------------| | | hash &= HASH_TABLE_SIZE - 1 | | | number of conflicts >= 1000 | |---------------------------------------------------------------------------------------| | ARRAY_SIZE(hash_table) | crc16 | jhash_one_at_a_time | string hash_32 | |---------------------------------------------------------------------------------------| | | 345b: 3905 | 0d40: 1013 | 4a4c: 6548 | | | 35bb: 1016 | 35ce: 6549 | 883a: 1015 | | 0x10000 | 385b: 6548 | 4440: 19126 | d05f: 19129 | | | f0ba: 19127 | 7ebe: 3916 | ecda: 3903 | |---------------------------------------------------------------------------------------| | | 0ba: 19168 | 440: 19165 | 05f: 19170 | | | 45b: 3955 | 5ce: 6577 | 83a: 1066 | | 0x1000 | 5bb: 1069 | d40: 1052 | a4c: 6609 | | | 85b: 6582 | ebe: 3938 | cda: 3924 | |---------------------------------------------------------------------------------------| Based on the above test results, I conclude that: 1. For the worst-case scenario, the three algorithms are not much different. But the kernel only implements crc16 and string hash_32. The latter is not processed byte-by-byte, so it is coupled with byte order and sizeof(long). So crc16 is the best choice. 2. For the worst-case scenario, there are almost 19K strings are mapped to the same hash value,just over 1/10 of the total. And with my current compression-then-comparison approach, it's 25-30 times faster. So there's still a need for my current approach, and they can be combined. if (nr_conflicts(key) >= CONST_N) { newname = compress(name); for_each_name_in_slot(key): compare(new_name) } else { for_each_name_in_slot(key): compare(name) } Above CONST_N can be roughly calculated: time_of_compress(name) + N * time_of_compare(new_name) <= N * time_of_compare(name) 3. For the worst-case scenario, there is no obvious difference between ARRAY_SIZE(hash_table) 0x10000 and 0x1000. So ARRAY_SIZE(hash_table)=0x1000 is enough. Statistic information: |------------------------------------------------------| | nr_conflicts(key) | ARRAY_SIZE(hash_table) | |------------------------------------------------------| | <= ? | 0x1000 | 0x10000 | |------------------------------------------------------| | 0 | 0 | 7821 | | 20 | 19 | 57375 | | 40 | 2419 | 124 | | 60 | 1343 | 70 | | 80 | 149 | 73 | | 100 | 38 | 49 | | 200 | 108 | 16 | | 400 | 14 | 2 | | 600 | 2 | 2 | | 800 | 0 | 0 | | 1000 | 0 | 0 | | 100000 | 4 | 4 | |------------------------------------------------------| Also, I re-calculated: Using hash will increase the memory size by up to "6 * kallsyms_num_syms + 4 * ARRAY_SIZE(hashtable)" |---- What I said earlier was 4 The increased size is close to 1 MB if CONFIG_KALLSYMS_ALL=y. Hi, Luis: For the reasons of the above-mentioned second conclusion. And except for patches 4-6, even if only the hash method is used, other patches and option "--lto-clang" in patch 6/11 are also needed. If you don't mind, I'd like to use hash at the next stage. The current patch set is already huge. >> OK, I found the right hash function. In this way, the tool does not need to consider >> the byte order. > > https://en.wikipedia.org/wiki/Jenkins_hash_function > > Let's go with jenkins_one_at_a_time_hash(), which looks simpler and doesn't even > have to think about sizeof(long). It seems to be closest to our current needs. > > uint32_t jenkins_one_at_a_time_hash(const uint8_t* key, size_t length) { > size_t i = 0; > uint32_t hash = 0; > > while (i != length) { > hash += key[i++]; > hash += hash << 10; > hash ^= hash >> 6; > } > hash += hash << 3; > hash ^= hash >> 11; > hash += hash << 15; > > return hash; > } > >> >> include/linux/stringhash.h >> >> /* >> * Version 1: one byte at a time. Example of use: >> * >> * unsigned long hash = init_name_hash; >> * while (*p) >> * hash = partial_name_hash(tolower(*p++), hash); >> * hash = end_name_hash(hash); >> >> >>> >>> Luis >>> . >>> >> >
> >> On 2022/10/27 3:03, Luis Chamberlain wrote: > >>> On Wed, Oct 26, 2022 at 02:44:36PM +0800, Leizhen (ThunderTown) wrote: > >>>> On 2022/10/26 1:53, Luis Chamberlain wrote: > >>>>> This answers how we don't use a hash table, the question was *should* we > >>>>> use one? (Probably brainfart) thought... Is the current table (effectively) a sorted list of strings? So the lookup is a binary chop - so O(log(n)). But your hashes are having 'trouble' stopping one chain being very long? So a linear search of that hash chain is slow. In fact that sort of hashed lookup in O(n). What if the symbols were sorted by hash then name? (Without putting the hash into each entry.) Then the code could do a binary chop search over the symbols with the same hash value. The additional data is then an array of symbol numbers indexed by the hash - 32 bits for each bucket. If the hash table has 0x1000 entries it saves 12 compares. (All of which are likely to be data cache misses.) If you add the hash to each table entry then you can do a binary chop search for the hash itself. While this is the same search as is done for the strings the comparison (just a number) will be faster than a string compare. David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)
On 2022/10/29 20:49, David Laight wrote: >>>> On 2022/10/27 3:03, Luis Chamberlain wrote: >>>>> On Wed, Oct 26, 2022 at 02:44:36PM +0800, Leizhen (ThunderTown) wrote: >>>>>> On 2022/10/26 1:53, Luis Chamberlain wrote: >>>>>>> This answers how we don't use a hash table, the question was *should* we >>>>>>> use one? > > (Probably brainfart) thought... > > Is the current table (effectively) a sorted list of strings? > So the lookup is a binary chop - so O(log(n)). Currently not sorted. > > But your hashes are having 'trouble' stopping one chain > being very long? > So a linear search of that hash chain is slow. > In fact that sort of hashed lookup in O(n). You've analyzed it very well. The hash method is not good for sorting names and then searching in binary mode. I figured it out when I was working on the design these days. Current Implementation: --------------------------------------- | idx | addresses | markers | names | --------------------------------------- | 0 | addr0 | | name0 | | 1 | addr1 | | name1 | | ... | addrx | [0] | namex | | 255 | addrx | | name255| --------------------------------------- | 256 | addr256 | | name256| | ... | addrx | [1] | namex | | 511 | addr511 | | name511| --------------------------------------- markers[0] = offset_of(name0) markers[1] = offset_of(name256) 1. Find name by address binary search addresses[], get idx, traverse names[] from markers[idx>>8] to markers[(idx>>8) + 1], return name 2. Find address by name traverse names[], get idx, return addresses[idx] Hash Implementation: Add two new arrays: hash_table[] and names_offsets[] ----------------------------------------------------------------- | key | hash_table | names_offsets | |---------------------------------------------------------------| | 0 | names_offsets[key=0] | offsets of all names with key=0 | | 1 | names_offsets[key=1] | offsets of all names with key=1 | | ... | ... | offsets of all names with key=k | |---------------------------------------------------------------| hash_table[0] = 0 hash_table[1] = hash_table[0] + sizeof(names_offsets[0]) * number_of_names(key=0) hash_table[2] = hash_table[1] + sizeof(names_offsets[0]) * number_of_names(key=1) 1. Find address by name hash name, get key, traverse names_offsets[] from index=hash_table[key] to index=hash_table[key+1], get the offset of name in names[]. binary search markers[], get index, then traverse names[] from markers[index] to markers[index + 1], until match the offset of name, return addresses[idx]. 2. Find address by name No change. Sorted names Implementation: Add two new arrays: offsets_of_addr_to_name[] and offsets_of_name[] offsets_of_addr_to_name[i] = offset of name i in names[] offsets_of_name[i] = offset of sorted name i in names[] 1. Find name by address binary search addresses[], get idx, return names[offsets_of_addr_to_name[idx]] 2. Find address by name binary search offsets_of_name[], get idx, return addresses[idx] > > What if the symbols were sorted by hash then name? > (Without putting the hash into each entry.) > Then the code could do a binary chop search over > the symbols with the same hash value. > The additional data is then an array of symbol numbers > indexed by the hash - 32 bits for each bucket. > > If the hash table has 0x1000 entries it saves 12 compares. > (All of which are likely to be data cache misses.) > > If you add the hash to each table entry then you can do > a binary chop search for the hash itself. > While this is the same search as is done for the strings > the comparison (just a number) will be faster than a > string compare. > > David > > - > Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK > Registration No: 1397386 (Wales) >
On 2022/10/29 16:10, Leizhen (ThunderTown) wrote: > > > On 2022/10/27 14:27, Leizhen (ThunderTown) wrote: >> >> >> On 2022/10/27 11:26, Leizhen (ThunderTown) wrote: >>> >>> >>> On 2022/10/27 3:03, Luis Chamberlain wrote: >>>> On Wed, Oct 26, 2022 at 02:44:36PM +0800, Leizhen (ThunderTown) wrote: >>>>> On 2022/10/26 1:53, Luis Chamberlain wrote: >>>>>> This answers how we don't use a hash table, the question was *should* we >>>>>> use one? >>>>> >>>>> I'm not the original author, and I can only answer now based on my understanding. Maybe >>>>> the original author didn't think of the hash method, or he has weighed it out. >>>>> >>>>> Hash is a good solution if only performance is required and memory overhead is not >>>>> considered. Using hash will increase the memory size by up to "4 * kallsyms_num_syms + >>>>> 4 * ARRAY_SIZE(hashtable)" bytes, kallsyms_num_syms is about 1-2 million. >> >> Sorry, 1-2 million ==> 0.1~0.2 million >> >>>>> >>>>> Because I don't know what hash algorithm will be used, the cost of generating the >>>>> hash value corresponding to the symbol name is unknown now. But I think it's gonna >>>>> be small. But it definitely needs a simpler algorithm, the tool needs to implement >>>>> the same hash algorithm. >>>> >>>> For instance, you can look at evaluating if alloc_large_system_hash() would help. >>> > > The following three hash algorithms are compared. The kernel is compiled by defconfig > on arm64. > > |---------------------------------------------------------------------------------------| > | | hash &= HASH_TABLE_SIZE - 1 | > | | number of conflicts >= 1000 | > |---------------------------------------------------------------------------------------| > | ARRAY_SIZE(hash_table) | crc16 | jhash_one_at_a_time | string hash_32 | > |---------------------------------------------------------------------------------------| > | | 345b: 3905 | 0d40: 1013 | 4a4c: 6548 | > | | 35bb: 1016 | 35ce: 6549 | 883a: 1015 | > | 0x10000 | 385b: 6548 | 4440: 19126 | d05f: 19129 | > | | f0ba: 19127 | 7ebe: 3916 | ecda: 3903 | > |---------------------------------------------------------------------------------------| > | | 0ba: 19168 | 440: 19165 | 05f: 19170 | > | | 45b: 3955 | 5ce: 6577 | 83a: 1066 | > | 0x1000 | 5bb: 1069 | d40: 1052 | a4c: 6609 | > | | 85b: 6582 | ebe: 3938 | cda: 3924 | > |---------------------------------------------------------------------------------------| > > Based on the above test results, I conclude that: > 1. For the worst-case scenario, the three algorithms are not much different. But the kernel > only implements crc16 and string hash_32. The latter is not processed byte-by-byte, so > it is coupled with byte order and sizeof(long). So crc16 is the best choice. > 2. For the worst-case scenario, there are almost 19K strings are mapped to the same hash > value,just over 1/10 of the total. And with my current compression-then-comparison > approach, it's 25-30 times faster. So there's still a need for my current approach, and > they can be combined. > if (nr_conflicts(key) >= CONST_N) { > newname = compress(name); > for_each_name_in_slot(key): compare(new_name) > } else { > for_each_name_in_slot(key): compare(name) > } > > Above CONST_N can be roughly calculated: > time_of_compress(name) + N * time_of_compare(new_name) <= N * time_of_compare(name) > 3. For the worst-case scenario, there is no obvious difference between ARRAY_SIZE(hash_table) > 0x10000 and 0x1000. So ARRAY_SIZE(hash_table)=0x1000 is enough. > Statistic information: > |------------------------------------------------------| > | nr_conflicts(key) | ARRAY_SIZE(hash_table) | > |------------------------------------------------------| > | <= ? | 0x1000 | 0x10000 | > |------------------------------------------------------| > | 0 | 0 | 7821 | > | 20 | 19 | 57375 | > | 40 | 2419 | 124 | > | 60 | 1343 | 70 | > | 80 | 149 | 73 | > | 100 | 38 | 49 | > | 200 | 108 | 16 | > | 400 | 14 | 2 | > | 600 | 2 | 2 | > | 800 | 0 | 0 | > | 1000 | 0 | 0 | > | 100000 | 4 | 4 | > |------------------------------------------------------| > > > Also, I re-calculated: > Using hash will increase the memory size by up to "6 * kallsyms_num_syms + 4 * ARRAY_SIZE(hashtable)" > |---- What I said earlier was 4 > The increased size is close to 1 MB if CONFIG_KALLSYMS_ALL=y. > > Hi, Luis: > For the reasons of the above-mentioned second conclusion. And except for patches 4-6, > even if only the hash method is used, other patches and option "--lto-clang" in patch 6/11 > are also needed. If you don't mind, I'd like to use hash at the next stage. The current > patch set is already huge. I just had an update in response to David Laight's email. The hash solution is like a centrist. It doesn't seem very feasible. Now, we need to make a decision. Choose one of the two: 1. Continue with my current approach. Improve the average performance of kallsyms_lookup_name() by 20 to 30 times. The memory overhead is increased by: arm64 (defconfig): 73.5KiB and 4.0% if CONFIG_KALLSYMS_ALL=y. 19.8KiB and 2.8% if CONFIG_KALLSYMS_ALL=n. x86 (defconfig): 49.0KiB and 3.0% if CONFIG_KALLSYMS_ALL=y. 16.8KiB and 2.3% if CONFIG_KALLSYMS_ALL=n. 2. Sort names, binary search (The static function causes duplicate names. Additional work is required) 2^18=262144, only up to 18 symbol expansions and comparisons are required. The performance is definitely excellent, although I haven't tested it yet. The memory overhead is increased by: 6 * kallsyms_num_syms arm64 (defconfig): 1MiB if CONFIG_KALLSYMS_ALL=y. 362KiB if CONFIG_KALLSYMS_ALL=n. x86 (defconfig): 770KiB if CONFIG_KALLSYMS_ALL=y. 356KiB if CONFIG_KALLSYMS_ALL=n. > > >>> OK, I found the right hash function. In this way, the tool does not need to consider >>> the byte order. >> >> https://en.wikipedia.org/wiki/Jenkins_hash_function >> >> Let's go with jenkins_one_at_a_time_hash(), which looks simpler and doesn't even >> have to think about sizeof(long). It seems to be closest to our current needs. >> >> uint32_t jenkins_one_at_a_time_hash(const uint8_t* key, size_t length) { >> size_t i = 0; >> uint32_t hash = 0; >> >> while (i != length) { >> hash += key[i++]; >> hash += hash << 10; >> hash ^= hash >> 6; >> } >> hash += hash << 3; >> hash ^= hash >> 11; >> hash += hash << 15; >> >> return hash; >> } >> >>> >>> include/linux/stringhash.h >>> >>> /* >>> * Version 1: one byte at a time. Example of use: >>> * >>> * unsigned long hash = init_name_hash; >>> * while (*p) >>> * hash = partial_name_hash(tolower(*p++), hash); >>> * hash = end_name_hash(hash); >>> >>> >>>> >>>> Luis >>>> . >>>> >>> >> >
On 2022/10/31 12:55, Leizhen (ThunderTown) wrote: > > > On 2022/10/29 16:10, Leizhen (ThunderTown) wrote: >> >> >> On 2022/10/27 14:27, Leizhen (ThunderTown) wrote: >>> >>> >>> On 2022/10/27 11:26, Leizhen (ThunderTown) wrote: >>>> >>>> >>>> On 2022/10/27 3:03, Luis Chamberlain wrote: >>>>> On Wed, Oct 26, 2022 at 02:44:36PM +0800, Leizhen (ThunderTown) wrote: >>>>>> On 2022/10/26 1:53, Luis Chamberlain wrote: >>>>>>> This answers how we don't use a hash table, the question was *should* we >>>>>>> use one? >>>>>> >>>>>> I'm not the original author, and I can only answer now based on my understanding. Maybe >>>>>> the original author didn't think of the hash method, or he has weighed it out. >>>>>> >>>>>> Hash is a good solution if only performance is required and memory overhead is not >>>>>> considered. Using hash will increase the memory size by up to "4 * kallsyms_num_syms + >>>>>> 4 * ARRAY_SIZE(hashtable)" bytes, kallsyms_num_syms is about 1-2 million. >>> >>> Sorry, 1-2 million ==> 0.1~0.2 million >>> >>>>>> >>>>>> Because I don't know what hash algorithm will be used, the cost of generating the >>>>>> hash value corresponding to the symbol name is unknown now. But I think it's gonna >>>>>> be small. But it definitely needs a simpler algorithm, the tool needs to implement >>>>>> the same hash algorithm. >>>>> >>>>> For instance, you can look at evaluating if alloc_large_system_hash() would help. >>>> >> >> The following three hash algorithms are compared. The kernel is compiled by defconfig >> on arm64. >> >> |---------------------------------------------------------------------------------------| >> | | hash &= HASH_TABLE_SIZE - 1 | >> | | number of conflicts >= 1000 | >> |---------------------------------------------------------------------------------------| >> | ARRAY_SIZE(hash_table) | crc16 | jhash_one_at_a_time | string hash_32 | >> |---------------------------------------------------------------------------------------| >> | | 345b: 3905 | 0d40: 1013 | 4a4c: 6548 | >> | | 35bb: 1016 | 35ce: 6549 | 883a: 1015 | >> | 0x10000 | 385b: 6548 | 4440: 19126 | d05f: 19129 | >> | | f0ba: 19127 | 7ebe: 3916 | ecda: 3903 | >> |---------------------------------------------------------------------------------------| >> | | 0ba: 19168 | 440: 19165 | 05f: 19170 | >> | | 45b: 3955 | 5ce: 6577 | 83a: 1066 | >> | 0x1000 | 5bb: 1069 | d40: 1052 | a4c: 6609 | >> | | 85b: 6582 | ebe: 3938 | cda: 3924 | >> |---------------------------------------------------------------------------------------| >> >> Based on the above test results, I conclude that: >> 1. For the worst-case scenario, the three algorithms are not much different. But the kernel >> only implements crc16 and string hash_32. The latter is not processed byte-by-byte, so >> it is coupled with byte order and sizeof(long). So crc16 is the best choice. >> 2. For the worst-case scenario, there are almost 19K strings are mapped to the same hash >> value,just over 1/10 of the total. And with my current compression-then-comparison >> approach, it's 25-30 times faster. So there's still a need for my current approach, and >> they can be combined. >> if (nr_conflicts(key) >= CONST_N) { >> newname = compress(name); >> for_each_name_in_slot(key): compare(new_name) >> } else { >> for_each_name_in_slot(key): compare(name) >> } >> >> Above CONST_N can be roughly calculated: >> time_of_compress(name) + N * time_of_compare(new_name) <= N * time_of_compare(name) >> 3. For the worst-case scenario, there is no obvious difference between ARRAY_SIZE(hash_table) >> 0x10000 and 0x1000. So ARRAY_SIZE(hash_table)=0x1000 is enough. >> Statistic information: >> |------------------------------------------------------| >> | nr_conflicts(key) | ARRAY_SIZE(hash_table) | >> |------------------------------------------------------| >> | <= ? | 0x1000 | 0x10000 | >> |------------------------------------------------------| >> | 0 | 0 | 7821 | >> | 20 | 19 | 57375 | >> | 40 | 2419 | 124 | >> | 60 | 1343 | 70 | >> | 80 | 149 | 73 | >> | 100 | 38 | 49 | >> | 200 | 108 | 16 | >> | 400 | 14 | 2 | >> | 600 | 2 | 2 | >> | 800 | 0 | 0 | >> | 1000 | 0 | 0 | >> | 100000 | 4 | 4 | >> |------------------------------------------------------| >> >> >> Also, I re-calculated: >> Using hash will increase the memory size by up to "6 * kallsyms_num_syms + 4 * ARRAY_SIZE(hashtable)" >> |---- What I said earlier was 4 >> The increased size is close to 1 MB if CONFIG_KALLSYMS_ALL=y. >> >> Hi, Luis: >> For the reasons of the above-mentioned second conclusion. And except for patches 4-6, >> even if only the hash method is used, other patches and option "--lto-clang" in patch 6/11 >> are also needed. If you don't mind, I'd like to use hash at the next stage. The current >> patch set is already huge. > > I just had an update in response to David Laight's email. The hash solution is like > a centrist. It doesn't seem very feasible. > > Now, we need to make a decision. Choose one of the two: > 1. Continue with my current approach. Improve the average performance of > kallsyms_lookup_name() by 20 to 30 times. The memory overhead is increased by: > arm64 (defconfig): > 73.5KiB and 4.0% if CONFIG_KALLSYMS_ALL=y. > 19.8KiB and 2.8% if CONFIG_KALLSYMS_ALL=n. > x86 (defconfig): > 49.0KiB and 3.0% if CONFIG_KALLSYMS_ALL=y. > 16.8KiB and 2.3% if CONFIG_KALLSYMS_ALL=n. > 2. Sort names, binary search (The static function causes duplicate names. Additional work is required) > 2^18=262144, only up to 18 symbol expansions and comparisons are required. > The performance is definitely excellent, although I haven't tested it yet. > The memory overhead is increased by: 6 * kallsyms_num_syms > arm64 (defconfig): > 1MiB if CONFIG_KALLSYMS_ALL=y. > 362KiB if CONFIG_KALLSYMS_ALL=n. > x86 (defconfig): > 770KiB if CONFIG_KALLSYMS_ALL=y. > 356KiB if CONFIG_KALLSYMS_ALL=n. > Preliminary Test Results: (On Qemu arm64) [ 73.049249] kallsyms_selftest: kallsyms_lookup_name() looked up 151880 symbols [ 73.049331] kallsyms_selftest: The time spent on each symbol is (ns): min=1088, max=46848, avg=6629 > > > >> >> >>>> OK, I found the right hash function. In this way, the tool does not need to consider >>>> the byte order. >>> >>> https://en.wikipedia.org/wiki/Jenkins_hash_function >>> >>> Let's go with jenkins_one_at_a_time_hash(), which looks simpler and doesn't even >>> have to think about sizeof(long). It seems to be closest to our current needs. >>> >>> uint32_t jenkins_one_at_a_time_hash(const uint8_t* key, size_t length) { >>> size_t i = 0; >>> uint32_t hash = 0; >>> >>> while (i != length) { >>> hash += key[i++]; >>> hash += hash << 10; >>> hash ^= hash >> 6; >>> } >>> hash += hash << 3; >>> hash ^= hash >> 11; >>> hash += hash << 15; >>> >>> return hash; >>> } >>> >>>> >>>> include/linux/stringhash.h >>>> >>>> /* >>>> * Version 1: one byte at a time. Example of use: >>>> * >>>> * unsigned long hash = init_name_hash; >>>> * while (*p) >>>> * hash = partial_name_hash(tolower(*p++), hash); >>>> * hash = end_name_hash(hash); >>>> >>>> >>>>> >>>>> Luis >>>>> . >>>>> >>>> >>> >> >
On 2022/10/31 23:04, Leizhen (ThunderTown) wrote: > Now, we need to make a decision. Choose one of the two: > 1. Continue with my current approach. Improve the average performance of > kallsyms_lookup_name() by 20 to 30 times. The memory overhead is increased by: > arm64 (defconfig): > 73.5KiB and 4.0% if CONFIG_KALLSYMS_ALL=y. > 19.8KiB and 2.8% if CONFIG_KALLSYMS_ALL=n. > x86 (defconfig): > 49.0KiB and 3.0% if CONFIG_KALLSYMS_ALL=y. > 16.8KiB and 2.3% if CONFIG_KALLSYMS_ALL=n. > 2. Sort names, binary search (The static function causes duplicate names. Additional work is required) > 2^18=262144, only up to 18 symbol expansions and comparisons are required. > The performance is definitely excellent, although I haven't tested it yet. > The memory overhead is increased by: 6 * kallsyms_num_syms > arm64 (defconfig): > 1MiB if CONFIG_KALLSYMS_ALL=y. > 362KiB if CONFIG_KALLSYMS_ALL=n. > x86 (defconfig): > 770KiB if CONFIG_KALLSYMS_ALL=y. > 356KiB if CONFIG_KALLSYMS_ALL=n. Hi, Luis: I've implemented v8 based on method 2(Sort names, binary search). The memory overhead is increased by: 3 * kallsyms_num_syms. kallsyms_offsets_of_names[] is not added in v8 because it does not help much in performance improvement, so save (3 * kallsyms_num_syms) bytes. For details about the performance data, please see the commit message.