diff mbox series

Use a real non-cryptographic hash algorithm

Message ID CAEtFKsv=sEe4dTHWXCSV=EfxjGgzEX4wfvszg4RS7AKc31_KSA@mail.gmail.com (mailing list archive)
State Rejected
Delegated to: Herbert Xu
Headers show
Series Use a real non-cryptographic hash algorithm | expand

Commit Message

Devin Hussey Nov. 15, 2018, 2:27 p.m. UTC
From 502cbd61e386eb014035df66b032e90a9de926b7 Mon Sep 17 00:00:00 2001
From: Devin Hussey <husseydevin@gmail.com>
Date: Thu, 15 Nov 2018 09:12:24 -0500
Subject: [PATCH] Use a real non-cryptographic hash algorithm

dash previously used a modified version of the LoseLose algorithm.

LoseLose is a TERRIBLE algorithm.

It has terrible distribution, leaving many hash entries unused.

However, more importantly, it is incredibly easy to make a
collision; even something as simple as rearranging the letters.

For example. these all produce the hash 1652:

Hello
Holle
Hlleo
Hoell
Haxed\n
HASH\tBAD
Hater

Collsions in hash tables are very bad because it requires
looking through a linked list, and the benefits of O(1) time
in a hash table are completely lost.

The FNV-1a algorithm has comparable performance and code size
while having a much better collision rate.

While there are some other faster and more resistant hashes out there,
namely xxHash, Murmur2, and CityHash, the benefits are minimal on
short strings and they expect a length argument, unlike FNV-1a which
simply uses null-termination, and they are not in the public domain
like FNV-1a.

Signed-off-by: Devin Hussey <husseydevin@gmail.com>
---
 src/alias.c | 14 ++++----------
 src/exec.c  | 11 +++++------
 src/shell.h |  3 +++
 src/var.c   |  5 ++---
 4 files changed, 14 insertions(+), 19 deletions(-)

Comments

Jilles Tjoelker Nov. 16, 2018, 12:23 p.m. UTC | #1
On Thu, Nov 15, 2018 at 09:27:42AM -0500, Devin Hussey wrote:
> >From 502cbd61e386eb014035df66b032e90a9de926b7 Mon Sep 17 00:00:00 2001
> From: Devin Hussey <husseydevin@gmail.com>
> Date: Thu, 15 Nov 2018 09:12:24 -0500
> Subject: [PATCH] Use a real non-cryptographic hash algorithm

> dash previously used a modified version of the LoseLose algorithm.

> LoseLose is a TERRIBLE algorithm.

> It has terrible distribution, leaving many hash entries unused.

> However, more importantly, it is incredibly easy to make a
> collision; even something as simple as rearranging the letters.

> For example. these all produce the hash 1652:

> Hello
> Holle
> Hlleo
> Hoell
> Haxed\n
> HASH\tBAD
> Hater

> Collsions in hash tables are very bad because it requires
> looking through a linked list, and the benefits of O(1) time
> in a hash table are completely lost.

> The FNV-1a algorithm has comparable performance and code size
> while having a much better collision rate.

You do have a dependency chain with a multiplication per character, so
it is less suitable for long strings. There are implementation
techniques to avoid the dependency chain, but I'm not sure it's worth
the trouble.

> While there are some other faster and more resistant hashes out there,
> namely xxHash, Murmur2, and CityHash, the benefits are minimal on
> short strings and they expect a length argument, unlike FNV-1a which
> simply uses null-termination, and they are not in the public domain
> like FNV-1a.

> [snip]
> +        hashval = (hashval ^ (unsigned char)*p++) * FNV_PRIME;
> [snip]
> +        hashval = (hashval ^ (unsigned char)*p++) * FNV_PRIME;
> [snip]
> +        hashval = (hashval ^ (unsigned char) *p++) + FNV_PRIME;

I suppose this latter one should be * instead of + as well?
Herbert Xu Dec. 14, 2018, 4:56 a.m. UTC | #2
Devin Hussey <husseydevin@gmail.com> wrote:
> From 502cbd61e386eb014035df66b032e90a9de926b7 Mon Sep 17 00:00:00 2001
> From: Devin Hussey <husseydevin@gmail.com>
> Date: Thu, 15 Nov 2018 09:12:24 -0500
> Subject: [PATCH] Use a real non-cryptographic hash algorithm
> 
> dash previously used a modified version of the LoseLose algorithm.
> 
> LoseLose is a TERRIBLE algorithm.
> 
> It has terrible distribution, leaving many hash entries unused.
> 
> However, more importantly, it is incredibly easy to make a
> collision; even something as simple as rearranging the letters.
> 
> For example. these all produce the hash 1652:
> 
> Hello
> Holle
> Hlleo
> Hoell
> Haxed\n
> HASH\tBAD
> Hater
> 
> Collsions in hash tables are very bad because it requires
> looking through a linked list, and the benefits of O(1) time
> in a hash table are completely lost.
> 
> The FNV-1a algorithm has comparable performance and code size
> while having a much better collision rate.
> 
> While there are some other faster and more resistant hashes out there,
> namely xxHash, Murmur2, and CityHash, the benefits are minimal on
> short strings and they expect a length argument, unlike FNV-1a which
> simply uses null-termination, and they are not in the public domain
> like FNV-1a.
> 
> Signed-off-by: Devin Hussey <husseydevin@gmail.com>

I'm confused.  What problem are you trying to solve?

If it's security against an adversary, then clearly your replacement
isn't up to stratch either.  For real security, you'd need a strong
algorithm and periodic rehashing.

If it's hash collisions with a real-world script, please provide an
example.

Cheers,
diff mbox series

Patch

diff --git a/src/alias.c b/src/alias.c
index daeacbb..f625199 100644
--- a/src/alias.c
+++ b/src/alias.c
@@ -202,19 +202,13 @@  printalias(const struct alias *ap) {

 STATIC struct alias **
 __lookupalias(const char *name) {
-    unsigned int hashval;
     struct alias **app;
-    const char *p;
-    unsigned int ch;
+    unsigned int hashval = FNV_OFFSET_BASIS;
+    const char *p = name;

-    p = name;
+    while (*p)
+        hashval = (hashval ^ (unsigned char)*p++) * FNV_PRIME;

-    ch = (unsigned char)*p;
-    hashval = ch << 4;
-    while (ch) {
-        hashval += ch;
-        ch = (unsigned char)*++p;
-    }
     app = &atab[hashval % ATABSIZE];

     for (; *app; app = &(*app)->next) {
diff --git a/src/exec.c b/src/exec.c
index 9d0215a..c6b515d 100644
--- a/src/exec.c
+++ b/src/exec.c
@@ -638,16 +638,15 @@  struct tblentry **lastcmdentry;
 STATIC struct tblentry *
 cmdlookup(const char *name, int add)
 {
-    unsigned int hashval;
-    const char *p;
     struct tblentry *cmdp;
     struct tblentry **pp;

-    p = name;
-    hashval = (unsigned char)*p << 4;
+    unsigned int hashval = FNV_OFFSET_BASIS;
+    const char *p = name;
+
     while (*p)
-        hashval += (unsigned char)*p++;
-    hashval &= 0x7FFF;
+        hashval = (hashval ^ (unsigned char)*p++) * FNV_PRIME;
+
     pp = &cmdtable[hashval % CMDTABLESIZE];
     for (cmdp = *pp ; cmdp ; cmdp = cmdp->next) {
         if (equal(cmdp->cmdname, name))
diff --git a/src/shell.h b/src/shell.h
index 98edc8b..d08d0d5 100644
--- a/src/shell.h
+++ b/src/shell.h
@@ -82,6 +82,9 @@  extern char nullstr[1];        /* null string */
 #define TRACEV(param)
 #endif

+#define FNV_PRIME 16777619u
+#define FNV_OFFSET_BASIS 2166136261u
+
 #if defined(__GNUC__) && __GNUC__ < 3
 #define va_copy __va_copy
 #endif
diff --git a/src/var.c b/src/var.c
index 0d7e1db..be34731 100644
--- a/src/var.c
+++ b/src/var.c
@@ -637,11 +637,10 @@  void unsetvar(const char *s)
 STATIC struct var **
 hashvar(const char *p)
 {
-    unsigned int hashval;
+    unsigned int hashval = FNV_OFFSET_BASIS;

-    hashval = ((unsigned char) *p) << 4;
     while (*p && *p != '=')
-        hashval += (unsigned char) *p++;
+        hashval = (hashval ^ (unsigned char) *p++) + FNV_PRIME;
     return &vartab[hashval % VTABSIZE];
 }