[Solved]: Term for binary search tree using hashes?

Problem Detail: I was looking for a way to easily store and access a symbol table using the least memory and code as possible and I went with a BST. Symbols, however, tend to be defined in order as in foo0, foo1, foo2… so tree balance is an issue. I considered balancing and auto-balancing algorithms, but then I realized that the hash for a key given a good hash function should have a 50% average chance for left/right. And I already had a hash function so that meant no additional code. So I would get a balanced tree for free(edit: at no point did I want to imply a perfectly balanced tree, this is emergent behavior). I wanted to check if my theory was correct. When I search for BSTs, hashes, and hash trees, however, no relevant information turns up. What is the term for a binary search tree that balances itself by using a hash? edit: I am going to accept the answer because if something doesn’t have a name it just doesn’t, but I tested this because some comments and asides imply that it is worse than it actually is. I tested a sorted list of 468685 words containing both English words and “elite” password cracking words.

balanced-tree:    worst(19)     average(18.84) edlin-tree:       worst(47)     average(22.72) edlin-tree(SHA2): worst(46)     average(23.25) regular BST:      worst(468685) average(234342.50)

So it’s not going to replace your language library algorithms but, as stated in the accepted answer, it is obviously logarithmic. As for the hash function(Which for reference is add+xorshift as in:

do{h+=*s;h^=h<<left1;h^=h>>right;h^=h<<left2;}while(*s);

), I have also tested how bad it is by comparing collisions to the expected value.

8 bits:    expected :  468428.99    axs      :  468429    SHA2-256 :  468429  16 bits:    expected :  403200.35    axs      :  403189    SHA2-256 :  403198  24 bits:    expected :    6485.99    axs      :    6451    SHA2-256 :    6433  31 bits:    expected :      51.14    axs      :      59    SHA2-256 :      47  32 bits:    expected :      25.57    axs      :      30    SHA2-256 :      20  33 bits:    expected :      12.79    axs      :      14    SHA2-256 :      11  35 bits:    expected :       3.19    axs      :       7    SHA2-256 :       2  36 bits:    expected :       1.59    axs      :       1    SHA2-256 :       2  37 bits:    expected :       0.80    axs      :       0    SHA2-256 :       0

So, while it appears to lag for some bitnesses, it generally follows the expected values, both for tree depth and collisions. What would a hash which is not good for your input mean? That your input generates a sequence when passed to it? That would be ridiculously unlikely statistically speaking. Worst case is O(n) but you are not going to see it. The algorithm is not good compared to AVL or red-black, even though insertion will probably be faster generally, however it beats a linked list or betting on input for a BST any day.

Asked By : Edlin

Answered By : Edlin

It’s not that the analysis in the previous answer is wrong regarding the weight balance, but the answer I was looking for was that this algorithm is basically equivalent to a http://en.wikipedia.org/wiki/Random_binary_tree . The longest path is $$4.311 times log_{e} n$$ EDIT: I’ll update this answer because the information below was relatively hard to come by. The magic number comes from:

Reed, Bruce (2003), "The height of a random binary search tree", Journal of the ACM 50 (3): 306–332, doi:10.1145/765568.765571.

The author shows that the expected height of a random binary tree is $E(H_n ) = alpha log_e n − betalog_e log_e n + O(1)$, where $α = 4.311ldots$ and $β = 1.953ldots$. The magic number represents the upper bound. The expected height happens to be $approxlog_{frac{4}{3}} n$, which is somewhat intuitive when you consider that each parent node represents a random pivot partitioning the node search space like in random quicksort. Finally, the average access time for any given node is actually $approxlog_{frac{4}{3}} sqrt{n}$(I can’t find a reference for this). That makes using hash as the key $approx1.205log_2n$ or 20% slower on average than a self-balancing binary tree.* Similar structures with various trade-offs include the hash-as-priority treap, hash+b-tree, and hash+suffix tree. The suffix tree is only a bit less lazy than the proposed structure and provides expected $log_2n$ access time. [*] Many balanced tree algorithms also have a worst-case height upper bound of $log_phi n$

Best Answer from StackOverflow

Question Source : http://cs.stackexchange.com/questions/28194 Ask a Question Download Related Notes/Documents