struct SEntry
{
    char *string;
    uint32 data; // or whatever data was associated with the string
};
Immediately after the last data entry, the file would then contain the (null terminated) string for the first entry, then the string for the second entry, and so on, up to the string for data entry $n$. On disk, the char * would store the offset in the file to the string it pointed to. This way, after I read the file into memory, I just add the address of the memory buffer to each char * to make it point at the (null terminated) string associated with that entry. To make data lookup by string fast, I'd sort the entries in the file alphabetically by string, so that a lookup is $O(\log_2 N)$.

Hash Table

This one is only slightly more complicated. The file would once again first contain a uint32 specifying how many entries there were. A hash function would be used on the strings to output an $M$ bit hash; for ease of understanding, let's assume it gives a 4 bit hash. The data entries would be sorted by the hash of their strings, effectively grouping them into 16 groups that are ideally evenly sized on average. After the entry count, the file would then contain 16 SEntry pointers, which store the offsets where each of those groups begins in the file. On load, we would fix these up to point at the actual places in memory, just like the strings. The strings would come after the last data entry, just as in the previous layout.

To search this structure once it's in memory and fixed up, you hash your string with the same 4 bit hash, find where that bucket begins, and search the items in the bucket linearly until you find the item you are looking for, or until you reach the end of the bucket. The end of a bucket is where the next bucket starts; the last bucket ends where the first data entry's string begins. Empty buckets have the same starting address as the next bucket, so they have a size of zero.
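A minimal C sketch of the load-time fixup and the bucket lookup described above (all names are hypothetical, and for simplicity I've added a sentinel slot after the last bucket pointer to mark the end of the final bucket, instead of using the first string's address):

```c
#include <stdint.h>
#include <string.h>

typedef struct {
    char    *string;  /* on disk: a file offset; in memory: a fixed-up pointer */
    uint32_t data;    /* or whatever data is associated with the string */
} SEntry;

/* After reading the whole file into `buffer`, turn each stored offset
   into a real pointer by adding the buffer's base address. */
static void fixup_strings(SEntry *entries, uint32_t count, char *buffer)
{
    for (uint32_t i = 0; i < count; i++)
        entries[i].string = buffer + (uintptr_t)entries[i].string;
}

/* Hash-table lookup: hash the key, then scan its bucket linearly.
   buckets[h] points at the first entry of bucket h; bucket h ends where
   bucket h+1 begins. `buckets` has num_buckets + 1 slots, the last being
   a sentinel one past the final entry, which also makes empty buckets
   (start == end) fall out naturally. */
static SEntry *lookup(SEntry **buckets, uint32_t num_buckets,
                      const char *key, uint32_t (*hash)(const char *))
{
    uint32_t h = hash(key) & (num_buckets - 1); /* num_buckets is a power of two */
    for (SEntry *e = buckets[h]; e != buckets[h + 1]; e++)
        if (strcmp(e->string, key) == 0)
            return e;
    return NULL; /* not found */
}
```

For the plain binary-search layout, `lookup` would instead be a `bsearch` over the whole entry array with `strcmp` on the fixed-up pointers.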
Searching this data structure is $O(N/16)$, or more generally $O(N/2^M)$ (assuming evenly sized buckets), so lookup time improves as $M$ gets bigger, but there is also a memory overhead of $2^M$ SEntry pointers in the file to store the bucket information.

Hybrid

I also had the idea of a hybrid solution, where the hash table layout is used, but each hash bucket is sorted so that it can be binary searched.

Analysis

Let's say that a typical file (neither best nor worst case) contains 20,000 items. That may not be very much in some circles, but these lookups are done in performance critical situations where microseconds matter, so this needs to be as fast as possible, and also as small as possible, because RAM is also precious, second only to performance.

Going with the binary search, $\log_2 20000 \approx 14$, so the worst case is about 14 string compares when looking for a string that isn't there.

Going with the 4 bit hash function and 16 hash buckets, it is instead $20000/16 = 1250$, so a worst case of 1250 string compares, which is quite a bit. It only has 128 bytes of overhead for the hash table information, though (16 × 64 bit pointers).

Going with an 8 bit hash and 256 hash buckets, that is $20000/256 \approx 78$, so a worst case of about 78 string compares, with 2048 bytes (2 KB) of overhead for the hash table information (256 × 64 bit pointers).

Going with the hybrid solution for the 4 bit hash, the 1250-compare worst case becomes ~10 compares. For the 8 bit hash, the 78-compare worst case becomes ~6 compares.

That 8 bit hybrid setup only brings the count down to 6, from the binary search's 14. It seems like binary search brings far more to the table than hash tables do in this situation, but maybe a hybrid with a small bit count could still be helpful?

With these details, do you guys have any other recommendations? Thanks!
Asked By : Alan Wolfe
Answered By : TilmannZ
Question Source : http://cs.stackexchange.com/questions/50360