struct SEntry
{
    char *string;
    uint32 data; // or whatever data was associated with the string
};
Immediately after the last data entry, the file would then contain the (null terminated) string for the first entry, then the string for the second entry, and so on, up to the string for data entry $n$. On disk, the char * would store the offset in the file to the string it pointed to. This way, after I read the file into memory, I just add the address of the memory buffer to each char * to make it point at the (null terminated) string associated with that entry. To make data lookup by string fast, I'd sort the entries in the file alphabetically by string, so that a lookup is $O(\log_2 N)$.

Hash Table

This one is only slightly more complicated. The file would once again first contain a uint32 specifying how many entries there were. A hash function would be used on the strings to output an $M$ bit hash; for ease of understanding, let's assume it gives a 4 bit hash. The data entries would be sorted by the hash of their strings, effectively grouping them into 16 groups that are ideally evenly sized on average. After the entry count, the file would then contain 16 SEntry pointers, which store the offsets where each of those groups begins in the file. On load, we would fix these up to point at the actual places in memory, just like the strings. The strings would come after the last data entry, just as in the previous layout.

To search this structure once it's in memory and fixed up, you hash your string with the same 4 bit hash, find where that bucket begins, and search the items in the bucket linearly until you find the item you are looking for, or until you reach the end of the bucket. The end of a bucket is where the next bucket starts; the last bucket ends where the first data entry's string begins. Empty buckets have the same starting address as the next bucket, so they have a size of zero.
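A minimal C sketch of the load-time fixup and the bucket lookup described above (all names are hypothetical, and for simplicity I've added a sentinel slot after the last bucket pointer to mark the end of the final bucket, instead of using the first string's address):

```c
#include <stdint.h>
#include <string.h>

typedef struct {
    char    *string;  /* on disk: a file offset; in memory: a fixed-up pointer */
    uint32_t data;    /* or whatever data is associated with the string */
} SEntry;

/* After reading the whole file into `buffer`, turn each stored offset
   into a real pointer by adding the buffer's base address. */
static void fixup_strings(SEntry *entries, uint32_t count, char *buffer)
{
    for (uint32_t i = 0; i < count; i++)
        entries[i].string = buffer + (uintptr_t)entries[i].string;
}

/* Hash-table lookup: hash the key, then scan its bucket linearly.
   buckets[h] points at the first entry of bucket h; bucket h ends where
   bucket h+1 begins. `buckets` has num_buckets + 1 slots, the last being
   a sentinel one past the final entry, which also makes empty buckets
   (start == end) fall out naturally. */
static SEntry *lookup(SEntry **buckets, uint32_t num_buckets,
                      const char *key, uint32_t (*hash)(const char *))
{
    uint32_t h = hash(key) & (num_buckets - 1); /* num_buckets is a power of two */
    for (SEntry *e = buckets[h]; e != buckets[h + 1]; e++)
        if (strcmp(e->string, key) == 0)
            return e;
    return NULL; /* not found */
}
```

For the plain binary-search layout, `lookup` would instead be a `bsearch` over the whole entry array with `strcmp` on the fixed-up pointers.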
Searching this data structure is $O(N/16)$, or more generally $O(N/2^M)$ (assuming evenly sized buckets), so lookup time improves as $M$ gets bigger, but there is also a memory overhead of $2^M$ SEntry pointers in the file to store the bucket information.

Hybrid

I also had the idea of a hybrid solution, where the hash table layout is used, but each hash bucket is sorted so that it can be binary searched.

Analysis

Let's say that a typical file (neither best nor worst case) contains 20,000 items. That may not be very much in some circles, but these lookups are done in performance critical situations where microseconds matter, so this needs to be as fast as possible, and also as small as possible, because RAM is also precious, second only to performance.

Going with the binary search, $\log_2 20000 \approx 14$, so the worst case is about 14 string compares when looking for a string that isn't there.

Going with the 4 bit hash function and 16 hash buckets, it is instead $20000/16 = 1250$, so a worst case of 1250 string compares, which is quite a bit. It only has 128 bytes of overhead for the hash table information, though (16 × 64 bit pointers).

Going with an 8 bit hash and 256 hash buckets, that is $20000/256 \approx 78$, so a worst case of about 78 string compares, with 2048 bytes (2 KB) of overhead for the hash table information (256 × 64 bit pointers).

Going with the hybrid solution for the 4 bit hash, the 1250-compare worst case becomes ~10 compares. For the 8 bit hash, the 78-compare worst case becomes ~6 compares.

That 8 bit hybrid setup only brings the count down to 6, from the binary search's 14. It seems like binary search brings far more to the table than hash tables do in this situation, but maybe a hybrid with a small bit count could still be helpful?

With these details, do you guys have any other recommendations? Thanks!
Asked By : Alan Wolfe
Answered By : TilmannZ
Question Source : http://cs.stackexchange.com/questions/50360