- $v_1$ : $langle 23, 23, 1, 33, 103, 219, dots rangle$
- $v_2$ : $langle 92, 83, 1, 33, 239, 192, dots rangle$
- …
I will use Hamming distance to calculate their difference: The difference between $v_1$ and $v_2$ is $4$ because elements $3$ and $4$ are the same and others are difference. Now, I want to use Locality Sensitive Hashing (LSH) to put those vectors into different bins.
What kind of hash function can I use for this case?
I have read some article about universal hash function, but I am not sure can I use it and how to ensure that the probability for the similar vectors going to the same bin is higher than those non-similar one. Here is the way that I think how should I use the universal hash function for my task. I will first divide those high dimensions vectors into sub-vectors: $$x : 23, 23 ; | ; 1, 33 ; | ; 103, 219 ; | ; dots$$ sub1-x : 23 23
sub2-x : 1 33
sub3-x : 103 219
The following function will be used for each sub-vector: $$sum_{i=0}^{r} a_{i}x_{i} mod m$$ Basically this is a dot product, a = {a_1, a_2, … a_i}, x = {23, 23, 1, 33, 103, 219 …}, m is a prime.
- Different combination of {a} will form a different hash table, one hash table is used for one sub-vector.
- I can now hash the data into bins, but the question is
Is this an LSH method? I don’t know that two similar vectors will go into the same bin with a high probability.
Asked By : sflee
Answered By : D.W.
Best Answer from StackOverflow
Question Source : http://cs.stackexchange.com/questions/13334