input  frequency  code
a      16         00
b      12         010
c       9         0110
d       7         0111
e       6         100
f       6         1010
g       6         1011
h       5         1100
i       3         11010
j       3         11011
k       3         11100
l       2         11101
m       1         111100
n       1         111101
o       1         111110
p       1         1111110
r       1         1111111
Observe that the encoding for e, with a frequency of 6, is shorter than those of c and d, with frequencies of 9 and 7 respectively. Clearly I would get better overall compression by swapping the codes for c and e.

Let me show, at least in part, how the codes are derived. First I split between d and e, giving a combined frequency of 44 on the left and 39 on the right (a distance of 5). Splitting between c and d would yield 37 and 46, a distance of 9, and splitting between e and f yields 50 and 33, a distance of 17 – both inferior. So a, b, c, and d start with 0, and the remaining inputs start with 1. Recursively I then process the abcd side. There I find that splitting between a and b and splitting between b and c yield exactly the same 'error' or distance, namely 12, so I just pick the first one, yielding the encoding you see above.

When I noticed this problem I modified the algorithm to favor the split point closest to the center of the current segment in case of such a tie. That does indeed yield 3-bit codes for the four inputs on the left-hand side of the tree, thus "repairing" the table in the sense that no input has a shorter code than another input with a higher frequency. But I have not seen this tie-breaking rule mentioned anywhere, and besides, the result is still inferior in compression ratio to the table above with the codes for c and e swapped.

What gives? Is this a known limitation of Shannon-Fano, or did I make a mistake somewhere?
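For reference, the splitting procedure I describe can be sketched like this (a minimal Python reconstruction from my description above, not production code; the "pick the first split point on a tie" rule falls out of the strict comparison):

```python
# A minimal sketch of the Shannon-Fano construction described above:
# sort inputs by descending frequency, split where the two halves'
# totals are closest (ties go to the first candidate), and recurse.
def shannon_fano(symbols):
    """symbols: list of (symbol, frequency), sorted by descending frequency."""
    if len(symbols) == 1:
        return {symbols[0][0]: ""}
    total = sum(f for _, f in symbols)
    best_i, best_dist, left = 1, None, 0
    for i in range(1, len(symbols)):
        left += symbols[i - 1][1]
        dist = abs(left - (total - left))
        if best_dist is None or dist < best_dist:  # strict '<' keeps the first tie
            best_i, best_dist = i, dist
    codes = {}
    for sym, code in shannon_fano(symbols[:best_i]).items():
        codes[sym] = "0" + code          # left half gets prefix 0
    for sym, code in shannon_fano(symbols[best_i:]).items():
        codes[sym] = "1" + code          # right half gets prefix 1
    return codes

freqs = [("a", 16), ("b", 12), ("c", 9), ("d", 7), ("e", 6), ("f", 6),
         ("g", 6), ("h", 5), ("i", 3), ("j", 3), ("k", 3), ("l", 2),
         ("m", 1), ("n", 1), ("o", 1), ("p", 1), ("r", 1)]
codes = shannon_fano(freqs)

# Total cost (weighted code length) of this table, and of the same table
# with the codes for c and e swapped -- the swap saves 3 bits.
cost = sum(f * len(codes[s]) for s, f in freqs)
swapped = dict(codes, c=codes["e"], e=codes["c"])
cost_swapped = sum(f * len(swapped[s]) for s, f in freqs)
```

Running this reproduces the table above (c → 0110, e → 100) and confirms the point about the swap: the generated table costs 305 bits on these frequencies, the swapped one 302.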
Asked By : 500 – Internal Server Error
Answered By : Yuval Filmus
Question Source : http://cs.stackexchange.com/questions/32831