input  frequency  code
a      16         00
b      12         010
c       9         0110
d       7         0111
e       6         100
f       6         1010
g       6         1011
h       5         1100
i       3         11010
j       3         11011
k       3         11100
l       2         11101
m       1         111100
n       1         111101
o       1         111110
p       1         1111110
r       1         1111111
Observe that the encoding for e, with a frequency of 6, is shorter than those of c and d, with frequencies of 9 and 7 respectively. Clearly I would get better overall compression by swapping the codes for c and e.

Let me show, at least in part, how the codes are derived. First I split between d and e, giving a combined frequency of 44 on the left and 39 on the right (a distance of 5). Splitting between c and d would yield 37 and 46, a distance of 9, and splitting between e and f yields 50 and 33, a distance of 17 – both inferior. So a, b, c, and d start with 0, and the remaining inputs start with 1. Recursively I then process the abcd side. There I find that splitting between a and b and splitting between b and c yield exactly the same 'error' or distance, namely 12, so I just pick the first one, yielding the encoding you see above.

When I noticed this problem I modified the algorithm to favor the split point closest to the center of the current segment in case of such a tie. That does indeed yield 3-bit codes for the four inputs on the left-hand side of the tree, thus "repairing" the table in the sense that no input has a shorter code than another input with a higher frequency. But I have not seen this tie-breaking rule mentioned anywhere, and besides, the result is still inferior in compression ratio to the table above with the codes for c and e swapped.

What gives? Is this a known limitation of Shannon-Fano, or did I make a mistake somewhere?
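For reference, the splitting procedure I describe can be sketched like this (a minimal Python reconstruction from my description above, not production code; the "pick the first split point on a tie" rule falls out of the strict comparison):

```python
# A minimal sketch of the Shannon-Fano construction described above:
# sort inputs by descending frequency, split where the two halves'
# totals are closest (ties go to the first candidate), and recurse.
def shannon_fano(symbols):
    """symbols: list of (symbol, frequency), sorted by descending frequency."""
    if len(symbols) == 1:
        return {symbols[0][0]: ""}
    total = sum(f for _, f in symbols)
    best_i, best_dist, left = 1, None, 0
    for i in range(1, len(symbols)):
        left += symbols[i - 1][1]
        dist = abs(left - (total - left))
        if best_dist is None or dist < best_dist:  # strict '<' keeps the first tie
            best_i, best_dist = i, dist
    codes = {}
    for sym, code in shannon_fano(symbols[:best_i]).items():
        codes[sym] = "0" + code          # left half gets prefix 0
    for sym, code in shannon_fano(symbols[best_i:]).items():
        codes[sym] = "1" + code          # right half gets prefix 1
    return codes

freqs = [("a", 16), ("b", 12), ("c", 9), ("d", 7), ("e", 6), ("f", 6),
         ("g", 6), ("h", 5), ("i", 3), ("j", 3), ("k", 3), ("l", 2),
         ("m", 1), ("n", 1), ("o", 1), ("p", 1), ("r", 1)]
codes = shannon_fano(freqs)

# Total cost (weighted code length) of this table, and of the same table
# with the codes for c and e swapped -- the swap saves 3 bits.
cost = sum(f * len(codes[s]) for s, f in freqs)
swapped = dict(codes, c=codes["e"], e=codes["c"])
cost_swapped = sum(f * len(swapped[s]) for s, f in freqs)
```

Running this reproduces the table above (c → 0110, e → 100) and confirms the point about the swap: the generated table costs 305 bits on these frequencies, the swapped one 302.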
Asked By : 500 – Internal Server Error
Answered By : Yuval Filmus
Question Source : http://cs.stackexchange.com/questions/32831