Starting at the beginning.

Giganews Newsgroups
Subject: Starting at the beginning.
Posted by:  Ernst_Be…@sbcglobal.net (Ernst Berg)
Date: 19 Jul 2004

Hello

As you may know I've worked on a Codec for a while and now that the
Codec program works well I want to learn more about Data Compression.
I have not written any data compression programs or worked with any
data compression algorithms except for a dictionary type algorithm
where strings were replaced with a code.  I have to start at the
beginning and learn.

The Codec I developed has an interesting ( to me ) structure.

The following data is the symbol counts of the Code generated by a
Codec I have developed.

Million Random Digits :
------------------------------------------------------
Symbol : 1  Symbol count : 829953 50.00990824  %
Symbol : 2  Symbol count : 414219 24.94941981  %
Symbol : 3  Symbol count : 208097 12.53418944  %
Symbol : 4  Symbol count : 104207  6.27664156  %
Symbol : 5  Symbol count : 51831  3.121907441 %
Symbol : 6  Symbol count : 26099  1.572006373 %
Symbol : 7  Symbol count : 12852  0.77410728  %
Symbol : 8  Symbol count : 6478    0.390185727 %
Symbol : 9  Symbol count : 3298    0.198646577 %
Symbol : 10  Symbol count : 1602  0.096492364 %
Symbol : 11  Symbol count : 803    0.048366647 %
Symbol : 12  Symbol count : 405    0.024394137 %
Symbol : 13  Symbol count : 203    0.012227185 %
Symbol : 14  Symbol count : 104    0.006264173 %
Symbol : 15  Symbol count : 37    0.0022286  %
Symbol : 16  Symbol count : 20    0.001204649 %
Symbol : 17  Symbol count : 17    0.001023951 %
Symbol : 18  Symbol count : 3      0.000180697 %
Symbol : 19  Symbol count : 4      0.00024093  %
Symbol : 20  Symbol count : 1      0.0000602324 % ( 6.02324e-05 )
Symbol : 23  Symbol count : 1      0.0000602324 % ( 6.02324e-05 )
Symbol : 25  Symbol count : 1      0.0000602324 % ( 6.02324e-05 )

        Total Symbol count 1660235
-------------------------------------------------------------------------
So, assuming that any encoded file will have a frequency structure
simular to the above data what method should I learn first? Huffman
encoding, Shannon-Fano coding or Arithmetic coding?
Since the symbols counts are approximately half of the previous
symbol count I was reading that Shannon-Fano coding was a good choice
but, another source I read suggested arithmetic coding was better.

So friends I am ready to learn and I am starting at the beginning.  I
program in C and I could benefit from some guidance.

Ernst

Replies