# A Random Data Compressor
Posted by: Gerald Tamayo (comp…@gmail.com)
Date: Sat, 18 May 2019

Do you have a library for arithmetic on very large integers (BigNums, BigInts, or infinite-precision integers)? Then maybe you can use it for the compression algorithm I describe here. But really, we want the integers (frequencies) to stay as small as possible to avoid very long computations.

The random data compression algorithm is best understood if you "visualize" a bar chart or histogram, where each new symbol's frequency always tries to become greater than the current highest frequency: we increment the current highest frequency by its delta with the new symbol's frequency, and that sum becomes the new symbol's frequency. Put simply, the new symbol must end up with the highest frequency. So at most, the highest frequency can only "double", i.e. grow by 1 bit in length, per symbol. (In decoding, the symbol with the highest frequency is the symbol to decode; this means it is stack-based. We add the delta to the highest frequency during encoding precisely so we can get back to the symbol's previous frequency when decoding.) The output is actually the frequency table itself, which is hopefully easy to compress or generate.

Algorithm pseudo-code:

/* initialize frequency table. */
for (i = 0; i < 256; i++) freq[i] = i + 1;
max = freq[255];

do {
    c = get_byte(infile);
    if (c == EOF) break;
    freq[c] = max + (max - freq[c]);
    max = freq[c];
} while (1);
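The pseudo-code above maps directly onto a language with built-in big integers. Here is a runnable sketch in Python (my own naming and structure, not the author's code), including one way to realize the backwards, stack-like decoder the post describes: at each step the symbol with the highest frequency is decoded, and the delta step is undone to recover the previous table. It assumes the input has no two consecutive equal bytes, and does not begin with byte 255, which already holds the initial maximum.

```python
def encode(data):
    """Process `data` and return the final frequency table (the output)."""
    freq = [i + 1 for i in range(256)]   # initialize frequency table
    mx = freq[255]                       # current highest frequency (256)
    for c in data:
        # New frequency = old max + (old max - freq[c]), so `c` becomes
        # the unique maximum while the old value stays recoverable.
        freq[c] = mx + (mx - freq[c])
        mx = freq[c]
    return freq

def decode(freq, size):
    """Recover `size` symbols from the table, last symbol first."""
    freq = list(freq)                    # work on a copy
    out = []
    for _ in range(size):
        c = max(range(256), key=lambda i: freq[i])   # highest frequency
        # The previous maximum is now the second-highest entry.
        prev_max = max(freq[i] for i in range(256) if i != c)
        freq[c] = 2 * prev_max - freq[c]             # undo the delta step
        out.append(c)
    out.reverse()                        # decoding ran backwards
    return bytes(out)

data = bytes([10, 20, 10, 30, 20])       # no consecutive repeats
table = encode(data)
assert decode(table, len(data)) == data  # round-trip, given the symbol count
```

The decoder's only extra parameter is the symbol count, i.e. the "famed file_size()" mentioned below. The table entries, plain Python ints here, play the role of the BigNums: since the maximum gains up to a bit per input byte, they are already thousands of bits long after a few kilobytes of input.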

No "runs" of a single character are allowed in the input, and as few "run-lengths" as possible. "Random data" indeed.

New or recalled observations:

1.  This algorithm ironically "expands" the frequencies at first. LOL

We're back to the early days of information theory or data compression history!
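Observation 1 is easy to check numerically. The following sketch (my own, under the same no-runs assumption) feeds 64 distinct symbols through the update rule and watches the bit length of the maximum frequency:

```python
freq = [i + 1 for i in range(256)]       # initial table, as in the post
mx = freq[255]                           # 256: a 9-bit number
bits = [mx.bit_length()]
for c in range(64):                      # 64 distinct symbols, so no runs
    freq[c] = mx + (mx - freq[c])        # the update rule from the post
    mx = freq[c]
    bits.append(mx.bit_length())
# The maximum gains at most 1 bit per symbol, and here it nearly doubles
# every step: 9 bits initially, 72 bits after 64 symbols.
```

Since the maximum roughly doubles per symbol, the table costs about one bit per input byte just to represent its largest entry: expansion, not compression.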

2.  The bombshell: it takes more than 1 added bit to encode a very small frequency that suddenly must become the maximum. The solution might be to "swap" them, but this requires new information or codes. This is back to delta coding, alas.

3.  But a total cycling of the frequency table might work...

4.  Instead of 8-bit bytes, use 4-bit symbols;

***

This is similar, I think, to WEB Technologies' algorithm as featured in BYTE magazine in 1992 and noted by the comp.compression FAQ:

"WEB, in fact, says that virtually any amount of data can be
squeezed to under 1024 bytes by using DataFiles/16 to compress
its own output multiple times."

I think they were using or playing with a frequency table too: 256 32-bit frequencies = 1K.

They might have had to output the MSBit of the highest frequency, the result of which may equal another byte's frequency (or several frequencies)?

That's why they had the problem that 4 numbers in a matrix are equal, a rare case in their algorithm.

Just maybe.

(Ideally, there is at most a 1-bit increase in the frequency of the output or new symbol, but the bombshell precludes that. If the two frequencies are of the same bit size, then there is only a 1-bit increase in the new max frequency.

The current symbol always has the highest frequency.

You decode backwards, from last symbol to first; the symbol with the highest frequency is the current symbol.

One parameter in decoding is the famed file_size().)

The problem with the algorithm is that the emitted frequency table can be very large, due to the very large frequencies, if you implement it by really using BigNums or BigInts; you then have to compress the very large frequency table.

Maybe to achieve compression, you can just consider the MSBit after the arithmetic (addition) operation. Or the solution is nearly just MTF: you have to output the character that *doubled* (MSBit activated).

WEB Technologies' DataFiles/16 algorithm was clearly designed for compression of *random* data, and recursively at that, both of which are futile indeed.