misc: a compressed network protocol spec...

Giganews Newsgroups
Subject: misc: a compressed network protocol spec...
Posted by:  BGB (cr881…@hotmail.com)
Date: Wed, 22 Feb 2012

recently, I had partly designed, but didn't fully implement, a
compressed network protocol (to get things working faster, I ended up
jerry-rigging something so that an ASCII serialized data and Deflate
could be used instead, but kept the same escape-coding scheme).

the basic idea is that it will relay list/S-Expression style data over a
TCP socket.

as for the "proper" implementation, I got it mostly written, but then
noted that deflate was "sufficient" at saving space (textual S-Exps were
printed, deflated, and then sent over the socket).

the main difference was that I had planned a more elaborate scheme which
would have made basic attempts to model the message data, and borrowed
some ideas from Deflate. being able to compress/decompress quickly was
also a goal.

basically, this would be for a real-time bidirectional asynchronous
message stream with "moderate" volume, and compression was a goal mostly
to reduce bandwidth use and also hopefully reduce the risk of stalls and
improve latency. uncompressed it would need an estimated 500 to 800
kbps, but I had wanted to keep it within "internet radio" levels (raw
deflate managed to do this, generally compressing messages to around 10%
their original size, reducing the need for a more specialized scheme).

I just figured maybe people here might be interested in looking at it or
commenting on it though.

but, anyways, below is/was the "working spec":

----

Idea:
Message Coding scheme for S-Expressions.

Currently, this will not address the matter of value 'identity' or of
encoding cycles (conceptually, any encoded expressions will be assumed
to be acyclic).

Packaging:
Marker codes used between coded messages.
    magic='x7F,xFF,xFE'
    magic,x00        Escape Marker (Escaped version of magic).
    magic,x01        Start of Message
    magic,x02        End of Message
    magic,x03        Reset Marker
    magic,x04-x07    Huffman Table 0-3
    magic,x08        End Huffman Table
    magic,x09        Start of Deflated Message
    magic,x0A        End of Deflated Message

Possible Issue: The long escape may be more costly to check inline in a
bitstream reader (say, vs the single 'FF' in JPEG).

Reset Marker

The Reset Marker is followed by:
A 2 byte protocol version, currently x00,x01;
A 1 byte command tag;
A 1 byte auxilary tag;
Data, 0 or more bytes.

Command Tag
x00: Stream Abort, (Indicates that the stream is being closed).
After a sender sends a stream abort, it is no longer expected to respond
to messages, and may also close the socket.
x01: Stream Init, Indicates that the stream has been newly opened, and
both sides should begin usual initialization processes.
x02: Stream Reset, Indicates that an error has occured and the stream
should reset to its default initial state.

Deflated Message
    Start Marker followed by 32-bit size (LE).
    End Marker followed by 32-bit Adler-32 (LE).
    Escape-filtered Deflated data is present between the markers.

Message

The bitstream will start packing bits in a low-high order (similar to
deflate). Multi-bit values will be encoded starting with the LSB.

Code Ranges
x00-x3F: Built-in Operations
x40-x7F: reserved
x80-xFF: MRU Reference

Operations
    x00, End Of List
    x01, Start Of List
    x02, Symbol (Coded String)
    x03, String (Coded String)
    x04, Keyword (Coded String)
    x05, Fixnum/Integer (Signed Coded Integer)
    x06, Flonum (Coded Real)
    x07, Double (Coded Real)
    x08, Start Of Dotted List
    x09, Start Of Array
    ...
    x10, EOL ("()")
    x11, True ("#t")
    x12, False ("#f")
    x13, Null ("#z")
    x14, Undefined ("#u")
    x15, Flonum Escape (Coded Real Escape)
    x16, Double Escape (Coded Real Escape)

Value MRU
An MRU is used for recent values.
The MRU will have a bounded size, and stale values will "fall off the end".
Making a reference into the MRU will move the value to the front of the
list.
Uses Huffman Table 0

Prefix    Value        Extra Bits
80-8F    0-15        0
90-97    16-31        1
98-9F    32-63        2
A0-A7    64-127        3
A8-AF    128-255        4
B0-B7    256-511        5
B8-BF    512-1023    6
C0-C7    1024-2047    7
C8-CF    2048-4095    8
D0-D7    4096-8191    9
D8-DF    8192-16383    10
E0-E7    16384-32767    11
E8-EF    32768-65535    12
F0-FF    -

Coded String/Data
Will use a mix of UTF-8 and LZ77.
Uses Huffman Table 1

00-FF: Raw Values
100: End Of String
101-013F: LZ Run

LZ Run Values
101-107        1-7            0
108-10B        8-15        1
10C-10F        16-31        2
110-113        32-63        3
114-117        64-127        4
118-11B        128-255        5
11C-11F        256-511        6

Distances will be an Unsigned Coded Integer representing the relative
position of the run within the sliding window. Currently, the sliding
window will be defined to be 64kB.

Note that the value 2 will be added to the length of runs, such that the
shortest encodable run is 3.

Coded Integers

Distances and Integers use Huffman Table 2

Prefix    Value Range                Suffix Bits
00-07    0-7                        0
08-0B    8-15                    1
0C-0F    16-31                    2
10-13    32-63                    3
14-17    64-127                    4
18-1B    128-255                    5
1C-1F    256-511                    6
20-23    512-1023                7
24-27    1024-2047                8
28-2B    2048-4095                9
2C-2F    4096-8191                10
30-33    8192-16383                11
34-37    16384-32767                12
38-3B    32768-65535                13

3C-3F    65536-131071            14
40-43    131072-252143            15
44-47    262144-524287            16
48-4B    524288-1048575            17
4C-4F    1048576-2097151            18
50-53    2097152-4194303            19
54-57    4194304-8388607            20
58-5B    8388608-16777215        21
5C-5F    16777216-33554431        22
60-63    33554432-67108863        23
64-67    67108864-134217727        24
68-6B    134217728-268435455        25
6C-6F    268435456-536870911        26
70-73    536870912-1073741823    27
74-77    1073741824-2147483647    28
78-7B    2147483648-4294967295    29

7C-7F    4294967296-8589934591                        30
80-83    8589934592-17179869183                        31
84-87    17179869184-34359738367                        32
88-8B    34359738368-68719476735                        33
8C-8F    68719476736-137438953471                    34
90-93    137438953472-274877906943                    35
94-97    274877906944-549755813887                    36
98-9B    549755813888-1099511627775                    37
9C-9F    1099511627776-2199023255551                    38
A0-A3    2199023255552-4398046511103                    39
A4-A7    4398046511104-8796093022207                    40
A8-AB    8796093022208-17592186044415                41
AC-AF    17592186044416-35184372088831                42
B0-B3    35184372088832-70368744177663                43
B4-B7    70368744177664-140737488355327                44
B8-BB    140737488355328-281474976710655                45
BC-BF    281474976710656-562949953421311                46
C0-C3    562949953421312-1125899906842623            47
C4-C7    1125899906842624-2251799813685247            48
C8-CB    2251799813685248-4503599627370495            49
CC-CF    4503599627370496-9007199254740991            50
D0-D3    9007199254740992-18014398509481983            51
D4-D7    18014398509481984-36028797018963967            52
D8-DB    36028797018963968-72057594037927935            53
DC-DF    72057594037927936-144115188075855871        54
E0-E3    144115188075855872-288230376151711743        55
E4-E7    288230376151711744-576460752303423487        56
E8-EB    576460752303423488-1152921504606846975        57
EC-EF    1152921504606846976-2305843009213693951        58
F0-F3    2305843009213693952-4611686018427387903        59
F4-F7    4611686018427387904-9223372036854775807        60
F8-FB    9223372036854775808-18446744073709551615    61

// FC-FF    18446744073709551616
// 18446744073709551616

For signed values, the sign will be folded into the LSB, following the
pattern 0, -1, 1, -2, 2, ...

Coded Real
Reals will be encoded as a pair of Signed Coded Integers.
First will be the mantissa, followed by the exponent.
The exponent will specify how many bits to shift the mantissa left or
right, with positive values indicating a left-shift, and negative values
indicating a right shift.

Coded Real Escape
Encodes certain special cases with real values.

It is a signed integer currently with several defined values:
    0=NaN, 1=Inf, -1=-Inf.

Huffman Table

The Huffman Tables will use Rice Codes and Runs to encode an array of
symbol lengths.

Symbols with a length of 0 are not present in the encodable character
set, as are any symbols past the end of the Huffman table (they will
implicitly have values of 0).

The layout of a Huffman Table will be:
    4 bits: Table Type (0=Fixed, 1=Dynamic)
    4 bits: Table Index, or K-Factor (Rice Codes)
    if(type==Dynamic)
        <n bits>: coded length runs.

Huffman codes will be assigned starting with 0, starting with shortest
lengths first, and starting with the smallest symbol within that
code-length. This way, the symbol lengths are sufficient to
unambiguously define the layout of the Huffman table.

For example, if one has the symbols A-H, with the lengths
(3,3,3,3,3,2,4,4), then the assigned Huffman codes will be:
F=00, A=010, B=011, C=100, D=101, E=110, G=1110, H=1111

Fixed Tables
    0, The table will have 256 symbols all with a length of 8.
    1, The length ranges will be:
        0-143, 8
        144-255, 9
        256-279, 7
        280-287, 8
    2, Length Ranges will be:
        0-15, 5
        16-31, 6
        32-63, 8
        64-127, 9
        128-255, 11

        0xxxx        (00-0F)
        10xxxx        (10-1F)
        110xxxxx    (20-3F)
        1110xxxxx    (40-7F)
        11110xxxxxx    (80-BF)
        11111xxxxxx    (C0-FF)

Rice Codes

Will consist of a prefix consisting of 0 or more 1 bits terminated by a
0 bit (giving the value 'n'), and a fixed value of k bits (v). The value
will be 'value=(n<<k)|v'.

Coded Length Table

The length table is encoded as an array of values:
    0-16: Code Length
    17: RLE Run of 3-10 (3 bit length + 3)
    18: RLE Run of 11-138 (7 bit length + 11)
        RLE Runs will repeat the prior value a certain number of times.
    19: End Of Table

Each value is encoded as a rice-code using the permutation:
    16, 17, 18, 0, 8,7, 9,6, 10,5, 11,4, 12,3, 13,2, 14,1, 15, 19

With the inverted (decoder) mapping being:
    3, 17, 15, 13, 11,9, 7,5, 4,6, 8,10, 12,14, 16,18, 0,1, 2, 19

Replies