BWT, tuning for speed

Giganews Newsgroups
Subject: BWT, tuning for speed
Posted by:  cr88192 (cr881…
Date: Thu, 4 May 2006

since yesterday I have implemented the BWT transform (for the second time
really, but oh well). since first implemented, I have been able to get it
significantly faster, but still not as fast as I would hope (then again, it
is bwt after all).

originally, I had started with an in-place algo (dunno if it has a name)
where basically I skim forwards, exchanging the current value with anything
lower than itself, stepping to the next value, and repeating until done
(this algo was about O(n^2) afaik). this was followed by implementing

at present, it can transform a 256kB block in a little under 200ms
(optimizer settings), around 330 (debug settings) or 500ms (profiler

most of the time goes into the comparrison function, but at present I am
unsure if I can optimize it much further.

int bufcmp(byte *ibuf, int bs, int s0, int s1, int ml, int *rml)
    byte *s, *t, *e;
    int i, j, k;


    s=ibuf+(s0+ml); t=ibuf+(s1+ml);

    if(s>=e)s-=bs; if(t>=e)t-=bs;
    for(i=ml; i<j; i++)
        if(*s!=*t) { *rml=i; return((*s<*t)?-1:1); }
        s++; t++;

    if(s>=e)s-=bs; if(t>=e)t-=bs;
    for(; i<j; i++)
        if(*s!=*t) { *rml=i; return((*s<*t)?-1:1); }
        s++; t++;

    if(s>=e)s-=bs; if(t>=e)t-=bs;
    for(; i<bs; i++)
        if(*s!=*t) { *rml=i; return((*s<*t)?-1:1); }
        s++; t++;

    ibuf is the input block;
    bs is the size;
    s0 and s1 are the strings;
    ml is the current (known) prefix length;
    rml is the return length (used for figuring out the prefix len for
subsequent recursions).

one possible thought was eliminating the 'i' variable, but I don't know how
much payoff there would be (I would in-turn have to do more when figuring
the return length, which may counteract any real gain...).

the sort function itself is less interesting. it gets a fairly minor portion
of the runtime (about 20%), and seems to be operating within theoretical
expectations (so I probably can't do that much to it).

the algo is a variation of quicksort. 2 stacks are used for holding values
temporarily, and a 3rd buffer is used for holding the pivot (and anything
equivalent to the pivot, but this should only really happen much in rare
cases). this is done because there is some payoff in not needing to
subsequently sort sets containing the pivot.

originally, it was implemented such that it only needed 2 versions of the
strings buffer for sorting, and could alternate between them, but the
problem with this was that it requires 3 passes over the data for each
iteration, wheras using the stacks I can get it done in a single pass...

problem: a case could arrise where the stacks could overflow, which would be

linked lists also exist as another possibility for sorting, however I feel
less certain as the general overhead could be a lot higher. then again,
linked lists will still only require a single pass, and would not risk
overflowing (buffer-based) stacks (in a sufficiently bad case, however,
there would be risk of overflowing the system stack). I may try this...

the inverse bwt is also done (partly so I can compute a few sums and hashes
to verify that it still decodes right). at present the costs of the inverse
bwt are negligable.

plans for the rest of the coding:
I will probably not do MTF, as at present I don't trust MTF's decoding speed
(it could take longer to perform than both the huffman decoding and the
inverse BWT).
why: because of the need for a loop to shift the values, and the likelyhood
of shifting a number of values, for every value...

instead, I will probably use a "rotator" (dunno if there is a better term),
which will approximate MTF, but will also generally contain "holes" (a
detractor, but should be acceptable I think).

order: 0123
coding: 01230123

0: no change
1: 1 moves to front
3: 2 moves to front (as 2 is a hole where 1 was)
5: 3 moves to front (as both 3 and 4 are holes)
3: 0 moves to front
3: 1 moves to front
3: 2 moves to front
3: 3 moves to front

I at least feel more confident that this will perform quickly (having had
good enough success with them before on this front).

if succiciently fast, I could do full MTF for the front few entries,
otherwise falling back to a rotator, or just doing a mini-mtf (say, only 4
or 16 items).

of course, knowing for sure will require testing (eg: comparing the costs in
terms of both ratio and speed).

of course, most of this testing will first require implementing huffman (or
at least making something to estimate how well it would do given the output

or something...