Current File : //usr/share/doc/freeze/README |
FREEZE / MELT COMPRESSION PROGRAM
This version is tested under SunOS 4.1.2, Xenix 2.3.2, MS-DOS.
The following preprocessor symbols control the compilation of Freeze
package (these can be added by hand into config.h after running ./configure):
o COMPAT Turns on backwards compatibility
with Freeze 1.0 (have you ever heard
about it? :-) )
o FASTHASH Uses another hash function which gives
some speedup, thus reducing the compression
rate (!)
o TEXT_DEFAULT (For MS-DOS only!) Define, if you
freeze text files more often than binary
ones. In this case don't forget to use "-i"
to freeze binary files!!
o TINY For segmented architectures -
allows to build freeze in small model
(data segment is less than 64K).
o SMALL Decreases the amount of memory
but without 64K restriction
o DEFFILE Full name of the file with freeze's
Huffman values; it is the only
symbol defined in the Makefile
to allow correlation with installation place.
Other preprocessor symbols (DEBUG, GATHER_STAT) are for internal use
only.
The format of frozen (2.X) file is incompatible with that of frozen (1.0),
but if this package is compiled with -DCOMPAT switch, you will able to
unpack frozen (1.0) files, if you have them.
------------------- COMPILING -----------------------
Type "./configure", or "sh configure". If it runs successfully,
you will see messages "creating config.h" and "creating Makefile".
You should check them and probably you may have to make some corrections.
After that simply type "make".
------------------- LINT ----------------------------
Some lint's complaints about `used/declared inconsistently' are (in my
case) due to inconsistencies of /usr/include/* and /usr/lib/llib*ln. It
isn't dangerous.
------------------- BUGS ----------------------------
Please send descriptions of found bugs, incompatibilities, etc. to
leo@ipmce.su.
------------ SPEED & COMPRESSION RATE ---------------
Compression rate depends somewhat of the hash table size and may vary
with different static Huffman tables.
Note: if you see 16-bits Compress works nearly as Freeze (on some big files),
this means the maximum is gained, so LHA and ARJ won't do their job better
by more than 1-1.5%. There are some files (I have one) that freeze compresses
better than ARJ 2.20.
Please! If your computer supports the string operations, try to write
"asm" instructions (GNU style) which realize the memory comparison
"(s1, s2, maxlength) --> matched length" in the minimum number of clock
cycles. If a noticeable (5% or more) speedup is gained, please send
me a message.
--------------- POSSIBLE IMPROVEMENTS ---------------
The high-level routines (freeze, melt) are almost independent from
low-level routines (Get_Next_Match, Insert/Delete_Node,
Encode/Decode_Char/Position), so if you want the speed and/or compression
rate `a la vogue' you may replace the low-level routines with the homebrew
(e. g.) ones and enjoy the results.
(I tried to implement splay trees instead of Huffman ones and instead of
static table for position information, but this gives nothing, alas.)
--------- CALGARY COMPRESSION CORPUS RESULTS --------
41127 bib.F
340447 book1.F
229188 book2.F
68610 geo.F
155157 news.F
10551 obj1.F
86216 obj2.F
19924 paper1.F
32439 paper2.F
54993 pic.F
14180 progc.F
17136 progl.F
11771 progp.F
22903 trans.F
Average bits/byte on the standard set (except paper3-6) =
1104642 * 8 / 3141622 = 2.813
--------------- BACKGROUND INFORMATION -----------------
The format of a frozen file is as follows:
(version 1.0 had only 2-bytes header)
offset type value comment
0 byte 037 2 byte magic header
1 byte 0237 (version 1.0 - 0236)
2 short X (little endian)
4 byte Y
X = 0 e e e e e d d d d c c c b b a \
> [a-f] are binary digits
Y = 0 0 f f f f f f /
a - number of 1-bit static Huffman codes in the `matching positions'
table (see freeze.1)
bb - number of 2-bit codes,
etc.
The numbers of 7- and 8-bits codes are evaluated from the
conditions: sum(codes) = 62(dec), max code = 1111111(bin).
E.g. values are: 0 1 1 1 4 10 27 18, what means:
no 1-bit codes,
one 2-bit, 3-bit and 4-bit codes, etc., so Huffman codes are:
00, 010, 0110, 01110, 01111, 10000, .... , 11111111.
General format of frozen file is:
magic header - table description - stream of bits
The stream of bits is considered as a sequence of variable length dynamic
Huffman codes (if their values are in the range of 0-255, they mean single
bytes, special value of 256 means EOF, and further values mean the lengths
of matched string.) If we have the value greater than 256, we get a static
Huffman code from the stream (his value is 6 higher bits of matched
string's position in the buffer), and then we get 7 bits literally.
Because buffer length is 8192 bytes and maximum match length is 256 bytes,
the position of matched string cannot be greater than 8192-256, that's why
there is only (8192-256)/2^7 = 62 static codes.
* * *
The default table is tuned for both C texts and executable files (as in
LHARC). If you freeze any other files (databases, images, fonts,
etc.) you can calculate the matching positions distribution using the
`statist' program, which calculates and displays the mentioned
distribution for the given file. It is useful for large (100K or more)
files.
Though the built-in position table is polyvalent, the tuning can increase
the compression rate up to one additional percent. (Or even more, if the
matching strings distribution is very bizarre!)
Usage: statist < sample_file ; you can also see the intermediate values
and watch their changes by pressing INTR key when you wish.
Note: If you use "gensample | statist", remember that INTR influence BOTH
processes !!
You may create the /etc/default/freeze (or rather /usr/lib/freeze.cnf,
which is now the default, NOTE IT!) file (in MS-DOS it is FREEZE.CNF in
the directory of FREEZE.EXE), which has the following format: name =
``statist's output (8 numbers)'', e.g.:
---------- cut here -----------
# This is freeze's defaults file
gif = 0 0 0 0 2 60 0 0 # This is NOT! optimal data
# for GIF files
doc=0 0 1 2 7 16 36 0 # The sample was gcc.lp
greedy2=0,1,1,2,5,8,11,34 # Both space- and comma-separated
# lists are allowed
# End of file
---------- cut here -----------
If you find values which are better THAN DEFAULT both for text (C
programs) and binary (executable) files, please send them to me.
Important note: statist.c is NOT a part of freeze package, it is an
additional feature.
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
There is yet another additional feature since this version:
showhuf - shows Huffman table in a frozen file.
It can be used to find files frozen with freeze 1.0, too.