Impact of the order of files in a tar archive on compressed size

In David Kopec's book 'Classic Computer Science Problems in Python' I found an example where the author seeks an order of text strings such that zlib compresses them as tightly as possible, using a genetic algorithm. I took the liberty of cutting 3 pages from the book, see here.
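
The core trick is simply to use the size of the zlib-compressed concatenation as the cost of a candidate ordering. A minimal sketch of such a cost function (my own illustration, not the book's code):

import zlib

def compressed_size(strings):
    # cost of a candidate ordering: the smaller, the better
    return len(zlib.compress("".join(strings).encode("utf-8"), 9))

print(compressed_size(["abcd", "xyzw", "abcdxyzw"]))
print(compressed_size(["abcd", "abcdxyzw", "xyzw"]))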

So I thought I could do the same with the tar utility, because it adds files in exactly the order they are given on the command line. After packing, I compress the resulting tar with gzip, bzip2 and xz. Unlike the book's GA, I used my favorite simulated annealing algorithm.

I took this Python library for SA; install it with pip install simanneal.

# -*- coding: utf-8 -*-
from __future__ import print_function
import math, os, sys, random
from simanneal import Annealer

path=sys.argv[1]
# non-recursive: take only the files in the top level of the directory
files=list(os.walk(path))[0][2]

# 3rd element of tuple: number of steps
# increase for better results. will work slower, though.
tests=[
        ("gzip --fast", "gz", 2000),
        ("gzip --best", "gz", 2000),
        ("bzip2 --fast", "bz2", 2000),
        ("bzip2 --best", "bz2", 2000),
        ("xz -z --fast", "xz", 2000),
        ("xz -z --best", "xz", 2000)
]

def get_size_of_archive(state, min_or_max, test):
    # pack the files in the order given by the current state; tar stores them
    # in exactly the order they are listed on the command line
    cmdline="tar cf tmp.tar "
    for f in state:
        cmdline=cmdline+path+"/'"+f+"' "
    os.system(cmdline)

    # the compressor replaces tmp.tar with tmp.tar.gz/.bz2/.xz
    compressed_fname="tmp.tar."+test[1]
    os.system(test[0]+" tmp.tar")
    fsize=os.path.getsize(compressed_fname)
    # simanneal always minimizes energy, so negate the size when maximizing
    if not min_or_max:
        rt=fsize
    else:
        rt=-fsize
    os.system("rm "+compressed_fname)

    return rt

class FindPermutation(Annealer):

    def __init__(self, state, min_or_max, test):
        self.min_or_max=min_or_max
        self.test=test
        super(FindPermutation, self).__init__(state)  # important!

    def move(self):
        initial_energy = self.energy()

        # swap two randomly chosen files in the permutation
        a = random.randint(0, len(self.state) - 1)
        b = random.randint(0, len(self.state) - 1)
        self.state[a], self.state[b] = self.state[b], self.state[a]

        # simanneal accepts the energy delta as move()'s return value
        return self.energy() - initial_energy

    def energy(self):
        return get_size_of_archive(self.state, self.min_or_max, self.test)

def process(min_or_max, test):

    # initial state is a random permutation; copy the list so the global
    # one is not reordered in place
    init_state = files[:]
    random.shuffle(init_state)

    tmp = FindPermutation(init_state, min_or_max, test)
    tmp.steps=test[2]
    tmp.copy_strategy = "slice"
    state, e = tmp.anneal()

    print (state)
    print ("size:", abs(e))
    return abs(e) 

for test in tests:
    print ("Command:", test[0])
    print ("Minimizing...")
    _min=process(False, test)
    print ("Maximizing...")
    _max=process(True, test)

    print ("Delta=", _max-_min)

Since I'm searching for a good permutation of files, I took this TSP example as a basis. Indeed, seeking the shortest route in TSP is also a problem of searching for a good permutation.

As test data, I used sets of small files: random executables from my Linux box, C source code, text files, HTML files; 14-23 files per set. The maximal total size of all files in each test set is ~300KB.

Sample output of my search utility:

Command: gzip --fast
Minimizing...
['textconf.c', 'textconf.h', 'help.h', 'help.c', 'clipboard.c', 'events_init.c', 'selcodepage.c', 'usermenu.h', 'clipboard.h', 'util.c', 'main.c', 'selcodepage.h', 'args.h', 'background.h', 'execute.c', 'execute.h', 'keybind-defaults.h', 'history.h', 'events_init.h', 'learn.h']
size: 29566
Maximizing...
['background.h', 'args.h', 'events_init.c', 'clipboard.c', 'execute.c', 'textconf.h', 'selcodepage.h', 'help.c', 'execute.h', 'util.c', 'events_init.h', 'main.c', 'keybind-defaults.h', 'usermenu.h', 'clipboard.h', 'history.h', 'learn.h', 'help.h', 'selcodepage.c', 'textconf.c']
size: 32000
Delta= 2434

See the table below. The numbers are: min size; max size; percentage change.

| compressor   | 14 executables        | 16 HTMLs           | 20 source codes     | 23 texts           |
|--------------+-----------------------+--------------------+---------------------+--------------------|
| gzip --fast  | 110113; 115609; 4.9%  | 59370; 63136; 6.3% | 29566; 32000; 8.2%  | 25820; 26825; 3.8% |
| gzip --best  | 99644;  107395; 7.7%  | 44438; 45282; 1.8% | 23127; 25837; 11.7% | 21059; 22074; 4.8% |
| bzip2 --fast | 95145;  101918; 7.1%  | 46031; 47298; 2.7% | 20984; 21050; 0.3%  | 18470; 18546; 0.4% |
| bzip2 --best | 92867;  92936;  0.07% | 38802; 38914; 0.2% | 20985; 21050; 0.3%  | 18472; 18543; 0.3% |
| xz -z --fast | 85560;  87544;  2.3%  | 46668; 47416; 1.6% | 24496; 25164; 2.7%  | 22104; 22728; 2.8% |
| xz -z --best | 79376;  80500;  1.4%  | 39836; 40232; 0.9% | 21084; 21340; 1.2%  | 19444; 19772; 1.6% |

(Percentage change is calculated using this formula.)
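
Judging by the numbers in the table, it amounts to $\frac{max - min}{min} \cdot 100\%$; e.g. for gzip --best on the source codes set: $\frac{25837 - 23127}{23127} \cdot 100\% \approx 11.7\%$.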

As you can see, bzip2 (especially at --best) is much less sensitive to the order of files in the tar archive than gzip and xz. Perhaps because bzip2 is BWT-based and sorts the data within each block anyway: with --best a block is up to 900 KB, so a whole ~300 KB test set fits into a single block and the original order hardly matters.
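
This is easy to check directly with Python's zlib and bz2 modules as stand-ins for gzip and bzip2 (a toy experiment of mine, not part of the utility above): compress a number of random permutations of the same chunks and look at the spread of the compressed sizes.

import bz2, random, zlib

# ten pairs of nearly identical 4 KB chunks; whether the "twin" chunks land
# close enough to each other (within zlib's 32 KB window) depends on the order
random.seed(1)
base = [bytes(random.randrange(256) for _ in range(4000)) for _ in range(10)]
chunks = [b + bytes([i]) for b in base for i in (0, 1)]

def size_spread(compress):
    sizes = []
    for _ in range(30):
        random.shuffle(chunks)
        sizes.append(len(compress(b"".join(chunks))))
    return max(sizes) - min(sizes)  # order sensitivity, in bytes

print("zlib spread:", size_spread(lambda d: zlib.compress(d, 9)))
print("bz2 spread: ", size_spread(lambda d: bz2.compress(d, 9)))

If I'm right, zlib should show a much larger spread here, since a pair of twin chunks only compresses well when both fit into the same 32 KB window.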

In other words, you can shrink the resulting compressed file by maybe ~10%, but this works only for small file sets. Also, finding a good order/permutation is slow. If someone really needed this in practice, they should build the tar file and compress it in memory, without invoking a shell, of course.
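
For example, the whole measurement can be done in-process with the tarfile, io and gzip modules from the standard library (a sketch of the idea, not the code I benchmarked):

import gzip, io, os, tarfile

def gz_size(path, ordered_names):
    # build the tar archive in memory, honoring the given file order
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name in ordered_names:
            tar.add(os.path.join(path, name), arcname=name)
    # compress in memory and return the compressed size, i.e. the energy
    return len(gzip.compress(buf.getvalue(), compresslevel=9))

This avoids a fork/exec and disk I/O on every annealing step, which dominate the running time for file sets this small.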

Enumerating all permutations of 23 files would mean checking $23! \approx 2^{74}$ orderings, which is way too much.
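
A quick sanity check of that estimate:

import math

n = math.factorial(23)
print(n)             # 25852016738884976640000
print(math.log2(n))  # ~74.5

Even at a thousand archives per second, trying them all would take many times the age of the universe.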

All the files, including the test data (files are packed in arbitrary order this time).

UPD. See also: Why are tar.xz files 15x smaller when using Python's tar compared to macOS tar?

UPD2: at reddit.

