Benford's law

From Wikipedia:

Benford's law, also known as the Newcomb–Benford law, the law of anomalous
numbers, or the first-digit law, is an observation that in many real-life sets
of numerical data, the leading digit is likely to be small.[1] In sets that
obey the law, the number 1 appears as the leading significant digit about 30%
of the time, while 9 appears as the leading significant digit less than 5% of
the time. Uniformly distributed digits would each occur about 11.1% of the
time.[2] Benford's law also makes predictions about the distribution of second
digits, third digits, digit combinations, and so on.
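
More precisely, Benford's law predicts that the first digit d appears with probability log10(1 + 1/d). A minimal sketch to print these expected frequencies:

#!/usr/bin/env python3

# print the first-digit frequencies predicted by Benford's law:
# P(d) = log10(1 + 1/d)

import math

for d in range(1, 10):
    print (d, f"{math.log10(1 + 1/d)*100:.1f}%")

This gives ~30.1% for 1, dropping to ~4.6% for 9, which is what the quoted figures above refer to.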

Obeys

Linux kernel source code

Kernel 6.15.2:

% find . -type f -exec stat -c '%s' {} \; | cut -b1 | sort | uniq -c
     29 0
  26803 1
  15715 2
  11170 3
   8548 4
   7021 5
   5860 6
   5046 7
   4598 8
   3974 9
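
To see how close this is to Benford's prediction, the counts can be converted to percentages and compared with the expected values; a small sketch with the counts above hard-coded (the 29 zero-length files are left out, since the law only concerns nonzero leading digits):

#!/usr/bin/env python3

# compare the observed first-digit counts (linux-6.15.2 file sizes, above)
# with the frequencies predicted by Benford's law

import math

observed = [26803, 15715, 11170, 8548, 7021, 5860, 5046, 4598, 3974]  # digits 1..9
total = sum(observed)

for d, cnt in enumerate(observed, start=1):
    expected = math.log10(1 + 1/d) * 100
    print (f"{d}: observed {cnt/total*100:4.1f}%, expected {expected:4.1f}%")

For the digit 1 this gives ~30.2% observed vs. ~30.1% expected.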

In another base:

#!/usr/bin/env python3

# this is first_digit.py

import sys

# read a number from each line of stdin, convert it to hex (or octal)
# and print its first digit

for line in sys.stdin:
    line=line.rstrip()
    print (hex(int(line)).removeprefix("0x")[0]) # hexadecimal
    #print (oct(int(line)).removeprefix("0o")[0]) # octal
% find linux-6.15.2 -type f -exec stat -c '%s' {} \; | ./first_digit.py | sort | uniq -c
     29 0
  22342 1
  12646 2
   9141 3
   7134 4
   6202 5
   4956 6
   4285 7
   3812 8
   3340 9
   3004 a
   2727 b
   2595 c
   2335 d
   2157 e
   2059 f
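
Benford's law generalizes to other bases: in base b the first digit d is expected with probability log_b(1 + 1/d). A small sketch of the expected base-16 frequencies, to compare with the counts above:

#!/usr/bin/env python3

# expected first-digit frequencies in base 16: P(d) = log16(1 + 1/d)

import math

for d in range(1, 16):
    print (f"{d:x}: {math.log(1 + 1/d, 16)*100:.1f}%")

The leading hex digit 1 is expected ~25% of the time; above it appears 22342 times out of 88735 nonzero sizes, i.e. ~25.2%.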

English Wikipedia dump files

enwiki-20250601-pages-meta-current*.bz2 files.

% bzcat enwiki*.bz2 | grep "<text bytes=" | cut -d '"' -f2 | cut -b1 | sort | uniq -c
  222461 0
16515615 1
 6113405 2
 4617137 3
 3366702 4
 3281269 5
 2813740 6
 2245301 7
 1895013 8
 1365120 9

Random *.fb2 files from my library

6300 files.

 % cat fb2_sizes | cut -b1 | sort | uniq -c
   1861 1
   1013 2
    775 3
    618 4
    536 5
    403 6
    405 7
    374 8
    315 9

Random .pdf files from my library

30194 files.

 % cat pdf_sizes | cut -b1 | sort | uniq -c
      3 0
   9649 1
   4930 2
   3387 3
   2753 4
   2355 5
   2027 6
   1800 7
   1712 8
   1578 9

Somewhat obeys

Gutenberg library .txt files

... from:

% curl -I https://www.gutenberg.org/cache/epub/feeds/txt-files.tar.zip
HTTP/2 200
date: Mon, 11 Aug 2025 16:30:58 GMT
server: Apache
last-modified: Sun, 10 Aug 2025 23:12:33 GMT
accept-ranges: bytes
content-length: 10492780014

72299 .txt files in it.

  16607 1
  11584 2
  11294 3
   9348 4
   7657 5
   5305 6
   4226 7
   3370 8
   2908 9

Possibly because each .txt file in this library has a header (a disclaimer and so on) and also a footer. See, for example: pg1661.txt.
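
One way to see why a constant header and footer would push the distribution away from Benford's law is to simulate it: take sizes that span several orders of magnitude (and therefore follow the law closely) and add a fixed offset to each. This is only an illustration of the hypothesis, with made-up numbers, not the actual Gutenberg data:

#!/usr/bin/env python3

# illustration: adding a constant offset (a fixed "header/footer") to sizes
# that otherwise follow Benford's law skews the first-digit distribution

import random
from collections import Counter

random.seed(1)

# body sizes from 1 KB to 1 MB, log-uniform, so they obey Benford's law
sizes = [int(10 ** random.uniform(3, 6)) for _ in range(100000)]

for offset in (0, 20000):  # 0 bytes vs. ~20 KB of boilerplate per file
    counts = Counter(str(s + offset)[0] for s in sizes)
    print (f"offset={offset}:", [counts[str(d)] for d in range(1, 10)])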

Disobeys

File archive on my largest file server (or NAS)

29,671,504 files total. Including backups, etc.

1296148 0
7770107 1
4483082 2
3303982 3
4755441 4
1986657 5
1737312 6
1644824 7
1357922 8
1335987 9

HTML files from mail.haskell.org

      1 0
  29903 1
  80357 2
 150948 3
 120394 4
  49564 5
  26394 6
  16648 7
  12103 8
   7856 9

Disobeys, because almost all HTML files have a nearly constant header/footer, e.g. this.

Let's try to make a histogram, using this script:

#!/usr/bin/env python3

# this is the histo.py file.
# usage: ./histo.py <bin_size> <threshold>
# reads numbers from stdin, bins them and prints every bin
# with more than <threshold> entries

import sys
from collections import defaultdict

bin_size=int(sys.argv[1])
threshold=int(sys.argv[2])

data=[]

for x in sys.stdin:
    try:
        data.append(int(x.rstrip()))
    except ValueError:
        pass # skip non-numeric lines

print (f"{bin_size=}")
bins=defaultdict(int)

for x in data:
    bins[x//bin_size]+=1

for b in sorted(bins.keys()):
    if bins[b]>threshold:
        print (b*bin_size, (b*bin_size)+(bin_size-1), bins[b])
 % cat mail.haskell.org\ html_sizes | ./histo.py 1000 100
bin_size=1000
1000 1999 5159
2000 2999 74668
3000 3999 148405
4000 4999 119005
5000 5999 48188
6000 6999 25722
7000 7999 15978
8000 8999 11711
9000 9999 7533
10000 10999 5689
11000 11999 4240
12000 12999 2935
...

The majority of HTML files are between ~3000 and ~5000 bytes in size.

Random .jpg files from my art collection

19753 files.

Disobeys:

 % cat art_jpg | cut -b1 | sort | uniq -c
   5435 1
   2629 2
   3026 3
   2055 4
   1436 5
   1264 6
   1322 7
   1362 8
   1224 9

... because the majority of files in my collection lean towards ~300-500 kilobytes in size.

 % cat art_jpg | ./histo.py 100000 100
bin_size=100000
0 99999 265
100000 199999 914
200000 299999 1879
300000 399999 2517
400000 499999 1673
500000 599999 1154
600000 699999 1031
700000 799999 1082
800000 899999 1195
900000 999999 1083
1000000 1099999 928
1100000 1199999 839
1200000 1299999 627
1300000 1399999 461
1400000 1499999 382
...

My emails in Maildir form

368014 files, collected/archived over 25 years, including mailing lists and GitHub notifications.

     24 0
  78898 1
  29551 2
  43917 3
  82128 4
  70098 5
  27512 6
  14795 7
  11186 8
   9905 9

Disobeys, because each email message has a header.

 % cat Maildir_main_sizes | ./histo.py 1000 100
bin_size=1000
0 999 8831
1000 1999 54276
2000 2999 18162
3000 3999 36359
4000 4999 79521
5000 5999 68033
6000 6999 23912
7000 7999 12164
8000 8999 8866
9000 9999 6608
10000 10999 4595
11000 11999 3413
12000 12999 2849
...

As you can see, the majority of emails are, again, between ~4000 and ~6000 bytes.

Random web pages from Common Crawl archive

(WARC files)

From CC-MAIN-20250327054221-20250327084221-00381.warc.gz.

     66 0
  19791 1
  37864 2
  25713 3
   6191 4
   4369 5
   3697 6
   3226 7
   2880 8
   2767 9

Disobeys, because the majority of pages are small, under ~10 kilobytes (and, judging by the leading digits above, mostly ~2-4 kilobytes):

 % ... | ./histo.py 10000 100
bin_size=10000
0 9999 56223
10000 19999 3031
20000 29999 3396
30000 39999 3263
40000 49999 3155
50000 59999 3026
60000 69999 2762
70000 79999 2521
80000 89999 2225
90000 99999 1990
100000 109999 1844
110000 119999 1618
120000 129999 1503
130000 139999 1396
140000 149999 1219
150000 159999 1163
...

Random web pages from Common Crawl archive, in text form

(WET files)

CC-MAIN-20250720175629-20250720205629-00999.warc.wet.gz file.

   6276 1
   3749 2
   2843 3
   2359 4
   2027 5
   1676 6
   1443 7
   2199 8
   1011 9

Disobeys; the majority of web pages in text form are ~1-2 kilobytes in size:

 % ... | ./histo.py 500 100
bin_size=500
0 499 2191
500 999 1209
1000 1499 1403
1500 1999 1463
2000 2499 1404
2500 2999 1315
3000 3499 1242
3500 3999 1075
4000 4499 1011
4500 4999 955
5000 5499 872
5500 5999 825
6000 6499 704
...

*.mp3 files from my music collection

125644 files.

  59715 1
  10187 2
   6001 3
   5828 4
   6281 5
   7325 6
   8850 7
  10009 8
  11447 9

Disobeys, because:

 % cat mp3_sizes | ./histo.py 1000000 1000
bin_size=1000000
1000000 1999999 2678
2000000 2999999 3934
3000000 3999999 4489
4000000 4999999 5250
5000000 5999999 5998
6000000 6999999 7085
7000000 7999999 8683
8000000 8999999 9817
9000000 9999999 11248
10000000 10999999 12046
11000000 11999999 10607
12000000 12999999 8792
13000000 13999999 6983
14000000 14999999 5321
15000000 15999999 3990
...

*.flac files from my music collection

15160 files.

   2883 1
   4484 2
   3713 3
   1586 4
    786 5
    559 6
    391 7
    442 8
    316 9

Disobeys, because:

 % cat flac_sizes | ./histo.py 10000000 1
bin_size=10000000
0 9999999 799
10000000 19999999 2207
20000000 29999999 4107
30000000 39999999 3415
40000000 49999999 1414
50000000 59999999 646
60000000 69999999 450
70000000 79999999 286
80000000 89999999 251
90000000 99999999 171
100000000 109999999 139
110000000 119999999 119
120000000 129999999 89
130000000 139999999 55
140000000 149999999 59
...

PRNG

#!/usr/bin/env python3

# print the first decimal digit of 100000 uniformly distributed random
# numbers (pipe the output through sort | uniq -c to get the counts below)

import random

for _ in range(100000):
    x=random.randrange(0, 100000000)
    print (str(x)[0])
  11199 1
  10945 2
  11077 3
  11174 4
  11048 5
  11251 6
  10948 7
  11191 8
  11167 9
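
The uniform sample disobeys the law: 90% of the values drawn from [0, 100000000) have eight digits, and among those every leading digit is equally likely, hence roughly 11.1% each. For contrast, a pseudorandom sample spanning many orders of magnitude, e.g. drawn log-uniformly, lands close to Benford's predictions; a minimal sketch (not part of the experiment above):

#!/usr/bin/env python3

# contrast: log-uniformly distributed values (spanning 8 orders of magnitude)
# do follow Benford's law, unlike the uniform sample above

import random
from collections import Counter

random.seed(1)

counts = Counter(str(int(10 ** random.uniform(0, 8)))[0] for _ in range(100000))

for d in range(1, 10):
    print (f"{counts[str(d)]:7} {d}")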

(the post was first published on 20250908.)

