From Wikipedia:
Benford's law, also known as the Newcomb–Benford law, the law of anomalous numbers, or the first-digit law, is an observation that in many real-life sets of numerical data, the leading digit is likely to be small.[1] In sets that obey the law, the number 1 appears as the leading significant digit about 30% of the time, while 9 appears as the leading significant digit less than 5% of the time. Uniformly distributed digits would each occur about 11.1% of the time.[2] Benford's law also makes predictions about the distribution of second digits, third digits, digit combinations, and so on.
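Where do these percentages come from? Benford's law predicts P(d) = log10(1 + 1/d). A quick sketch of mine (not from Wikipedia) to print the predicted frequencies:

#!/usr/bin/env python3
# print the first-digit probabilities predicted by Benford's law:
# P(d) = log10(1 + 1/d)
import math

for d in range(1, 10):
    print(f"{d}: {math.log10(1 + 1/d)*100:.1f}%")

It prints ~30.1% for the digit 1, going down to ~4.6% for the digit 9.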
Linux kernel 6.15.2 source tree (file sizes in bytes):
% find . -type f -exec stat -c '%s' {} \; | cut -b1 | sort | uniq -c
     29 0
  26803 1
  15715 2
  11170 3
   8548 4
   7021 5
   5860 6
   5046 7
   4598 8
   3974 9
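Close to Benford's prediction. A quick sanity check of mine (the counts are hardcoded from the output above):

#!/usr/bin/env python3
# compare the observed kernel-tree counts (hardcoded from the output above)
# with the counts Benford's law predicts for the same total
import math

observed={1:26803, 2:15715, 3:11170, 4:8548, 5:7021,
          6:5860, 7:5046, 8:4598, 9:3974}
total=sum(observed.values())
for d in range(1, 10):
    print(f"{d}: observed={observed[d]}, expected={total*math.log10(1 + 1/d):.0f}")

For the digit 1 it prints ~26712 expected vs 26803 observed.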
In other bases:
#!/usr/bin/env python3
# this is first_digit.py
# convert each number to hex (or oct) and print its first character
import sys

for tmp in sys.stdin:
    tmp=tmp.rstrip()
    print(hex(int(tmp)).removeprefix("0x")[0])   # hexadecimal
    #print(oct(int(tmp)).removeprefix("0o")[0])  # octal
% find linux-6.15.2 -type f -exec stat -c '%s' {} \; | ./first_digit.py | sort | uniq -c
     29 0
  22342 1
  12646 2
   9141 3
   7134 4
   6202 5
   4956 6
   4285 7
   3812 8
   3340 9
   3004 a
   2727 b
   2595 c
   2335 d
   2157 e
   2059 f
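In base b, Benford's law generalizes to P(d) = log_b(1 + 1/d). A sketch for base 16:

#!/usr/bin/env python3
# first-digit probabilities in base 16: P(d) = log16(1 + 1/d)
import math

for d in range(1, 16):
    print(f"{d:x}: {math.log(1 + 1/d, 16)*100:.1f}%")

For the digit 1 this gives exactly 25%; the observed 22342 out of ~88.7k files above is ~25.2%.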
English Wikipedia dump, enwiki-20250601-pages-meta-current*.bz2 files; the sizes come from the <text bytes=...> attributes:
% bzcat enwiki*.bz2 | grep "<text bytes=" | cut -d '"' -f2 | cut -b1 | sort | uniq -c
  222461 0
16515615 1
 6113405 2
 4617137 3
 3366702 4
 3281269 5
 2813740 6
 2245301 7
 1895013 8
 1365120 9
6300 .fb2 (e-book) files.
% cat fb2_sizes | cut -b1 | sort | uniq -c
   1861 1
   1013 2
    775 3
    618 4
    536 5
    403 6
    405 7
    374 8
    315 9
30194 .pdf files.
% cat pdf_sizes | cut -b1 | sort | uniq -c
      3 0
   9649 1
   4930 2
   3387 3
   2753 4
   2355 5
   2027 6
   1800 7
   1712 8
   1578 9
Project Gutenberg .txt files, from:
% curl -I https://www.gutenberg.org/cache/epub/feeds/txt-files.tar.zip
HTTP/2 200
date: Mon, 11 Aug 2025 16:30:58 GMT
server: Apache
last-modified: Sun, 10 Aug 2025 23:12:33 GMT
accept-ranges: bytes
content-length: 10492780014
There are 72299 .txt files inside:
  16607 1
  11584 2
  11294 3
   9348 4
   7657 5
   5305 6
   4226 7
   3370 8
   2908 9
It deviates, possibly because each .txt file in this library has a header (a disclaimer or whatever) and also a footer. See, for example: pg1661.txt.
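This header/footer effect is easy to simulate. A sketch (all sizes and the overhead here are made up, purely for illustration): generate Benford-like (log-uniform) sizes, add a fixed overhead to each, and compare the first-digit counts:

#!/usr/bin/env python3
# simulate the effect of a constant header/footer: take Benford-like
# (log-uniform) sizes, add a fixed overhead to each, count first digits;
# the size range and the 20000-byte overhead are made up for illustration
import random
from collections import Counter

def first_digits(sizes):
    c=Counter(str(s)[0] for s in sizes)
    return sorted(c.items())

sizes=[int(10**random.uniform(3, 6)) for _ in range(100000)]
print("no overhead:  ", first_digits(sizes))
print("+20k overhead:", first_digits(s+20000 for s in sizes))

With the overhead added, all the small files get pushed into the 2xxxx range, so the distribution skews away from Benford.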
29,671,504 files total, including backups, etc.
 1296148 0
 7770107 1
 4483082 2
 3303982 3
 4755441 4
 1986657 5
 1737312 6
 1644824 7
 1357922 8
 1335987 9
HTML files from the mail.haskell.org archives:

      1 0
  29903 1
  80357 2
 150948 3
 120394 4
  49564 5
  26394 6
  16648 7
  12103 8
   7856 9
Disobeys, because almost all of these HTML files have an almost constant header/footer.
Let's try to make a histogram, using this script:
#!/usr/bin/env python3
# this is the histo.py file.
import sys
from collections import defaultdict

data=[]
for x in sys.stdin:
    try:
        x=int(x.rstrip())
        data.append(x)
    except ValueError:
        pass # skip non-numeric lines

bin_size=int(sys.argv[1])
threshold=int(sys.argv[2])
print(f"{bin_size=}")

bins=defaultdict(int)
for x in data:
    bins[x//bin_size]+=1
for bin in sorted(bins.keys()):
    if bins[bin]>threshold:
        print(bin*bin_size, (bin*bin_size)+(bin_size-1), bins[bin])
% cat mail.haskell.org\ html_sizes | ./histo.py 1000 100
bin_size=1000
1000 1999 5159
2000 2999 74668
3000 3999 148405
4000 4999 119005
5000 5999 48188
6000 6999 25722
7000 7999 15978
8000 8999 11711
9000 9999 7533
10000 10999 5689
11000 11999 4240
12000 12999 2935
...
The majority of these HTML files are between ~3000 and ~5000 bytes in size.
19753 .jpg files.
Disobeys:
% cat art_jpg | cut -b1 | sort | uniq -c
   5435 1
   2629 2
   3026 3
   2055 4
   1436 5
   1264 6
   1322 7
   1362 8
   1224 9
... because the majority of files in my collection lean toward ~300-500 kilobytes in size:
% cat art_jpg | ./histo.py 100000 100
bin_size=100000
0 99999 265
100000 199999 914
200000 299999 1879
300000 399999 2517
400000 499999 1673
500000 599999 1154
600000 699999 1031
700000 799999 1082
800000 899999 1195
900000 999999 1083
1000000 1099999 928
1100000 1199999 839
1200000 1299999 627
1300000 1399999 461
1400000 1499999 382
...
Emails: 368014 files, collected/archived over 25 years, including mailing lists and GitHub notifications.
     24 0
  78898 1
  29551 2
  43917 3
  82128 4
  70098 5
  27512 6
  14795 7
  11186 8
   9905 9
Disobeys, because each email message has a header.
% cat Maildir_main_sizes | ./histo.py 1000 100
bin_size=1000
0 999 8831
1000 1999 54276
2000 2999 18162
3000 3999 36359
4000 4999 79521
5000 5999 68033
6000 6999 23912
7000 7999 12164
8000 8999 8866
9000 9999 6608
10000 10999 4595
11000 11999 3413
12000 12999 2849
...
You see, the majority of emails, again, fall between ~4000 and ~6000 bytes.
(WARC files)
From the Common Crawl file CC-MAIN-20250327054221-20250327084221-00381.warc.gz.
     66 0
  19791 1
  37864 2
  25713 3
   6191 4
   4369 5
   3697 6
   3226 7
   2880 8
   2767 9
Disobeys: the majority of pages are under ~10 kilobytes in size (mostly ~2-4 kilobytes, judging by the first-digit counts above):
% ... | ./histo.py 10000 100
bin_size=10000
0 9999 56223
10000 19999 3031
20000 29999 3396
30000 39999 3263
40000 49999 3155
50000 59999 3026
60000 69999 2762
70000 79999 2521
80000 89999 2225
90000 99999 1990
100000 109999 1844
110000 119999 1618
120000 129999 1503
130000 139999 1396
140000 149999 1219
150000 159999 1163
...
(WET files)
From the CC-MAIN-20250720175629-20250720205629-00999.warc.wet.gz file.
   6276 1
   3749 2
   2843 3
   2359 4
   2027 5
   1676 6
   1443 7
   2199 8
   1011 9
The majority of web pages in text form are ~1-2 kilobytes in size:
% ... | ./histo.py 500 100
bin_size=500
0 499 2191
500 999 1209
1000 1499 1403
1500 1999 1463
2000 2499 1404
2500 2999 1315
3000 3499 1242
3500 3999 1075
4000 4499 1011
4500 4999 955
5000 5499 872
5500 5999 825
6000 6499 704
...
125644 .mp3 files.
  59715 1
  10187 2
   6001 3
   5828 4
   6281 5
   7325 6
   8850 7
  10009 8
  11447 9
Disobeys, because:
% cat mp3_sizes | ./histo.py 1000000 1000
bin_size=1000000
1000000 1999999 2678
2000000 2999999 3934
3000000 3999999 4489
4000000 4999999 5250
5000000 5999999 5998
6000000 6999999 7085
7000000 7999999 8683
8000000 8999999 9817
9000000 9999999 11248
10000000 10999999 12046
11000000 11999999 10607
12000000 12999999 8792
13000000 13999999 6983
14000000 14999999 5321
15000000 15999999 3990
...
15160 .flac files.
   2883 1
   4484 2
   3713 3
   1586 4
    786 5
    559 6
    391 7
    442 8
    316 9
Disobeys, because:
% cat flac_sizes | ./histo.py 10000000 1
bin_size=10000000
0 9999999 799
10000000 19999999 2207
20000000 29999999 4107
30000000 39999999 3415
40000000 49999999 1414
50000000 59999999 646
60000000 69999999 450
70000000 79999999 286
80000000 89999999 251
90000000 99999999 171
100000000 109999999 139
110000000 119999999 119
120000000 129999999 89
130000000 139999999 55
140000000 149999999 59
...
Compare with uniformly distributed random numbers:

#!/usr/bin/env python3
# generate uniformly distributed random numbers and print their first digits
import random

for _ in range(100000):
    x=random.randrange(0, 100000000)
    print(str(x)[0])
  11199 1
  10945 2
  11077 3
  11174 4
  11048 5
  11251 6
  10948 7
  11191 8
  11167 9
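Each digit lands at ~11.1%, just as the Wikipedia quote says for uniformly distributed digits. The flip side: numbers that are uniform on a logarithmic scale do obey Benford's law. A sketch:

#!/usr/bin/env python3
# numbers uniform on a logarithmic scale do follow Benford's law
import random
from collections import Counter

c=Counter(str(int(10**random.uniform(0, 8)))[0] for _ in range(100000))
for d in sorted(c):
    print(c[d], d)

This prints roughly 30100 ones, 17600 twos, and so on, matching log10(1 + 1/d).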
Some time ago (before 24-Mar-2025) there was a Disqus JS script for comments here. I dropped it --- it was so motley, distracting and animated, with too many ads. I never liked it. Also, comments didn't appear correctly (Disqus was buggy). Also, my blog is too chamberlike --- not many people write comments here. So I decided to switch back to the model I had at least in 2020 --- send me your comments by email (don't forget to include the URL of this blog post) and I'll copy&paste them here manually.
Let's party like it's ~1993-1996, in this ultimate, radical and uncompromisingly primitive pre-web1.0-style blog and website.