Let's hash input strings using CRC32 and add the hashes to a table. After how many random input strings will duplicates appear?
Take some dictionary of English words...
#!/usr/bin/env python3
# read words from stdin, hash each with CRC32, stop at the first duplicate hash
import sys, zlib

tbl=set()
for x in sys.stdin:
    x=x.rstrip()
    hash=zlib.crc32(x.encode('utf-8'))
    if hash in tbl:
        print (f"Hash in table already. Its size is {len(tbl)} now.")
        exit(0)
    tbl.add(hash)
% ./1.py < words.txt
Hash in table already. Its size is 39542 now.
% ./1.py < words_alpha.txt
Hash in table already. Its size is 61687 now.
% ./1.py < words_dictionary.json
Hash in table already. Its size is 23820 now.
Unexpectedly -- all these numbers are below \( \approx 2^{16} \). This is the birthday bound at work: for an n-bit hash, the first duplicate is expected after roughly \( 2^{n/2} \) random inputs, and for the 32-bit CRC32 that is \( \sqrt{2^{32}} = 2^{16} = 65536 \).
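Out of curiosity, we can check that this is not a quirk of CRC32 or of English words. A minimal sketch (my own addition, not part of the original experiment; the first_repeat_index name is made up): feed the table uniformly random 32-bit values instead and count how many can be inserted before the first repeat. The counts typically land in the same tens-of-thousands range.

#!/usr/bin/env python3
# Sanity check: draw uniformly random 32-bit values and count how many
# fit into the set before the first repeat appears.
import random

def first_repeat_index(bits=32):
    seen=set()
    while True:
        v=random.getrandbits(bits)
        if v in seen:
            return len(seen)   # table size at the moment of the first duplicate
        seen.add(v)

for _ in range(5):
    print(first_repeat_index())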
In real life -- this happens when a string stored in an RDBMS gets its CRC32 stored alongside it, because finding a (long) string by its hash is faster. We won't argue whether this is a good approach or a bad one, but now you know a table size that will not cause problems -- fewer than \( \approx 2^{16} \) elements.
By the same logic, a wider hash pushes the bound out: with a 128-bit hash, collisions may appear only after \( \approx 2^{64} \) elements. It's hard to imagine a table, file or RDBMS that huge. But in our time of huge data centers and distributed file systems... who knows...
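As an illustration of the RDBMS trick above, here is a minimal sketch (my own example, nothing database-specific; the HashIndexedStore name and its methods are invented): keep strings in buckets keyed by their CRC32 and, on lookup, compare the candidate strings in full, so a hash collision only costs a few extra comparisons and never returns a wrong answer.

#!/usr/bin/env python3
# Hypothetical illustration of "store the CRC32 next to the string":
# the hash narrows the search, the full string comparison settles it.
import zlib
from collections import defaultdict

class HashIndexedStore:
    def __init__(self):
        # crc32 value -> list of strings sharing that hash (collisions allowed)
        self.buckets=defaultdict(list)

    def add(self, s):
        self.buckets[zlib.crc32(s.encode('utf-8'))].append(s)

    def contains(self, s):
        return s in self.buckets[zlib.crc32(s.encode('utf-8'))]

store=HashIndexedStore()
store.add("some long string")
print(store.contains("some long string"))     # True
print(store.contains("another long string"))  # False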
A gadget repair shop. You bring your gadget to them and they write down your phone number. When you come the second time, they ask for the last 2 digits of your phone number, and then they go searching for the box with your gadget -- the one with a paper note glued/stapled to it bearing your number.
How many (new) customers have to come to the shop before collisions appear?
#!/usr/bin/env python3
import random


def gen_N_two_digit_numbers(total):
    return [random.randint(0, 99) for _ in range(total)]

def chk_for_dups(l):
    return len(set(l))!=len(l)

sample_size=10**5
print (f"{sample_size=}")

for sample_len in range(2, 40):
    collisions=0
    for _ in range(sample_size):
        x=gen_N_two_digit_numbers(sample_len)
        if chk_for_dups(x):
            collisions+=1
    print (f"{sample_len=} {collisions=}, ratio={collisions/sample_size}")
sample_size=100000
sample_len=2 collisions=1033, ratio=0.01033
sample_len=3 collisions=2943, ratio=0.02943
sample_len=4 collisions=5919, ratio=0.05919
sample_len=5 collisions=9635, ratio=0.09635
sample_len=6 collisions=14142, ratio=0.14142
sample_len=7 collisions=19331, ratio=0.19331
sample_len=8 collisions=24914, ratio=0.24914
sample_len=9 collisions=31072, ratio=0.31072
sample_len=10 collisions=37473, ratio=0.37473
sample_len=11 collisions=43605, ratio=0.43605
sample_len=12 collisions=49768, ratio=0.49768
sample_len=13 collisions=55806, ratio=0.55806
sample_len=14 collisions=61309, ratio=0.61309
sample_len=15 collisions=66632, ratio=0.66632
sample_len=16 collisions=71578, ratio=0.71578
sample_len=17 collisions=76480, ratio=0.7648
sample_len=18 collisions=80470, ratio=0.8047
sample_len=19 collisions=84015, ratio=0.84015
sample_len=20 collisions=87007, ratio=0.87007
sample_len=21 collisions=89499, ratio=0.89499
sample_len=22 collisions=91645, ratio=0.91645
sample_len=23 collisions=93616, ratio=0.93616
sample_len=24 collisions=95015, ratio=0.95015
sample_len=25 collisions=96266, ratio=0.96266
sample_len=26 collisions=97155, ratio=0.97155
sample_len=27 collisions=97844, ratio=0.97844
sample_len=28 collisions=98487, ratio=0.98487
sample_len=29 collisions=98951, ratio=0.98951
sample_len=30 collisions=99254, ratio=0.99254
sample_len=31 collisions=99454, ratio=0.99454
sample_len=32 collisions=99636, ratio=0.99636
sample_len=33 collisions=99743, ratio=0.99743
sample_len=34 collisions=99830, ratio=0.9983
sample_len=35 collisions=99897, ratio=0.99897
sample_len=36 collisions=99928, ratio=0.99928
sample_len=37 collisions=99947, ratio=0.99947
sample_len=38 collisions=99971, ratio=0.99971
sample_len=39 collisions=99983, ratio=0.99983
With 20 or more customers, collisions appear almost always (but not always). With about 12 new customers the probability is already around 50%. And the inconvenience (due to coincidences) begins with as few as ~8-9 new customers, when the chance is roughly 25-30%.
This is counterintuitive: you might expect that a much larger number of customers would be needed before such collisions/coincidences show up.
Now what if the shop asks for the last 3 digits of the phone number?
--- v1.py	2025-08-15 14:59:53.236727522 +0300
+++ v2.py	2025-08-15 15:00:21.379543244 +0300
@@ -2,8 +2,8 @@
 import random
 
 
-def gen_N_two_digit_numbers(total):
-    return [random.randint(0, 99) for _ in range(total)]
+def gen_N_three_digit_numbers(total):
+    return [random.randint(0, 999) for _ in range(total)]
 
 def chk_for_dups(l):
     return len(set(l))!=len(l)
@@ -11,10 +11,10 @@
 sample_size=10**5
 print (f"{sample_size=}")
 
-for sample_len in range(2, 40):
+for sample_len in range(2, 100):
     collisions=0
     for _ in range(sample_size):
-        x=gen_N_two_digit_numbers(sample_len)
+        x=gen_N_three_digit_numbers(sample_len)
         if chk_for_dups(x):
             collisions+=1
     print (f"{sample_len=} {collisions=}, ratio={collisions/sample_size}")
sample_size=100000
sample_len=2 collisions=120, ratio=0.0012
sample_len=3 collisions=291, ratio=0.00291
sample_len=4 collisions=578, ratio=0.00578
sample_len=5 collisions=984, ratio=0.00984
sample_len=6 collisions=1536, ratio=0.01536
sample_len=7 collisions=2134, ratio=0.02134
sample_len=8 collisions=2748, ratio=0.02748
sample_len=9 collisions=3574, ratio=0.03574
sample_len=10 collisions=4423, ratio=0.04423
sample_len=11 collisions=5321, ratio=0.05321
sample_len=12 collisions=6382, ratio=0.06382
sample_len=13 collisions=7682, ratio=0.07682
sample_len=14 collisions=8764, ratio=0.08764
sample_len=15 collisions=10228, ratio=0.10228
sample_len=16 collisions=11390, ratio=0.1139
sample_len=17 collisions=12635, ratio=0.12635
sample_len=18 collisions=14025, ratio=0.14025
sample_len=19 collisions=15804, ratio=0.15804
sample_len=20 collisions=17783, ratio=0.17783
sample_len=21 collisions=18984, ratio=0.18984
sample_len=22 collisions=20776, ratio=0.20776
sample_len=23 collisions=22304, ratio=0.22304
sample_len=24 collisions=24295, ratio=0.24295
sample_len=25 collisions=26180, ratio=0.2618
sample_len=26 collisions=27703, ratio=0.27703
sample_len=27 collisions=29800, ratio=0.298
sample_len=28 collisions=31501, ratio=0.31501
sample_len=29 collisions=33726, ratio=0.33726
sample_len=30 collisions=35119, ratio=0.35119
sample_len=31 collisions=37602, ratio=0.37602
sample_len=32 collisions=39273, ratio=0.39273
sample_len=33 collisions=41581, ratio=0.41581
sample_len=34 collisions=43374, ratio=0.43374
sample_len=35 collisions=45342, ratio=0.45342
sample_len=36 collisions=47010, ratio=0.4701
sample_len=37 collisions=49172, ratio=0.49172
sample_len=38 collisions=50687, ratio=0.50687
sample_len=39 collisions=53015, ratio=0.53015
sample_len=40 collisions=54725, ratio=0.54725
sample_len=41 collisions=56398, ratio=0.56398
sample_len=42 collisions=58490, ratio=0.5849
sample_len=43 collisions=59921, ratio=0.59921
sample_len=44 collisions=61725, ratio=0.61725
sample_len=45 collisions=63693, ratio=0.63693
sample_len=46 collisions=65070, ratio=0.6507
sample_len=47 collisions=66765, ratio=0.66765
sample_len=48 collisions=68401, ratio=0.68401
sample_len=49 collisions=69660, ratio=0.6966
sample_len=50 collisions=71250, ratio=0.7125
sample_len=51 collisions=72805, ratio=0.72805
sample_len=52 collisions=73932, ratio=0.73932
sample_len=53 collisions=75091, ratio=0.75091
sample_len=54 collisions=76739, ratio=0.76739
sample_len=55 collisions=77826, ratio=0.77826
sample_len=56 collisions=78950, ratio=0.7895
sample_len=57 collisions=80198, ratio=0.80198
sample_len=58 collisions=81314, ratio=0.81314
sample_len=59 collisions=82619, ratio=0.82619
sample_len=60 collisions=83484, ratio=0.83484
sample_len=61 collisions=84730, ratio=0.8473
sample_len=62 collisions=85440, ratio=0.8544
sample_len=63 collisions=86388, ratio=0.86388
sample_len=64 collisions=87270, ratio=0.8727
sample_len=65 collisions=88064, ratio=0.88064
sample_len=66 collisions=89028, ratio=0.89028
sample_len=67 collisions=89449, ratio=0.89449
sample_len=68 collisions=90280, ratio=0.9028
sample_len=69 collisions=90914, ratio=0.90914
sample_len=70 collisions=91661, ratio=0.91661
sample_len=71 collisions=92265, ratio=0.92265
sample_len=72 collisions=92635, ratio=0.92635
sample_len=73 collisions=93293, ratio=0.93293
sample_len=74 collisions=93728, ratio=0.93728
sample_len=75 collisions=94151, ratio=0.94151
sample_len=76 collisions=94710, ratio=0.9471
sample_len=77 collisions=95150, ratio=0.9515
sample_len=78 collisions=95426, ratio=0.95426
sample_len=79 collisions=95752, ratio=0.95752
sample_len=80 collisions=96128, ratio=0.96128
sample_len=81 collisions=96404, ratio=0.96404
sample_len=82 collisions=96758, ratio=0.96758
sample_len=83 collisions=96996, ratio=0.96996
sample_len=84 collisions=97203, ratio=0.97203
sample_len=85 collisions=97497, ratio=0.97497
sample_len=86 collisions=97741, ratio=0.97741
sample_len=87 collisions=97875, ratio=0.97875
sample_len=88 collisions=98082, ratio=0.98082
sample_len=89 collisions=98306, ratio=0.98306
sample_len=90 collisions=98387, ratio=0.98387
sample_len=91 collisions=98469, ratio=0.98469
sample_len=92 collisions=98686, ratio=0.98686
sample_len=93 collisions=98775, ratio=0.98775
sample_len=94 collisions=98888, ratio=0.98888
sample_len=95 collisions=98978, ratio=0.98978
sample_len=96 collisions=99127, ratio=0.99127
sample_len=97 collisions=99154, ratio=0.99154
sample_len=98 collisions=99269, ratio=0.99269
sample_len=99 collisions=99325, ratio=0.99325
The collision probability reaches 50% with only 37-38 customers, and 99% with more than ~96 customers.
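These simulated ratios can be cross-checked against the exact formula for the probability of at least one repeat among n values drawn uniformly from d possibilities: \( P(n,d)=1-\prod_{i=1}^{n-1}(1-i/d) \). A minimal sketch (my own addition; the collision_probability name is made up), with d=100 for the 2-digit case and d=1000 for the 3-digit case:

#!/usr/bin/env python3
# Exact probability of at least one repeated value among n uniform draws
# from d possibilities: P(n,d) = 1 - prod_{i=1..n-1} (1 - i/d).
def collision_probability(n, d):
    p_no_collision=1.0
    for i in range(1, n):
        p_no_collision *= 1 - i/d
    return 1 - p_no_collision

# these should come out close to the simulated ratios above
for n, d in [(12, 100), (20, 100), (38, 1000), (96, 1000)]:
    print(f"{n=} {d=} P={collision_probability(n, d):.5f}")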
Some time ago (before 24-Mar-2025) there was a Disqus JS script for comments here. I dropped it --- it was too motley, distracting and animated, with too many ads. I never liked it. Also, comments didn't always appear correctly (Disqus was buggy). Also, my blog is too chamber-like --- not many people write comments here. So I decided to switch back to the model I had at least in 2020 --- send me your comments by email (don't forget to include the URL of this blog post) and I'll copy&paste them here manually.
Let's party like it's ~1993-1996, in this ultimate, radical and uncompromisingly primitive pre-web1.0-style blog and website.