[Math] Weeding out SEO domains using an entropy metric

SEO farms usually have random-looking domain names (indeed, they are randomly generated).

This is how they can be identified: such a domain name should have high entropy, as opposed to 'legitimate' domain names written in some natural language, which usually have lower entropy.

Entropy of a string randomly generated from the characters a-z:

% cat /dev/urandom | tr -dc 'a-z' | fold -w 40 | head -n 1
rzatstkzmnbpsnqiwhruzkbvegljbrcnenxgineq

% cat /dev/urandom | tr -dc 'a-z' | fold -w 40 | head -n 1 | ent
Entropy = 4.192322 bits per byte.
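
For readers without `ent` installed, here is a minimal Python sketch of the figure it reports, assuming `ent` computes plain Shannon entropy over byte frequencies. The function name `shannon_entropy` is my own and not part of any library.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Return the Shannon entropy of data in bits per byte."""
    if not data:
        return 0.0
    counts = Counter(data)
    total = len(data)
    # H = -sum(p * log2(p)) over the byte values that occur in the input
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


if __name__ == "__main__":
    # echo appends a newline, and ent counts it as one more byte,
    # so the newline is included here as well
    word = b"supercalifragilisticexpialidocious\n"
    print(f"Entropy = {shannon_entropy(word):.6f} bits per byte.")
```

With the trailing newline included, the result for this word should agree with the value `ent` reports for it in this post.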

And some weird words in English (or Latin):

% echo pneumonoultramicroscopicsilicovolcanoconiosis | ent
Entropy = 3.560437 bits per byte.

% echo supercalifragilisticexpialidocious | ent
Entropy = 3.738682 bits per byte.

And place names in unknown (to me) languages, but written in Latin letters:

% echo Taumata­whakatangihanga­koauau­o­tamatea­turi­pukaka­piki­maunga­horo­nuku­pokai­whenua­ki­tana­tahu | ent
Entropy = 3.696873 bits per byte.

% echo Llanfair­pwllgwyngyll­gogery­chwyrn­drobwll­llan­tysilio­gogo­goch | ent
Entropy = 4.022153 bits per byte.

~4 bits is a high level, but it is still slightly lower than that of a random string (~4.2 bits).

Any natural language, no matter how exotic, exhibits some regularities, which are (almost) absent in random strings. These regularities lower the final entropy level.
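
The idea could be turned into a filter roughly like this. It is only a sketch: the function names, the 4.0 threshold, and the choice to score the longest dot-separated label of each domain are my assumptions for illustration, not a tuned detector.

```python
import math
from collections import Counter


def entropy_bits_per_char(text: str) -> float:
    """Shannon entropy of text, in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def flag_suspicious(domains, threshold=4.0):
    """Return the domains whose longest label reaches the entropy threshold.

    The threshold is a hypothetical value picked from the ~4.0 vs ~4.2
    figures in this post; it would need tuning on real data.
    """
    flagged = []
    for domain in domains:
        labels = domain.lower().rstrip(".").split(".")
        # Assumption: the longest label carries the generated part of the name
        label = max(labels, key=len)
        if entropy_bits_per_char(label) >= threshold:
            flagged.append(domain)
    return flagged


if __name__ == "__main__":
    candidates = [
        "rzatstkzmnbpsnqiwhruzkbvegljbrcnenxgineq.com",
        "supercalifragilisticexpialidocious.com",
    ]
    print(flag_suspicious(candidates))
```

One caveat: a string of n characters cannot have more than log2(n) bits of entropy per character, so a label shorter than 16 characters can never reach 4 bits. A fixed threshold like this one only makes sense for long names; short ones would need a length-dependent cutoff.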


Conversely, the entropy metric is a good measure of a password's strength (along with the password's length).

(This post was first published on 20241029 and updated on 20250329.)

