[Math] Weed out SEO domains using entropy metric

SEO farms usually have random-looking domain names (and they are indeed randomly generated).

This is how they can be identified: such a domain name should have high entropy, as opposed to legitimate domain names written in some natural language, which usually have lower entropy.

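Entropy level of a randomly generated string of a-z characters (each command below draws a fresh sample from /dev/urandom, so the string printed first is not the one measured):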

% cat /dev/urandom | tr -dc 'a-z' | fold -w 40 | head -n 1
rzatstkzmnbpsnqiwhruzkbvegljbrcnenxgineq

% cat /dev/urandom | tr -dc 'a-z' | fold -w 40 | head -n 1 | ent
Entropy = 4.192322 bits per byte.
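
For readers without ent at hand, the figure it prints here is the Shannon entropy of the byte frequencies, H = -sum(p_i * log2(p_i)). A minimal Python sketch of that calculation follows; the function name shannon_entropy is my own choice, not part of ent.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Return the Shannon entropy of data in bits per byte."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)  # how often each byte value occurs
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


if __name__ == "__main__":
    # echo appends a newline, and ent counts it like any other byte,
    # so the newline is included here as well.
    word = b"supercalifragilisticexpialidocious\n"
    print(f"Entropy = {shannon_entropy(word):.6f} bits per byte.")
```

Run as a script, this should print the same 3.738682 bits per byte that ent reports for that word further down.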

And some weird words in English (or Latin):

% echo pneumonoultramicroscopicsilicovolcanoconiosis | ent
Entropy = 3.560437 bits per byte.

% echo supercalifragilisticexpialidocious | ent
Entropy = 3.738682 bits per byte.

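And place names in languages unknown to me, but written in Latin letters. (Copied from the web, these names often carry invisible soft hyphens, U+00AD, between syllables. Each one adds two extra bytes in UTF-8 and inflates the result, to 3.696873 and 4.022153 bits for the two names below, so they are stripped here.)

% echo Taumatawhakatangihangakoauauotamateaturipukakapikimaungahoronukupokaiwhenuakitanatahu | ent
Entropy = 3.525314 bits per byte.

% echo Llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch | ent
Entropy = 3.891118 bits per byte.

~3.9 bits is a high level, but still noticeably lower than that of a random string (~4.2 bits).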
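
For scale: the ceiling for a uniform a-z source is log2(26), about 4.70 bits per character. A 40-character sample is too short to show every letter equally often, so its measured entropy lands lower, which is consistent with the ~4.2 bits seen for the random string. A rough simulation sketch of that effect follows. The function names, trial counts and seed are my assumptions, and the trailing newline that echo adds is ignored.

```python
import math
import random
import string
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Shannon entropy of the character frequencies, in bits per character."""
    total = len(text)
    counts = Counter(text)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def mean_entropy(length: int, trials: int = 1000, seed: int = 0) -> float:
    """Average measured entropy of random a-z strings of the given length."""
    rng = random.Random(seed)  # fixed seed so repeated runs agree
    acc = 0.0
    for _ in range(trials):
        sample = "".join(rng.choices(string.ascii_lowercase, k=length))
        acc += shannon_entropy(sample)
    return acc / trials


if __name__ == "__main__":
    print(f"ceiling log2(26) = {math.log2(26):.3f}")
    for n in (40, 400, 4000):
        print(f"length {n}: mean entropy {mean_entropy(n, trials=200):.3f}")
```

The measured value should creep towards the ceiling as the strings get longer, which is worth keeping in mind when comparing names of very different lengths.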

Any natural language, no matter how exotic, exhibits some regularities, which are (almost) absent in random strings. These regularities lower the final entropy level.
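
Put together, a filter along these lines could look like the sketch below. Everything specific in it is an assumption for illustration: the name looks_random, the 4.1-bit cutoff (picked to sit between the natural-language figures and the ~4.2 bits of the random sample), and the choice to score only the label before the first dot. One caveat that holds regardless: a string of n characters cannot exceed log2(n) bits, so a fixed cutoff like this only makes sense for long names.

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Shannon entropy of the character frequencies, in bits per character."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def looks_random(domain: str, threshold: float = 4.1) -> bool:
    """Flag a domain whose first label has entropy at or above the threshold.

    The threshold is a hypothetical value, not a tuned one.
    """
    label = domain.lower().split(".")[0]  # "name.tld" -> "name"
    return shannon_entropy(label) >= threshold


if __name__ == "__main__":
    for name in (
        "rzatstkzmnbpsnqiwhruzkbvegljbrcnenxgineq.com",
        "supercalifragilisticexpialidocious.com",
    ):
        verdict = "suspicious" if looks_random(name) else "ok"
        print(f"{name}: {verdict}")
```

With this cutoff the 40-character random sample from the top of the post is flagged and the long English word is not. A real filter would need a length-aware cutoff and a labelled set of domains to tune it on.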

(This post was first published on 20241029.)
