Yet another use of regex backreferences

(The following text has been copy-pasted into the SAT/SMT by Example book.)

For my first blog post dedicated to backreferences, see: SAT solver on top of regex matcher.

This time, I was looking for good words to serve as examples in my blog posts about the Knuth-Morris-Pratt algorithm. I wanted a list of words that have repeated prefixes and suffixes.

I took a good collection of English words (words_alpha.txt) from here.

Then I used sed to find words with repeated prefixes:

sed -E -n '/^(.+)\1(.+)$/p' words_alpha.txt

Some of them:

eel
oops
ooze
cocoa
cocos
kokos
mimic
cocoon
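
To see which prefix the backreference actually captured, the same expression can be turned into a substitution. This is a minimal sketch, assuming GNU sed (POSIX does not specify backreferences for extended regular expressions, GNU sed accepts them as an extension). The helper name show_prefix and the inline word sample are made up for illustration and are not part of the original one-liner.

```shell
#!/bin/sh
# Print each matching word split into "prefix prefix rest".
# show_prefix is a hypothetical helper name used only in this sketch.
show_prefix() {
  sed -E -n 's/^(.+)\1(.+)$/\1 \1 \2/p'
}

# Small inline sample instead of words_alpha.txt;
# "kmp" and "word" have no repeated prefix and are dropped.
printf '%s\n' cocoa mimic kmp eel word | show_prefix
# co co a
# mi mi c
# e e l
```

The sample words were picked so that only one split is possible. For words that can be split in several ways, which prefix gets reported depends on the matcher.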

I couldn't get sed to find repeated suffixes, so I wrote a Racket program to do that (each repeated suffix must have at least two characters):

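#lang racket

;; Alternative pattern: the same prefix twice at the start of the word.
;(define r (pregexp "^(.+)\\1(.+)$")) ; two prefixes
;; The same suffix (at least two characters long) twice at the end of the word.
(define r (pregexp "^(.+)(..+)\\2$")) ; two suffixes >=2

;; True if the word s matches the pattern r.
(define (f s)
  (regexp-match? r s))

;; Matching words, shortest first.
(define result
  (sort
   (filter f (file->lines "words_alpha.txt"))
   (lambda (x y)
     (< (string-length x) (string-length y)))))

(for ([i result])
  (displayln i))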

Some of these:

ceded
crisis
rococo
cantata
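
For comparison, the suffix expression from the Racket program can also be handed to sed. This is only a sketch, assuming GNU sed, and it was checked against a small inline sample rather than the full words_alpha.txt. One thing that can silently break the `\2$` anchor is a carriage return at the end of each line, so the sketch strips it first. Whether the word list really has Windows line endings is an assumption, not something established above. The helper name repeated_suffix is made up for this example.

```shell
#!/bin/sh
# Keep words of the form <head><unit><unit>, where <unit> has at least two
# characters, then order them by length like the Racket program does.
# repeated_suffix is a hypothetical helper name used only in this sketch.
repeated_suffix() {
  tr -d '\r' |                          # drop carriage returns, if any
    sed -E -n '/^(.+)(..+)\2$/p' |      # same pattern as the pregexp above
    awk '{ print length($0), $0 }' |    # prefix each word with its length
    sort -n -s -k1,1 |                  # stable numeric sort on the length
    cut -d' ' -f2-                      # drop the length column again
}

# The sample uses CRLF line endings on purpose;
# "regex" and "eel" do not end in a doubled unit.
printf '%s\r\n' cantata regex ceded rococo eel | repeated_suffix
# ceded
# rococo
# cantata
```

Without the tr step, none of the CRLF-terminated sample words would match, because the carriage return sits between the doubled suffix and the end of the line.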

These sound like a list of diseases:

hydrofluosilicic
integropallialia
interjaculateded
panmyelophthisis
plasmaphoresisis
pneumonophthisis
antihemagglutinin
ophthalmophthisis
bacterioagglutinin
erythrocytoschisis
phytohemagglutinin
thoracoceloschisis
craniorhachischisis
phytohaemagglutinin
thoracogastroschisis

I couldn't resist the itch and tried to find all words with a suffix repeated three times:

(define r (pregexp "^(.+)(.+)\\2\\2$"))

That includes both words ending in the same character three times and the rare term 'ratatat', where the suffix 'at' is repeated three times:

brrr
ieee
mmmm
oooo
viii
xiii
xviii
xxiii
ratatat
earlesss

('earlesss' seems to be a typo in the list of words I used.)
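
The split between the leading part and the tripled unit can be made visible with a sed substitution. Again, this is a sketch assuming GNU sed. The helper name show_triple and the sample words are illustrative only.

```shell
#!/bin/sh
# Print "<head> + 3 x <unit>" for words that end in the same unit three times.
# show_triple is a hypothetical helper name used only in this sketch.
show_triple() {
  sed -E -n 's/^(.+)(.+)\2\2$/\1 + 3 x \2/p'
}

# "cantata" ends in a doubled unit only, so it is dropped.
printf '%s\n' ratatat brrr xviii cantata | show_triple
# r + 3 x at
# b + 3 x r
# xv + 3 x i
```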


