Fun with regular expressions: part II

(The following text has been copypasted to the SAT/SMT by example book.)

(Part I.)

"(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)"

In NFA, we would just create 12 paths. Alas, this is DFA, so these paths are overlapped at some vertices:

"(1|2|3|4|5|6|7|8|9|10|11|12)"

'1' is also accepting state, but can also lead to '10', '11' and '12':

"(01|02|03|04|05|06|07|08|09|1|2|3|4|5|6|7|8|9|10|11|12)"

"(01|02|03|04|05|06|07|08|09|10|11|12):[0-5][0-9]:[0-5][0-9] (A|P)M"

(Both hours and minutes are always has two digits.)

"[0123][0-9]-(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)-(19|20)[0-9]{2}"

For the sake of simplicity, only years 19xx and 20xx are accepted:

"[0123][0-9]-?(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)-?(19|20)[0-9]{2}"

But what if the dash is optional?

"([0123][0-9]-(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)-(19|20)[0-9]{2}|(19|20)[0-9]{2}-(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)-[0123][0-9])"

Can accept dates in DD-MM-YYYY or YYYY-MM-DD format, dash must be present:

"([0123][0-9]-?(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)-?(19|20)[0-9]{2}|(19|20)[0-9]{2}-?(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)-?[0123][0-9])"

Can accept dates in DD-MM-YYYY or YYYY-MM-DD format, a dash is optional. Now the DFA is too large: date3.png


Important note: using the standard RE, it's impossible to match only DD-MM-YYYY or DDMMYYYY strings without matching DD-MMYYYY or DDMM-YYYY, because it's impossible to represent this in DFA form.

But you can fix this with non-standard backreferences. However, this wouldn't translate to DFA.

Read more about backreferences in my book: Ctrl-F "backreferences".


Proceed to part III.


As my reader comments...

Date: Thu, 19 Aug 2021 12:46:29 -0500
From: Tim Chase < yurichev.com (a) tim.thechases.com >
To: blog (a) yurichev.com
Subject: Regex accepting DD-MM-YYYY or DDMMYYYY without DD-MMYYYY or DDMM-YYYY

Reading

https://yurichev.com/news/20210819_RE2/

you wrote

> Important note: using the standard RE, it's impossible to match
> only DD-MM-YYYY or DDMMYYYY strings without matching DD-MMYYYY or
> DDMM-YYYY, because it's impossible to represent this in DFA form.

It seems like this should be possible with

[0123][0-9](-(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)-(19|20)[0-9]{2}|(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)(19|20)[0-9]{2})

It's not pretty on account of having all that duplication, but the
result should accept DD-MM-YYYY or DDMMYYYY without accepting
DD-MMYYYY or DDMM-YYYY

(now whether one could *also* accept YYYY-MM-DD/YYYYMMDD format with
the same regexps, it might require some sort of deeper magic and
back-tracking)

Anyway, thanks for the fun posts!

-Tim


List of my other blog posts.

Yes, I know about these lousy Disqus ads. Please use adblocker. I would consider to subscribe to 'pro' version of Disqus if the signal/noise ratio in comments would be good enough.