Regexp pre-processing

Regexps are cool, but sometimes you may want to help them a bit. Say, by adding a 'pre-processing' phase.

Here is some real Go code I just wrote. It finds valid emails in an input string, using a regular expression. Like anyone would do.

The task is to collect emails from (very) big text files. (Not that I'm very proud of what I do.)

But when I first check that the input string contains "@" and "." (at least one of each must be present), that gives a sizable speed-up:

        // cheap substring checks first; the regexp runs only if both pass
        if strings.Contains(text, "@") && strings.Contains(text, ".") {
            isEmailValid(text)
        }

(My regexp is: `[a-z0-9][a-z0-9._\-]{1,25}[a-z0-9]@[a-z0-9][a-z0-9.\-]+[a-z0-9]\.[a-z]{2,25}`. And of course I compile it once, at start.)
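Put together, the pre-check and the matcher might look like the sketch below. It is only an illustration, not the contents of my test files: findEmails is a hypothetical helper name, and I assume here that all matches in the string are wanted (FindAllString).

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// emailRe is the pattern quoted above, compiled once at start-up.
var emailRe = regexp.MustCompile(`[a-z0-9][a-z0-9._\-]{1,25}[a-z0-9]@[a-z0-9][a-z0-9.\-]+[a-z0-9]\.[a-z]{2,25}`)

// findEmails is a hypothetical helper: it runs the two cheap substring
// checks first and hands the text to the regexp matcher only if both pass.
func findEmails(text string) []string {
	if !strings.Contains(text, "@") || !strings.Contains(text, ".") {
		return nil // no "@" or no ".": the regexp cannot match, skip it
	}
	return emailRe.FindAllString(text, -1)
}

func main() {
	fmt.Println(findEmails("contact: john.doe@example.com, thanks"))
	fmt.Println(findEmails("no address on this line"))
}
```

The early return is safe because the pattern itself requires a literal "@" and a literal ".", so a string lacking either one can never match.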

For example, I tried to search for emails in a bunch of random RFC text files (~9300 files, ~513 MB) on an AMD Ryzen 5 3600, one thread. The non-optimized regexp matcher (email_extract_test_v1.go) took ~43 seconds; my fancy version with two additional strings.Contains() calls (email_extract_test_v2.go) took only ~2.3 seconds!

Summarizing: before running a regexp matcher with a complex expression, you may want to check first whether some required characters or substrings are present in the input string, using the standard string functions of your programming language. This may be a speed improvement.

Also, your string is already in L(0|1)d? (regexp pun intended) cache after these checks, so the regexp matcher has fast (enough) access to it.

Also, characters can be counted. For example, "@" must occur only once in an email. Counting it using your PL's library function may improve speed as well. (But beware: a text string with more than one "@" fails this check and is skipped entirely, so strings holding several emails are lost.)

        // exactly one "@" required; strings with several "@" are skipped
        if strings.Count(text, "@") == 1 && strings.Contains(text, ".") {
            isEmailValid(text)
        }

But in my case (email_extract_test_v3.go), that was again ~2.3 seconds, almost no improvement. (And the output is slightly shorter, because of the limitation described above.)
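To see why the counting variant produces shorter output, here is a small sketch comparing the two pre-checks on a line holding two addresses. The helper names extractContains and extractCountOne are made up for this illustration, and the sample line is invented.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Same pattern as in the post, compiled once.
var emailRe = regexp.MustCompile(`[a-z0-9][a-z0-9._\-]{1,25}[a-z0-9]@[a-z0-9][a-z0-9.\-]+[a-z0-9]\.[a-z]{2,25}`)

// extractContains (hypothetical name) uses the strings.Contains pre-check.
func extractContains(text string) []string {
	if strings.Contains(text, "@") && strings.Contains(text, ".") {
		return emailRe.FindAllString(text, -1)
	}
	return nil
}

// extractCountOne (hypothetical name) insists on exactly one "@".
func extractCountOne(text string) []string {
	if strings.Count(text, "@") == 1 && strings.Contains(text, ".") {
		return emailRe.FindAllString(text, -1)
	}
	return nil
}

func main() {
	line := "cc: alice@example.org and bob@example.net" // invented sample
	fmt.Println(len(extractContains(line))) // both addresses are found
	fmt.Println(len(extractCountOne(line))) // the line is skipped: two "@"
}
```

So the counting check is only safe when each input string is known to hold at most one address.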

Also, please note that all measurements should be done after cache warm-up. That is, run your code several times before measuring, so that the operating system populates its file cache with your test files (at least partially). Of course, the computer I used for these experiments has more RAM than the ~513 MB of test text files.
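If you prefer to do the warm-up inside the program instead of re-running it by hand, something like the following sketch could work. This is not how my test files do it; timeWarm, readAll and the "rfc" directory name are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"time"
)

// timeWarm is a hypothetical helper: it calls fn warmups times without
// timing it, then times one more call and returns that duration.
func timeWarm(warmups int, fn func()) time.Duration {
	for i := 0; i < warmups; i++ {
		fn()
	}
	start := time.Now()
	fn()
	return time.Since(start)
}

// readAll reads every non-directory entry directly inside dir and returns
// the number of bytes read. The real extraction work would go here.
func readAll(dir string) int {
	entries, err := os.ReadDir(dir)
	if err != nil {
		return 0 // missing or unreadable directory: nothing to read
	}
	total := 0
	for _, e := range entries {
		if e.IsDir() {
			continue
		}
		data, err := os.ReadFile(filepath.Join(dir, e.Name()))
		if err == nil {
			total += len(data)
		}
	}
	return total
}

func main() {
	dir := "rfc" // assumed directory name, not taken from the post
	elapsed := timeWarm(2, func() { readAll(dir) })
	fmt.Println("timed run after warm-up:", elapsed)
}
```

Only the last run is timed, so the first passes pay for the disk reads and the measured pass mostly sees cached data.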

And of course, other regexp libraries (in your favorite PL(s)) may exhibit very different results from mine. YMMV, as they say.

All the files: 1, 2, 3.

The test RFC files I used.

(Updated 20240707 17:44:54 CEST.)

This may also be faster, if you want to find a "CAD" substring that is not part of another word:

        if strings.Contains(str, "CAD") {
            b1, _ = regexp.MatchString("\\bCAD\\b", str)
        }
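A slightly fuller sketch of the same idea, with the pattern compiled once instead of on every regexp.MatchString call. The helper name hasCADWord and the sample strings are made up for this illustration.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// cadRe matches "CAD" only as a whole word; compiled once.
var cadRe = regexp.MustCompile(`\bCAD\b`)

// hasCADWord is a hypothetical helper: the cheap substring check runs
// first, and the regexp runs only when "CAD" occurs somewhere at all.
func hasCADWord(s string) bool {
	return strings.Contains(s, "CAD") && cadRe.MatchString(s)
}

func main() {
	fmt.Println(hasCADWord("priced in CAD today")) // whole word
	fmt.Println(hasCADWord("a CADillac"))          // part of a longer word
	fmt.Println(hasCADWord("nothing to see"))      // substring absent
}
```

Whether precompiling matters more than the pre-check depends on your input, so measure both.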

(This post was first published on 20240627, updated 20240707.)
