Regexps are cool, but sometimes you may want to help them a bit. Say, by adding a 'pre-process' phase.
Here is some real Go code I just wrote. It finds a valid email address in an input string using a regular expression, like anyone would do.
The task is to collect emails from (very) big text files. (Not that I'm very proud of what I do.)
But when I first check that the input string contains the "@" and "." characters (at least one of each must be present), I get some speed-up:
if strings.Contains(text, "@") && strings.Contains(text, ".") { isEmailValid(text) }
(My regexp is: `[a-z0-9][a-z0-9._\-]{1,25}[a-z0-9]@[a-z0-9][a-z0-9.\-]+[a-z0-9]\.[a-z]{2,25}`. And of course I compile it once, at start.)
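Put together, the pre-check and the regexp might look roughly like the sketch below. The helper name findEmails is made up for illustration (my real isEmailValid() is not shown here), and the sketch assumes the input is already lowercase, since the pattern only has lowercase classes.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// emailRe is the pattern quoted above, compiled once at start-up.
var emailRe = regexp.MustCompile(`[a-z0-9][a-z0-9._\-]{1,25}[a-z0-9]@[a-z0-9][a-z0-9.\-]+[a-z0-9]\.[a-z]{2,25}`)

// findEmails is a hypothetical helper: it runs the regexp only when the
// text contains both "@" and ".", and returns every match it finds.
func findEmails(text string) []string {
	if !strings.Contains(text, "@") || !strings.Contains(text, ".") {
		return nil // cheap rejection, the regexp matcher is never started
	}
	return emailRe.FindAllString(text, -1)
}

func main() {
	fmt.Println(findEmails("no address on this line"))
	fmt.Println(findEmails("contact: john.doe@example.org, thanks"))
}
```

The two strings.Contains() calls are plain substring scans, so a string without "@" costs almost nothing.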
For example, I tried to search for emails in a bunch of random RFC text files (~9300 files, ~513 MB) on an AMD Ryzen 5 3600, one thread. The non-optimized regexp matcher (email_extract_test_v1.go) took ~43 seconds; my fancy version with two additional strings.Contains() calls (email_extract_test_v2.go) took only ~2.3 seconds!
To summarize: before running a regexp matcher with a complex expression, you may want to check first whether a character or a substring is present in the input string, using a standard string function of your programming language. This may improve speed, presumably because the cheap check rejects most inputs before the matcher ever runs.
Also, after these checks your string is likely already in the L(0|1)d? cache (regexp pun intended), so the regexp matcher has fast (enough) access to it.
Also, characters can be counted. For example, "@" must occur only once in an email. Counting it using your PL's library function may improve speed as well. (But beware: a text string that contains several emails, and therefore several "@" characters, is then skipped entirely, so its emails are not extracted.)
if strings.Count(text, "@") == 1 && strings.Contains(text, ".") { isEmailValid(text) }
But in my case (email_extract_test_v3.go), that again took ~2.3 seconds, so almost no improvement. (And the output is slightly shorter because of the caveat described above.)
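The side effect of the counting variant is easy to reproduce with a small sketch. The helper name findSingleEmail is hypothetical, and the input is assumed to be lowercase:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// emailRe is the same pattern as in the post, compiled once.
var emailRe = regexp.MustCompile(`[a-z0-9][a-z0-9._\-]{1,25}[a-z0-9]@[a-z0-9][a-z0-9.\-]+[a-z0-9]\.[a-z]{2,25}`)

// findSingleEmail is a hypothetical helper: it runs the regexp only when
// the text contains exactly one "@" and at least one ".".
func findSingleEmail(text string) string {
	if strings.Count(text, "@") != 1 || !strings.Contains(text, ".") {
		return "" // zero or several "@": the regexp is skipped
	}
	return emailRe.FindString(text)
}

func main() {
	// One "@": the address is found.
	fmt.Println(findSingleEmail("write to alice@example.org"))
	// Two "@": the whole string is skipped, both addresses are lost.
	fmt.Println(findSingleEmail("alice@example.org, bob@example.net") == "")
}
```

So the counting check trades completeness for a pre-filter that, at least on my data, was not measurably faster.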
Also, please note that all measurements should be done after cache warm-up. That is, run your code several times before measuring, so that your filesystem driver populates its cache with your test files (at least partially). Of course, the computer I used for these experiments has more RAM than the ~513 MB of test text files.
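One way to build the warm-up into the measurement itself is sketched below. The helper bestOf is hypothetical, and reporting the fastest timed run is just one common convention, not necessarily how the numbers above were taken:

```go
package main

import (
	"fmt"
	"time"
)

// bestOf is a hypothetical helper: it calls f warmups times without
// timing it (so file caches get populated), then runs times with timing,
// and returns the fastest timed run.
func bestOf(warmups, runs int, f func()) time.Duration {
	for i := 0; i < warmups; i++ {
		f()
	}
	var best time.Duration
	for i := 0; i < runs; i++ {
		start := time.Now()
		f()
		d := time.Since(start)
		if i == 0 || d < best {
			best = d
		}
	}
	return best
}

func main() {
	calls := 0
	work := func() { calls++ } // stand-in for "scan all test files"
	fmt.Println("best:", bestOf(2, 3, work), "calls:", calls)
}
```

In a real test, work would read and scan the whole set of test files.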
And of course, other regexp libraries (in your favorite PL(s)) may exhibit very different results from mine. YMMV, as they say.
(Updated 20240707 17:44:54 CEST.)
This may also be faster if you want to find a "CAD" substring that is not part of another word:
if strings.Contains(str, "CAD") { b1, _ = regexp.MatchString("\\bCAD\\b", str) }
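Note that regexp.MatchString() compiles the pattern on every call. A variant that compiles it once could look like the sketch below; the names cadRe and hasWordCAD are made up for illustration, and I have not measured this version.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// cadRe matches "CAD" as a whole word; compiled once at start-up.
var cadRe = regexp.MustCompile(`\bCAD\b`)

// hasWordCAD is a hypothetical helper: the substring check runs first,
// and the regexp only runs on strings that contain "CAD" at all.
func hasWordCAD(s string) bool {
	return strings.Contains(s, "CAD") && cadRe.MatchString(s)
}

func main() {
	fmt.Println(hasWordCAD("price in CAD today")) // "CAD" stands alone
	fmt.Println(hasWordCAD("CADILLAC"))           // "CAD" is part of a longer word
	fmt.Println(hasWordCAD("nothing relevant"))   // rejected by the substring check
}
```

Keep in mind that \b in Go's regexp package is an ASCII word boundary.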