What's with (non-)greedy regexps?

Let's find all href attributes on Wikipedia's front page:

#!/usr/bin/env python3

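import re
import sys

# Read the whole HTML file named on the command line.
with open(sys.argv[1], encoding="utf-8") as f:
    html = f.read()

# Print every match of the (greedy) pattern, one per line.
x = re.findall(r"href=\"http.*\"", html)
for y in x:
    print(y)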

Ouch:

% wget https://en.wikipedia.org/wiki/Main_Page

% ./1.py Main_Page

...
href="https://species.wikimedia.org/wiki/" class="extiw" title="species:"
href="https://en.wikiversity.org/wiki/" title="Wikiversity"><img alt="Wikiversity logo" src="//upload.wikimedia.org/wikipedia/commons/thumb/0/0b/Wikiversity_logo_2017.svg/41px-Wikiversity_logo_2017.svg.png" decoding="async" width="41" height="34" class="mw-file-element" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/0/0b/Wikiversity_logo_2017.svg/62px-Wikiversity_logo_2017.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/0/0b/Wikiversity_logo_2017.svg/82px-Wikiversity_logo_2017.svg.png 2x" data-file-width="626" data-file-height="512"
href="https://en.wikiversity.org/wiki/" class="extiw" title="v:"
...

The match also grabs other quoted attributes, and even other URLs. This is a greedy regexp: .* matches as much as possible, so each match runs up to the last double quote on the line.
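
The same effect can be reproduced without downloading anything. Here is a minimal sketch; the one-line HTML sample and the example hostnames are made up for illustration:

```python
import re

# Hypothetical sample: two links on the same line.
sample = ('<a href="http://a.example/" title="A">A</a> '
          '<a href="http://b.example/" class="x">B</a>')

# Greedy: .* first runs to the end of the line,
# then backs off only as far as the last double quote.
greedy_matches = re.findall(r'href="http.*"', sample)

for m in greedy_matches:
    print(m)
```

Only one match comes back, and it spans both links, ending at the closing quote of class="x".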

One possible fix is to forbid double quotes between the opening and the closing double quote. (Read [^\"]+ as: one or more characters, none of which is a double quote.)

...
x = re.findall(r"href=\"http[^\"]+\"", html)
...

Another fix is to switch to a non-greedy regexp: .*? matches as little as possible, so it stops at the very next double quote:

...
x = re.findall(r"href=\"http.*?\"", html)
...
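
To see both fixes side by side, here is a small sketch on a made-up one-line sample (the hostnames are hypothetical):

```python
import re

# Hypothetical sample: two links on the same line.
sample = ('<a href="http://a.example/" title="A">A</a> '
          '<a href="http://b.example/" class="x">B</a>')

# Fix 1: a negated character class, no double quote allowed inside.
negated = re.findall(r'href="http[^"]+"', sample)

# Fix 2: a non-greedy quantifier, stop at the nearest double quote.
lazy = re.findall(r'href="http.*?"', sample)

print(negated)
print(lazy)
```

On this input both patterns return the same two matches, one per link.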

Both solutions work, but the first is probably better: it states explicitly which characters may appear inside the quotes.
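
The two patterns are not fully interchangeable, though. A sketch of two corner cases where they disagree (both input strings are contrived for illustration):

```python
import re

NEGATED = r'href="http[^"]+"'
LAZY = r'href="http.*?"'

# Nothing after "http": + demands at least one character, *? accepts zero.
bare = 'href="http"'

# Value broken across two lines: [^"] matches a newline,
# but . does not (unless re.DOTALL is given).
wrapped = 'href="http://a.example/\nwiki"'

bare_negated = re.findall(NEGATED, bare)
bare_lazy = re.findall(LAZY, bare)
wrapped_negated = re.findall(NEGATED, wrapped)
wrapped_lazy = re.findall(LAZY, wrapped)

print(bare_negated, bare_lazy)
print(wrapped_negated, wrapped_lazy)
```

Neither case is likely on a real page, but it is worth knowing which behaviour you are asking for.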

A greedy regexp is useful when you want the outermost pair of double quotes in text with nested quotes, like this:

“[Mr. Lawson] called out the name [Gogol] in a perfectly reasonable way, without pause, without doubt, without a suppressed smile, just as he had called out Brian and Erica and Tom. And then: ‘Well, we’re going to have to read “The Overcoat.” Either that or “The Nose”’” (Lahiri 89).

...

    Fred said, "Hey!"
    Ed said, "Fred said, 'Hey!' "
    Mike said, "Ed said, 'Fred said, "Hey!" ' "

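
A quick sketch of that idea on the last line of the example above. It assumes there is only one outer quotation on the line:

```python
import re

line = '''Mike said, "Ed said, 'Fred said, "Hey!" ' "'''

# Greedy: from the first double quote to the last one on the line.
outer = re.search(r'".*"', line).group(0)

# Non-greedy: from the first double quote to the nearest one.
nearest = re.search(r'".*?"', line).group(0)

print(outer)
print(nearest)
```

The greedy version returns the whole outer quotation, while the non-greedy one stops at the quote that opens "Hey!". With two separate quotations on one line, the greedy pattern would glue them together.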

The same goes for the outermost brackets in an expression with nested brackets, like "(x+(y+z))".
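
The same sketch for brackets; the surrounding expression is made up for illustration:

```python
import re

expr = "f = (x+(y+z)) * 2"

# Greedy: from the first opening bracket to the last closing one.
outer_brackets = re.search(r"\(.*\)", expr).group(0)

# Non-greedy: stops at the first closing bracket, leaving the pair unbalanced.
first_close = re.search(r"\(.*?\)", expr).group(0)

print(outer_brackets)
print(first_close)
```

This only finds the first "(" and the last ")"; it does not check that the brackets balance. For an input like "(a)+(b)" the greedy pattern would return the whole string.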

(This post was first published on 20241210.)

