Let's find all href attributes on Wikipedia's front page:
#!/usr/bin/env python3
import re, sys

f=open(sys.argv[1])
html=f.read()
f.close()

x=re.findall(r"href=\"http.*\"", html)
for y in x:
    print(y)
Ouch:
% wget https://en.wikipedia.org/wiki/Main_Page
% ./1.py Main_Page
...
href="https://species.wikimedia.org/wiki/" class="extiw" title="species:"
href="https://en.wikiversity.org/wiki/" title="Wikiversity"><img alt="Wikiversity logo" src="//upload.wikimedia.org/wikipedia/commons/thumb/0/0b/Wikiversity_logo_2017.svg/41px-Wikiversity_logo_2017.svg.png" decoding="async" width="41" height="34" class="mw-file-element" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/0/0b/Wikiversity_logo_2017.svg/62px-Wikiversity_logo_2017.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/0/0b/Wikiversity_logo_2017.svg/82px-Wikiversity_logo_2017.svg.png 2x" data-file-width="626" data-file-height="512"
href="https://en.wikiversity.org/wiki/" class="extiw" title="v:"
...
It also grabs other quoted data, and sometimes other URLs as well. This is a greedy regexp: .* matches as much as possible, so the match extends to the last double quote on the line instead of stopping at the one that closes the URL.
One possible fix is to disallow double quotes inside the quoted value. (Read [^\"]+ as: one or more characters, none of them a double quote.)
...
x=re.findall(r"href=\"http[^\"]+\"", html)
...
Another fix is to switch to a non-greedy regexp (.*? stops at the first double quote that follows the URL):
...
x=re.findall(r"href=\"http.*?\"", html)
...
Both solutions work, but the first is probably better.
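To see the three patterns side by side without downloading anything, here is a minimal sketch. The HTML fragment and the example.com-style URLs in it are made up for the demonstration; only the three regexps come from the text above.

```python
import re

# Hypothetical one-line HTML fragment (not taken from Wikipedia).
html = ('<a href="https://a.example/" class="extiw" title="A">A</a> '
        '<a href="https://b.example/" title="B">B</a>')

# Greedy: .* runs to the last double quote on the line.
greedy = re.findall(r'href="http.*"', html)

# Negated character class: the value cannot contain a double quote.
negated = re.findall(r'href="http[^"]+"', html)

# Non-greedy: .*? stops at the first double quote after the URL.
lazy = re.findall(r'href="http.*?"', html)

print(greedy)
print(negated)
print(lazy)
```

On this input the greedy pattern should return a single long match that swallows both links, while the other two return each href separately.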
A greedy regexp is still useful when you want the outermost pair of double quotes in text with nested quotes, like this:
“[Mr. Lawson] called out the name [Gogol] in a perfectly reasonable way, without pause, without doubt, without a suppressed smile, just as he had called out Brian and Erica and Tom. And then: ‘Well, we’re going to have to read “The Overcoat.” Either that or “The Nose”’” (Lahiri 89).
...
Fred said, "Hey!"
Ed said, "Fred said, 'Hey!' "
Mike said, "Ed said, 'Fred said, "Hey!" ' "
( src )
The same goes for the outer parentheses in an expression with nested parentheses, like "(x+(y+z))".
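A small sketch of that idea, using the last sample line and the bracket expression from above. The variable names are only for illustration, and it assumes there is exactly one outer pair per string: with something like "(a)+(b)" the greedy pattern would grab everything from the first bracket to the last.

```python
import re

text = 'Mike said, "Ed said, \'Fred said, "Hey!" \' "'
expr = "(x+(y+z))"

# Greedy: from the first double quote to the last one.
outer_quotes = re.search(r'".*"', text).group()

# Non-greedy: from the first double quote to the very next one.
first_quotes = re.search(r'".*?"', text).group()

# Same contrast for parentheses.
outer_parens = re.search(r'\(.*\)', expr).group()
first_parens = re.search(r'\(.*?\)', expr).group()

print(outer_quotes)
print(first_quotes)
print(outer_parens)
print(first_parens)
```

The non-greedy versions stop too early here: they cut the quotation off at the first inner quote and leave the expression with an unbalanced bracket.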