python - Regex to extract names from HTML
I have the two pieces of HTML below and want to extract the names from them.
Code:
;"><strong>deanskyshadow</strong>
;"><strong><em>xavier</em></strong>
The regex should extract the names deanskyshadow and xavier. My current regex:
(?<=(;"><strong><em>)|(;"><strong>))[\s\S]+?(?=(</em></strong>)|(</strong>))
It grabs the names correctly if there is no em tag in the code; if there is one, it also grabs the opening em tag, like this: <em>xavier. How can I fix that?
Match anything that is not a < character. Python's re module does not allow a variable-width look-behind, so your version doesn't work at all. Use a non-capturing pattern instead:
(?:;"><strong>(?:<em>)?)([^<]+?)(?=(?:</em>)?</strong>)
Demo:
>>> import re
>>> sample = '''\
... ;"><strong>deanskyshadow</strong>
... ;"><strong><em>xavier</em></strong>
... '''
>>> re.findall(r'(?:;"><strong>(?:<em>)?)([^<]+?)(?=(?:</em>)?</strong>)', sample)
['deanskyshadow', 'xavier']
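The look-behind restriction mentioned above can be checked directly. This is a minimal sketch, assuming the standard re module; the simplified patterns and the helper name compiles are made up for illustration and are not part of the original question or answer:

```python
import re

def compiles(pattern):
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# Alternatives of different lengths inside a look-behind are rejected by re.
variable_width = r'(?<=<strong><em>|<strong>)[^<]+'
# Matching the optional <em> outside of any look-behind is accepted.
no_lookbehind = r'<strong>(?:<em>)?([^<]+)'

print(compiles(variable_width))  # False
print(compiles(no_lookbehind))   # True
print(re.findall(no_lookbehind, '<strong><em>xavier</em></strong>'))  # ['xavier']
```

Because the optional <em> is consumed by the pattern itself rather than asserted in a look-behind, only the capturing group ends up in the findall result.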
The better solution is to use an HTML parser instead. I can recommend BeautifulSoup:
from bs4 import BeautifulSoup

# htmltext holds the HTML source as a string
soup = BeautifulSoup(htmltext, 'html.parser')
for strong in soup.find_all('strong'):
    print(strong.text)
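If installing a third-party package is not an option, the same parser-based idea can be sketched with the standard library's html.parser module instead of BeautifulSoup. This is only an illustration under that assumption; the class name StrongTextParser and the function extract_names are hypothetical names chosen for this example:

```python
from html.parser import HTMLParser

class StrongTextParser(HTMLParser):
    """Collect the text found inside <strong> elements."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # how many <strong> tags are currently open
        self.names = []     # one entry per completed <strong> element
        self._parts = []    # text pieces of the element being read

    def handle_starttag(self, tag, attrs):
        if tag == 'strong':
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == 'strong' and self.depth:
            self.depth -= 1
            if self.depth == 0:
                # The outermost <strong> just closed: store its text.
                self.names.append(''.join(self._parts).strip())
                self._parts = []

    def handle_data(self, data):
        # Keep text only while inside a <strong> element, nested tags included.
        if self.depth:
            self._parts.append(data)

def extract_names(html):
    parser = StrongTextParser()
    parser.feed(html)
    parser.close()
    return parser.names

sample = ';"><strong>deanskyshadow</strong>\n;"><strong><em>xavier</em></strong>\n'
print(extract_names(sample))  # ['deanskyshadow', 'xavier']
```

Nested tags such as <em> need no special handling here, since the parser reports their text through handle_data like any other text.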