python - Regex to extract names from HTML
I have 2 pieces of code below that I want to extract names from.
Code:
    ;"><strong>deanskyshadow</strong>
    ;"><strong><em>xavier</em></strong>
The regex should extract the names deanskyshadow and xavier. My current regex:
    (?<=(;"><strong><em>)|(;"><strong>))[\s\S]+?(?=(</em></strong>)|(</strong>))
It grabs the names correctly if there is no em tag in the code; if there is one, it also grabs the opening em tag, like this: <em>xavier. How can I fix that?
Match anything that is not a < character. You cannot use a variable-width look-behind in Python's re module, so your version doesn't work at all there. Use a non-capturing pattern instead:
    (?:;"><strong>(?:<em>)?)([^<]+?)(?=(?:</em>)?</strong>)
Demo:
    >>> import re
    >>> sample = '''\
    ... ;"><strong>deanskyshadow</strong>
    ... ;"><strong><em>xavier</em></strong>
    ... '''
    >>> re.findall(r'(?:;"><strong>(?:<em>)?)([^<]+?)(?=(?:</em>)?</strong>)', sample)
    ['deanskyshadow', 'xavier']
The better solution is to use an HTML parser instead. I can recommend BeautifulSoup:
    from bs4 import BeautifulSoup

    # htmltext holds the HTML source as a string
    soup = BeautifulSoup(htmltext, 'html.parser')
    for strong in soup.find_all('strong'):
        print(strong.text)
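If installing a third-party package is not an option, the same parser-based idea can be sketched with the standard library's html.parser module. This is only a minimal sketch and not part of the original answer: the class name StrongTextParser and the helper extract_names are made up for illustration, and it assumes the names are always the text inside <strong> elements.

```python
from html.parser import HTMLParser


class StrongTextParser(HTMLParser):
    """Collect the text found inside each <strong> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # number of <strong> tags currently open
        self.parts = []   # text pieces of the <strong> element being read
        self.names = []   # finished names, in document order

    def handle_starttag(self, tag, attrs):
        if tag == "strong":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "strong" and self.depth:
            self.depth -= 1
            if self.depth == 0:
                # Outermost </strong> reached: store the collected text.
                self.names.append("".join(self.parts).strip())
                self.parts = []

    def handle_data(self, data):
        # Nested tags such as <em> are ignored; only their text is kept.
        if self.depth:
            self.parts.append(data)


def extract_names(html_text):
    """Return the text of every <strong> element in html_text."""
    parser = StrongTextParser()
    parser.feed(html_text)
    parser.close()
    return parser.names


sample = ';"><strong>deanskyshadow</strong>\n;"><strong><em>xavier</em></strong>\n'
print(extract_names(sample))  # ['deanskyshadow', 'xavier']
```

Because the parser tracks tags rather than characters, the optional <em> wrapper needs no special handling, which is the main advantage over the regex.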