Python - Regex to extract names from HTML


I have two pieces of code below, and I want to extract the names from them.

Code:

 ;"><strong>deanskyshadow</strong>
 ;"><strong><em>xavier</em></strong>

The regex should extract the names deanskyshadow and xavier. My current regex:

(?<=(;"><strong><em>)|(;"><strong>))[\s\S]+?(?=(</em></strong>)|(</strong>))

It grabs the names correctly if there is no em tag in the code; if there is one, it also grabs the opening em tag, like this: <em>xavier. How can I fix that?

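Match anything that is not a < character. Python's re module cannot use a variable-width look-behind, so your version doesn't work at all there. Use a non-capturing group for the prefix and a capturing group for the name instead: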

(?:;"><strong>(?:<em>)?)([^<]+?)(?=(?:</em>)?</strong>) 

Demo:

>>> import re
>>> sample = '''\
...  ;"><strong>deanskyshadow</strong>
...  ;"><strong><em>xavier</em></strong>
... '''
>>> re.findall(r'(?:;"><strong>(?:<em>)?)([^<]+?)(?=(?:</em>)?</strong>)', sample)
['deanskyshadow', 'xavier']
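To make the look-behind point concrete, here is a small sketch that checks whether each pattern compiles in Python's re module and then runs the working one. The helper names compiles and extract_names are made up for this illustration, and the question's pattern is written with [\s\S] as its "any character" class, which is an assumption about what was intended.

```python
import re

# Pattern from the question (assuming [\s\S] was meant).
# The two look-behind alternatives have different lengths.
QUESTION_PATTERN = r'(?<=(;"><strong><em>)|(;"><strong>))[\s\S]+?(?=(</em></strong>)|(</strong>))'

# Pattern from the answer: consume the prefix in a non-capturing group
# and capture only the text up to the next "<".
ANSWER_PATTERN = r'(?:;"><strong>(?:<em>)?)([^<]+?)(?=(?:</em>)?</strong>)'


def compiles(pattern):
    """Return True if Python's re module accepts the pattern (hypothetical helper)."""
    try:
        re.compile(pattern)
    except re.error:
        return False
    return True


def extract_names(html):
    """Return the captured name for each match of ANSWER_PATTERN (hypothetical helper)."""
    return re.findall(ANSWER_PATTERN, html)


sample = ' ;"><strong>deanskyshadow</strong>\n ;"><strong><em>xavier</em></strong>\n'

print(compiles(QUESTION_PATTERN))  # False: look-behind requires a fixed-width pattern
print(compiles(ANSWER_PATTERN))    # True
print(extract_names(sample))       # ['deanskyshadow', 'xavier']
```

Because the pattern has exactly one capturing group, re.findall returns just the captured names rather than the whole match.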

The better solution is to use an HTML parser instead. I can recommend BeautifulSoup:

from bs4 import BeautifulSoup

# htmltext is the HTML source as a string
soup = BeautifulSoup(htmltext, 'html.parser')

for strong in soup.find_all('strong'):
    print(strong.text)
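If installing a third-party package is not an option, the same parser-based idea can be sketched with the standard library's html.parser module. This swaps in a different parser for BeautifulSoup and is only a rough illustration: the class StrongTextCollector, the function strong_texts, and the sample markup are all invented here, since the question only shows truncated fragments of the real HTML.

```python
from html.parser import HTMLParser


class StrongTextCollector(HTMLParser):
    """Collect the text found inside each outermost <strong> element (hypothetical class)."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # number of <strong> tags currently open
        self.names = []     # finished text, one entry per outermost <strong>
        self._parts = []    # text pieces of the <strong> being read

    def handle_starttag(self, tag, attrs):
        if tag == 'strong':
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == 'strong' and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.names.append(''.join(self._parts).strip())
                self._parts = []

    def handle_data(self, data):
        # Nested tags such as <em> are ignored; only their text is kept.
        if self.depth > 0:
            self._parts.append(data)


def strong_texts(html):
    """Return the text of every <strong> element in html (hypothetical helper)."""
    parser = StrongTextCollector()
    parser.feed(html)
    parser.close()
    return parser.names


# Made-up sample markup; the real page structure is not shown in the question.
html = '<p><strong>deanskyshadow</strong></p><p><strong><em>xavier</em></strong></p>'
print(strong_texts(html))  # ['deanskyshadow', 'xavier']
```

Unlike the regex, this does not depend on the exact characters before the strong tag, so it keeps working if attributes or nesting change.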
