regex - Python: split a Unicode string by 3-byte UTF-8 characters


Suppose I have a Unicode string in Python:

s = u"abc你好def啊"

Now I want to split it at the boundaries between ASCII and non-ASCII characters, so that result = ["abc", "你好", "def", "啊"].

So, how can I implement that?

With a regex, you can match alternating runs of characters that are, or are not, ASCII letters and digits:

>>> import re
>>> re.findall('([a-zA-Z0-9]+|[^a-zA-Z0-9]+)', u"abc你好def啊")
['abc', '你好', 'def', '啊']
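One thing to watch with this pattern: anything that is not an ASCII letter or digit, including spaces and ASCII punctuation, falls into the second alternative and ends up grouped with the Chinese text. A small sketch of that behaviour, assuming Python 3 (the input string is made up for illustration):

```python
import re

# Same two alternatives as above: alphanumeric runs vs. everything else.
pattern = re.compile(r'[a-zA-Z0-9]+|[^a-zA-Z0-9]+')

# The comma, space and "!" are ASCII, but they are not in [a-zA-Z0-9],
# so they are absorbed into the same run as the Chinese characters.
parts = pattern.findall("abc, 你好!def")
print(parts)  # ['abc', ', 你好!', 'def']
```

If punctuation should count as ASCII, a wider character class is needed.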

Or, using all printable ASCII characters:

>>> ascii = ''.join(chr(x) for x in range(33, 127))
>>> re.findall('([{}]+|[^{}]+)'.format(ascii, ascii), u"abc你好def啊")
['abc', '你好', 'def', '啊']
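A caveat when building a character class from a generated string: characters such as `\`, `]`, `^` and `-` have special meaning inside `[...]`. In the string above, `\]` is read as an escaped bracket, so the backslash itself appears to be left out of the class. The sketch below is not part of the original answer; it assumes Python 3 and uses `re.escape` to neutralise those characters. The variable names `printable` and `escaped` are only illustrative.

```python
import re

# Printable ASCII, excluding the space (code points 33..126).
printable = ''.join(chr(x) for x in range(33, 127))

# Escape regex-special characters so each one is taken literally
# when placed inside the character class.
escaped = re.escape(printable)
pattern = re.compile('[{0}]+|[^{0}]+'.format(escaped))

# The backslash now stays inside the ASCII run.
print(pattern.findall("a\\b你好"))  # ['a\\b', '你好']
```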

Or, more simply, as suggested by @dolda2000:

>>> re.findall('([ -~]+|[^ -~]+)', u"abc你好def啊")
['abc', '你好', 'def', '啊']
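The range ` -~` covers the space through the tilde, i.e. printable ASCII. For reuse, the same idea can be wrapped in a small function. This is only a sketch under a few assumptions: Python 3, a hypothetical helper name `split_ascii_runs`, and the full ASCII range `\x00-\x7f` instead of ` -~`, so that control characters such as newlines also count as ASCII.

```python
import re

# One alternative matches a run of ASCII characters (U+0000..U+007F),
# the other a run of anything outside that range.
_RUNS = re.compile(r'[\x00-\x7f]+|[^\x00-\x7f]+')

def split_ascii_runs(text):
    """Return the maximal ASCII and non-ASCII runs of text, in order."""
    return _RUNS.findall(text)

print(split_ascii_runs("abc你好def啊"))  # ['abc', '你好', 'def', '啊']
```

Because every character belongs to exactly one of the two classes, joining the returned pieces reproduces the input string.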
