regex - Python: split a unicode string by 3-byte UTF-8 characters

Suppose I have a unicode string in Python:

s = u"abc你好def啊"

Now I want to split it at the non-ASCII characters, so that result = ["abc", "你好", "def", "啊"]

So, how can I implement that?

With a regex, you can split between runs of characters that are, or are not, ASCII letters and digits.

>>> import re
>>> re.findall('([a-zA-Z0-9]+|[^a-zA-Z0-9]+)', u"abc你好def啊")
['abc', '你好', 'def', '啊']

Or, using all the printable ASCII characters:

>>> ascii = ''.join(chr(x) for x in range(33, 127))
>>> re.findall('([{}]+|[^{}]+)'.format(ascii, ascii), u"abc你好def啊")
['abc', '你好', 'def', '啊']
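The character class above is built by plain string formatting, and several printable ASCII characters (such as `\`, `]`, `^` and `-`) have a special meaning inside a class. A slightly more defensive sketch, assuming Python 3 and using the illustrative names `printable`, `escaped` and `pattern` (none of which appear in the original answer), passes the characters through `re.escape` first:

```python
import re

# Printable ASCII characters from '!' (33) to '~' (126); space is not included.
printable = ''.join(chr(x) for x in range(33, 127))

# re.escape backslash-escapes characters that are special in a pattern,
# so ']', '\\', '^' and '-' are treated literally inside the class.
escaped = re.escape(printable)
pattern = re.compile(u'[{0}]+|[^{0}]+'.format(escaped))

parts = pattern.findall(u"abc你好def啊")
print(parts)
```

For the sample string this should give the same four pieces as the snippet above; the difference only shows up when the input itself contains characters such as a backslash.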

Or, more simply, as suggested by @dolda2000:

>>> re.findall('([ -~]+|[^ -~]+)', u"abc你好def啊")
['abc', '你好', 'def', '啊']
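For reuse, the `[ -~]` pattern can be wrapped in a small helper. This is only a sketch under the assumption of Python 3; the names `_RUNS` and `split_ascii_runs` are made up for illustration. The range `' '` to `'~'` covers code points 0x20 through 0x7E, so spaces and punctuation stay with the ASCII runs:

```python
import re

# ' ' (0x20) through '~' (0x7E) is the printable ASCII range, space included.
_RUNS = re.compile(u'[ -~]+|[^ -~]+')

def split_ascii_runs(text):
    """Return alternating runs of printable-ASCII and all other characters."""
    return _RUNS.findall(text)

print(split_ascii_runs(u"abc你好def啊"))        # ['abc', '你好', 'def', '啊']
print(split_ascii_runs(u"hello world, 你好!"))  # ['hello world, ', '你好', '!']
```

Note that the pattern works on code points, not on UTF-8 byte lengths: anything outside the printable ASCII range, including control characters such as newlines and characters that take 2 or 4 bytes in UTF-8, ends up in the second group alongside the 3-byte CJK characters.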
