regex - Python: split a Unicode string by 3-byte UTF-8 characters
Suppose I have a Unicode string in Python:
s = u"abc你好def啊"
Now I want to split it at the boundaries between ASCII and non-ASCII characters, so that the result is result = ["abc", "你好", "def", "啊"]
How can I implement that?
With a regex, you can split the string into runs that do or do not consist of ASCII letters and digits:
>>> import re
>>> re.findall('([a-zA-Z0-9]+|[^a-zA-Z0-9]+)', u"abc你好def啊")
['abc', '你好', 'def', '啊']
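Or, using all printable ASCII characters (codes 33 through 126), escaped so that characters such as ] and \ are safe inside the character class:
>>> ascii = re.escape(''.join(chr(x) for x in range(33, 127)))
>>> re.findall('([{}]+|[^{}]+)'.format(ascii, ascii), u"abc你好def啊")
['abc', '你好', 'def', '啊']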
Or, more simply, as suggested by @dolda2000, using the character range from space to tilde:
>>> re.findall('([ -~]+|[^ -~]+)', u"abc你好def啊")
['abc', '你好', 'def', '啊']
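For comparison, the same split can probably be done without a regex. The following is only a minimal sketch, assuming that "ASCII" means any code point below 128; the function name split_ascii_runs is made up for illustration and is not from the answer above.

```python
from itertools import groupby


def split_ascii_runs(text):
    """Split text into maximal runs of ASCII and non-ASCII characters.

    Assumption: a character counts as ASCII when its code point is < 128.
    """
    # groupby starts a new group each time the key (ASCII or not) changes,
    # so each group is one maximal run of the same kind of character.
    return ["".join(run) for _, run in groupby(text, key=lambda ch: ord(ch) < 128)]


print(split_ascii_runs(u"abc你好def啊"))
# Python 3 prints: ['abc', '你好', 'def', '啊']
```

One difference from the [ -~] pattern: control characters such as tab or newline have code points below 128 but lie outside the range from space to tilde, so the regex groups them with the non-ASCII text, while this sketch keeps them in the ASCII run.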