Friday, February 13, 2015

Splitting a string on multiple different delimiters

By Vasudev Ram

Just recently I was working on some ideas related to my text file indexing program - which I had blogged about earlier, here:

A simple text file indexing program in Python

As part of that work, I was using Python's string split() method and found that it has a limitation: the separator can be only one string (though that string can comprise more than one character).
To work out a solution (i.e. a way to split a string on any one of a set of separator / delimiter characters), I tried these commands interactively in the Python interpreter:

>>> print "".split.__doc__
S.split([sep [,maxsplit]]) -> list of strings

Return a list of the words in the string S, using sep as the
delimiter string. If maxsplit is given, at most maxsplit
splits are done. If sep is not specified or is None, any
whitespace string is a separator and empty strings are removed
from the result.

>>> s = "abc.def;ghi"
>>> s.split(".;")
['abc.def;ghi']
>>> s.split(".")
['abc', 'def;ghi']
>>> '---'.join(s.split("."))
'abc---def;ghi'
>>> '---'.join(s.split(".")).split(";")
['abc---def', 'ghi']
>>> "---".join('---'.join(s.split(".")).split(";"))
'abc---def---ghi'
>>> "---".join('---'.join(s.split(".")).split(";")).split('---')
['abc', 'def', 'ghi']
>>>

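So you can see that by doing repeated manual split() and join() calls, I was able to split the original string the way I wanted, i.e. on both the period and the semicolon as delimiters. Each step replaces one delimiter with a common placeholder ('---'), and the final split() on that placeholder yields the parts; this works only if the placeholder does not already occur in the string. I'll work out a function or class to do it and then blog about it in a sequel to this post.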
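The repeated split-and-join steps above could be wrapped in a small function along these lines. This is only a sketch, not the function promised for the sequel: the name split_on_any is hypothetical, and it assumes the delimiters are non-empty strings that do not overlap with one another. Instead of a separate placeholder like '---', it rewrites every delimiter to the first one and then splits once.

```python
def split_on_any(text, delimiters):
    """Split text on any of the given delimiters (hypothetical helper).

    Assumes each delimiter is a non-empty string and that the
    delimiters do not overlap with one another.
    """
    if not delimiters:
        return [text]
    first = delimiters[0]
    # Same trick as the interactive session: split on one delimiter,
    # then join the pieces with a common delimiter (here, the first one).
    for delim in delimiters[1:]:
        text = first.join(text.split(delim))
    # Only the first delimiter remains, so a single split finishes the job.
    return text.split(first)


parts = split_on_any("abc.def;ghi", [".", ";"])
print(parts)
```

Reusing the first delimiter as the common one avoids having to pick a placeholder that is guaranteed not to appear in the input.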

(Using regular expressions to match the delimiters, and extracting all but the matched parts, may be one way to do it, but I'll try another approach. There are probably many ways to go about it.)
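For comparison, here is a minimal sketch of the regular-expression route, using re.split() from the standard library with a character class that lists the delimiters. The variable names are only illustrative.

```python
import re

s = "abc.def;ghi"

# Inside a character class, "." and ";" are matched literally,
# so this splits on either a period or a semicolon.
parts = re.split(r"[.;]", s)
print(parts)  # ['abc', 'def', 'ghi']

# Like str.split() with an explicit separator, adjacent delimiters
# produce empty strings in the result.
print(re.split(r"[.;]", "a.;b"))  # ['a', '', 'b']
```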

- Vasudev Ram - Dancing Bison Enterprises


2 comments:

nharding said...

There is an easy way with regular expressions.

import re
s = "abc.def;ghi"
parts = re.findall("[^.;]+", s)

In the regular expression, start the [] with ^ (to say NOT these characters), then put the delimiters. This only works with single-character delimiters.

Vasudev Ram said...


Nice solution - thanks.
Yes, it does have the limitation of single characters.
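One possible way around the single-character limitation is to build an alternation pattern from the delimiters and pass it to re.split(). The sketch below assumes a non-empty list of non-empty delimiter strings; the function name split_on_delimiters is hypothetical.

```python
import re


def split_on_delimiters(text, delimiters):
    """Split text on any of the given (possibly multi-character) delimiters.

    Hypothetical helper; assumes every delimiter is a non-empty string.
    """
    if not delimiters:
        return [text]
    # Try longer delimiters first, so that "->" is matched before "-".
    ordered = sorted(delimiters, key=len, reverse=True)
    # re.escape() makes characters such as "." match literally.
    pattern = "|".join(re.escape(d) for d in ordered)
    return re.split(pattern, text)


print(split_on_delimiters("abc--def;ghi::jkl", ["--", ";", "::"]))
# ['abc', 'def', 'ghi', 'jkl']
```

Unlike the findall() version, this keeps empty strings between adjacent delimiters, which matches the behaviour of str.split() with an explicit separator.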