Wednesday, September 12, 2018

How many ways can you substring a string? Part 1


By Vasudev Ram




String image attribution

Recently, something I read made me think of writing a simple program to generate all substrings of a given string.
(To be precise, excluding the null string.)

Here is an initial version I came up with, all_substrings.py:
"""
all_substrings.py
Function and program to find all the substrings of a given string.
Author: Vasudev Ram
Copyright 2018 Vasudev Ram
Web site: https://vasudevram.github.io
Blog: https://jugad2.blogspot.com
Twitter: https://mobile.twitter.com/vasudevram
Product Store: https://gumroad.com/vasudevram
"""

from __future__ import print_function
import sys
from error_exit import error_exit
from debug1 import debug1

def usage():
    message_lines = [\
        "Usage: python {} a_string".format(sa[0]),
        "Print all substrings of a_string.",
    ]
    sys.stderr.write("\n".join(message_lines))

def all_substrings(s):
    """
    Generator function that yields all the substrings of a given string.
    """

    ls = len(s)
    if ls == 0:
        usage()
        error_exit("\nError: String argument must be non-empty.")

    start = 0
    while start < ls:
        end = start + 1
        while end <= ls:
            debug1("s[{}:{}] = {}".format(start, end, s[start:end]))
            yield s[start:end]
            end += 1
        start += 1

def main():
    if lsa != 2:
        usage()
        error_exit("\nError: Exactly one argument must be given.")

    for substring in all_substrings(sa[1]):
        print(substring)

sa = sys.argv
lsa = len(sa)

if __name__ == "__main__":
    main()
Some runs and output of the program:

With no command-line arguments:
$ python all_substrings.py
Usage: python all_substrings.py a_string
Print all substrings of a_string.
Error: Exactly one argument must be given.
With one command-line argument, an empty string:
$ python all_substrings.py ""
Usage: python all_substrings.py a_string
Print all substrings of a_string.
Error: String argument must be non-empty.
Now with a 3-character string, with debugging enabled, via the use of my debug1 debugging function [1] (and Python's __debug__ built-in variable, which is set to True by default):
$ python all_substrings.py abc
s[0:1] = a
a
s[0:2] = ab
ab
s[0:3] = abc
abc
s[1:2] = b
b
s[1:3] = bc
bc
s[2:3] = c
c
[1] You can read about and get the code for that debugging function here:

Improved simple Python debugging function

The remaining runs are with debugging turned off via Python's -O flag:

With a 4-character string:
$ python -O all_substrings.py abcd
a
ab
abc
abcd
b
bc
bcd
c
cd
d
With a 4-character string, not all characters unique:
$ python -O all_substrings.py FEED
F
FE
FEE
FEED
E
EE
EED
E
ED
D
Note that when there are duplicated characters in the input, we can get duplicate substrings in the output; in this case, E appears twice.

With a string of length 6, again with some characters repeated (E and D):
$ python -O all_substrings.py FEEDED
F
FE
FEE
FEED
FEEDE
FEEDED
E
EE
EED
EEDE
EEDED
E
ED
EDE
EDED
D
DE
DED
E
ED
D
Again, we get duplicate substrings in the output.

With a 6-character string, no duplicate characters:
$ python -O all_substrings.py 123456
1
12
123
1234
12345
123456
2
23
234
2345
23456
3
34
345
3456
4
45
456
5
56
6
Is there any other way of doing it?
Any interesting enhancements possible?

Yes to both questions.
I will cover some of those points in a follow-up post.

Actually, I already did one thing in the current version, which is of interest: I used a generator to yield the substrings lazily, instead of creating them all upfront, and then returning them all in a list. I'll show and discuss a few pros and cons of some other approaches later.

Meanwhile, want to have a bit of fun with visual effects?

Try some variations of runs of the program like these:


python -O all_substrings.py +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-

python -O all_substrings.py /\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\

python -O all_substrings.py "%% $$ @@ && %% $$ @@ && %% $$ @@ && %% $$ @@ && %% $$ @@ && %% $$ @@ && %% $$ @@ && %% $$ @@ && %% $$ @@ && %% $$ @@ && %% $"

$ python -O all_substrings.py 10101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010

python -O all_substrings.py ">>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>"

You can change the characters used in the string argument to any combination of any punctuation characters, or even letters or digits - anything you like. You can also vary the number of characters used in the string. Longer ones (though not too long) tend to give better visual effects and the display also lasts for longer. Note that all the characters of the string you use, should come on the same single line, the same line as the python command you use. Also, if using a pipe character (|) (or any other characters that are special to your OS shell), enclose the whole string in quotes as I have done in an example above. I ran this on Windows and so used double quotes for such cases. Single quotes give errors. On Unix-like systems, either may work, but some characters may get interpreted inside double quotes. Experiment :)

You can also add an import time statement in the imports section of the program, and then use a time.sleep(number) inside the for loop, say, just above the print(substring) statement. I used values like:
time.sleep(0.002)
which works well for my display. You can tweak that number for your hardware.

- Have fun.

Did you know that there are a large number of meanings and contexts for the word string? Here are some of them:

String (Wikipedia).

This Wikipedia article about strings in computer science is interesting, and has a lot more points than one might imagine at first:

(computer) strings


- Vasudev Ram - Online Python training and consulting

Hit the ground running with my vi quickstart tutorial, vetted by two Windows system administrator friends.

Jump to posts: Python * DLang * xtopdf

Interested in a Python, SQL or Linux course?

Get WP Engine, powerful managed WordPress hosting.

Subscribe to my blog (jugad2.blogspot.com) by email

My ActiveState Code recipes


Follow me on:

Gumroad * LinkedIn * Twitter

Do you create online products? Get Convertkit:

Email marketing for digital product creators


4 comments:

Andy Gimblett said...

Just use itertools:

list((x[i:j] for i, j in itertools.combinations(range(len(x) + 1), 2)))

returns all substrings of the string x

:-)

Vasudev Ram said...


Just saw the comment now, some Google issue maybe.

Clever. But that list()-based approach may crash for a very large string, where the list of all its substrings cannot fit into memory together, while the generator-based approach may work for the same sized string, because it yields the substrings one at a time.

Andy Gimblett said...

It's a generator expression wrapped in a list() call. If you want a bare iterator, just remove the list() call:

(x[i:j] for i, j in itertools.combinations(range(len(x) + 1), 2))

Vasudev Ram said...

Oh, yes, missed that on first look, though I knew about genexps. Good one then.

Need to up my FP-fu, thanks.