Sunday, September 23, 2018

How many ways can you substring a string? Part 2


By Vasudev Ram


Twine image attribution

Hi readers,

In my last post, i.e.:

How many ways can you substring a string? Part 1, I had said that there can be other ways of doing it, and that some enhancements were possible. This post (Part 2) is about that.

Here is another algorithm to find all substrings of a given string:

Let s be the input string.
Let n be the length of s.
Find and yield all substrings of s of length 1.
Find and yield all substrings of s of length 2.
...
Find and yield all substrings of s of length n.

Even without doing any formal analysis of the algorithm, we can intuitively see that it is correct, because it accounts for all possible cases (except for the empty string, but adding that is trivial).

[ BTW, what about uses for this sort of program? Although I just wrote it for fun, one possible use could be in word games like Scrabble. ]

The code for this new algorithm is in program all_substrings2.py below.
"""
all_substrings2.py
Function and program to find all substrings of a given string.
Author: Vasudev Ram
Copyright 2018 Vasudev Ram
Web site: https://vasudevram.github.io
Blog: https://jugad2.blogspot.com
Twitter: https://mobile.twitter.com/vasudevram
Product store: https://gumroad.com/vasudevram
"""

from __future__ import print_function
import sys
from error_exit import error_exit
from debug1 import debug1

def usage():
    message_lines = [\
        "Usage: python {} a_string".format(sa[0]),
        "Print all substrings of a_string.",
        "",
    ]
    sys.stderr.write("\n".join(message_lines))

def all_substrings2(s):
    """
    Generator function that yields all the substrings 
    of a given string s.
    Algorithm used:
    1. len_s = len(s)
    2. if len_s == 0, return ""
    3. (else len_s is > 0):
       for substr_len in 1 to len_s:
           find all substrings of s that are of length substr_len
           yield each such substring 
    Expected output for some strings:
    For "a":
        "a"
    For "ab":
        "a"
        "b"
        "ab"
    For "abc:
        "a"
        "b"
        "c"
        "ab"
        "bc"
        "abc"
    For "abcd:
        "a"
        "b"
        "c"
        "d"
        "ab"
        "bc"
        "cd"
        "abc"
        "bcd"
        "abcd"
    """

    len_s = len(s)
    substr_len = 1
    while substr_len <= len_s:
        start = 0
        end = start + substr_len
        while end <= len_s:
            debug1("s[{}:{}] = {}".format(\
                start, end, s[start:end]))
            yield s[start:end]
            start += 1
            end = start + substr_len
        substr_len += 1

def main():
    if lsa != 2:
        usage()
        error_exit("\nError: Exactly one argument must be given.\n")

    if sa[1] == "":
        print("")
        sys.exit(0)

    for substring in all_substrings2(sa[1]):
        print(substring)

sa = sys.argv
lsa = len(sa)

if __name__ == "__main__":
    main()
BTW, I added the empty string as the last item in the message_lines list (in the usage() function), as a small trick, to avoid having to explicitly add an extra newline after the joined string in the write() call.

Here are some runs of the program, with outputs, using Python 2.7 on Linux:

(pyo, in the commands below, is a shell alias I created for 'python -O', to disable debugging output. And a*2*y expands to all_substrings2.py, since there are no other filenames matching that wildcard pattern in my current directory. It's a common Unix shortcut to save typing. In fact, the bash shell expands that shortcut to the full filename when you type the pattern and then press Tab. But the expansion happens without pressing Tab too, if you just type that command and hit Enter. But you have to know for sure, up front, that the wildcard expands to only one filename (if you want that), or you can get wrong results, e.g. if such a wildcard expands to 3 filenames, and your program expects command-line arguments, the 2nd and 3rd filenames will be treated as command-line arguments for the program represented by the 1st filename. This will likely not be what you want, and may create problems.)

Run it without any arguments:
$ pyo a*2*y
Usage: python all_substrings2.py a_string
Print all substrings of a_string.

Error: Exactly one argument must be given.
Run a few times with some input strings of incrementally longer lengths:
$ pyo a*2*y a
a
$ pyo a*2*y ab
a
b
ab
$ pyo a*2*y abc
a
b
c
ab
bc
abc
$ pyo a*2*y abcd
a
b
c
d
ab
bc
cd
abc
bcd
abcd
Count the number of substrings in the above run for string abcd:
$ pyo a*2*y abcd | wc -l
10
$ pyo a*2*y abcde
a
b
c
d
e
ab
bc
cd
de
abc
bcd
cde
abcd
bcde
abcde
Count the number of substrings in the above run for string abcde:
$ pyo a*2*y abcde | wc -l
15
$ pyo a*2*y abcdef
a
b
c
d
e
f
ab
bc
cd
de
ef
abc
bcd
cde
def
abcd
bcde
cdef
abcde
bcdef
abcdef
Count the number of substrings in the above run for string abcdef:
$ pyo a*2*y abcdef | wc
     21      21      77
Now a few more with only the count:
$ pyo a*2*y abcdefg | wc
     28      28     112
$ pyo a*2*y abcdefgh | wc
     36      36     156
$ pyo a*2*y abcdefghi | wc
     45      45     210
Notice a pattern?

The count of substrings for each succeeding run (which has one more character in the input string than the preceding run has), is equal to the sum of the count for the preceding run and the length of the input string for the succeeding run; e.g. 10 + 5 = 15, 15 + 6 = 21, 21 + 7 = 28, etc. This is the same as the sum of the first n natural numbers.

There is a well-known formula for that sum: n * (n + 1) / 2.

There is a story (maybe apocryphal) that the famous mathematician Gauss was posed this problem - to find the sum of the numbers from 1 to 100 - by his teacher, after he misbehaved in class. To the surprise of the teacher, he gave the answer in seconds. From the Wikipedia article about Gauss:

[ Gauss's presumed method was to realize that pairwise addition of terms from opposite ends of the list yielded identical intermediate sums: 1 + 100 = 101, 2 + 99 = 101, 3 + 98 = 101, and so on, for a total sum of 50 × 101 = 5050. ]

From this we can see that the sum of this sequence satisfies the formula n * (n + 1) / 2, where n = 100, i.e. 100 * (100 + 1) / 2 = 50 * 101.

(Wikipedia says that Gauss "is ranked among history's most influential mathematicians".)

We can also run the all_substrings2.py program multiple times with different inputs, using a for loop in the shell:
$ for s in a ab abc abcd
> do
>   echo All substrings of $s:
>   pyo al*2*py $s
> done

All substrings of a:
a
All substrings of ab:
a
b
ab
All substrings of abc:
a
b
c
ab
bc
abc
All substrings of abcd:
a
b
c
d
ab
bc
cd
abc
bcd
abcd
Some remarks on the program versions shown (i.e. all_substrings.py and all_substrings2.py, in Parts 1 and 2 respectively):

Both versions use a generator function, to lazily yield each substring on demand. Either version can easily be changed to use a list instead of a generator (and the basic algorithm used will not need to change, in either case.) To do that, we have to delete the yield statement, collect all the generated substrings in a new list, and at the end, return that list to the caller. The caller's code will not need to change, although we will now be iterating over the list returned from the function, not over the values yielded by the generator. Some of the pros and cons of the two approaches (generator vs. list) are:

- the list approach has to create and store all the substrings first, before it can return them. So it uses memory proportional to the sum of the sizes of all the substrings generated, with some overhead due to Python's dynamic nature (but that per-string overhead exists for the generator approach too). (See this post: Exploring sizes of data types in Python.) The list approach will need a bit of overhead for the list too. But the generator approach needs to handle only one substring at a time, before yielding it to the caller, and does not use a list. So it will potentially use much less memory, particularly for larger input strings. The generator approach may even be faster than the list version, since repeated memory (re)allocation for the list (as it expands) has some overhead. But that is speculation on my part as of now. To be sure of it, one would have to do some analysis and/or some speed measurements of relevant test programs.

- the list approach gives you the complete list of substrings (after the function that generates them returns). So, in the caller, if you want to do multiple processing passes over them, you can. But the generator approach gives you each substring immediately as it is generated, you have to process it, and then it is gone. So you can only do one processing pass over the substrings generated. In other words, the generator's output is sequential-access, forward-only, one-item-at-a-time-only, and single-pass-only. (Unless you store all the yielded substrings, but then that becomes the same as the list approach.)

Another enhancement that can be useful is to output only the unique substrings. As I showed in Part 1, if there are any repeated characters in the input string, there can be duplicate substrings in the output. There are two obvious ways of getting only unique substrings:

1) By doing it internal to the program, using a Python dict. All we have to do is add each substring (as a key, with the corresponding value being anything, say None), to a dict, as and when the substring is generated. Then the substrings in the dict are guaranteed to be unique. Then at the end, we just print the substrings from the dict instead of from the list. If we want to print the substrings in the same order they were generated, we can use an OrderedDict.

See: Python 2 OrderedDict
and: Python 3 OrderedDict

(Note: In Python 3.7, OrderedDict may no longer be needed, because dicts are defined as keeping insertion order.)

2) By piping the output of the program (which is all the generated substrings, one per line) to the Unix uniq command, whose purpose is to select only unique items from its input. But for that, we have to sort the list first, since uniq requires sorted input to work properly. We can do that with pipelines like the following:

First, without sort and uniq; there are duplicates:

$ pyo all_substrings2.py aabbb | nl -ba
1 a
2 a
3 b
4 b
5 b
6 aa
7 ab
8 bb
9 bb
10 aab
11 abb
12 bbb
13 aabb
14 abbb
15 aabbb

Then with sort and uniq; now there are no duplicates:

$ pyo all_substrings2.py aabbb | sort | uniq | nl -ba
1 a
2 aa
3 aab
4 aabb
5 aabbb
6 ab
7 abb
8 abbb
9 b
10 bb
11 bbb

The man pages for sort and uniq are here:

sort
uniq

That's it for now. I have a few more points which I may want to add; if I decide to do so, I'll do them in a Part 3 post.

The image at the top of the post is of spools of twine (a kind of string) from Wikipedia.

- Enjoy.


- Vasudev Ram - Online Python training and consulting

I conduct online courses on Python programming, Unix/Linux (commands and shell scripting) and SQL programming and database design, with personal coaching sessions.

Contact me for details of course content, terms and schedule.

DPD: Digital Publishing for Ebooks and Downloads.

Hit the ground running with my vi quickstart tutorial. I wrote it at the request of two Windows system administrator friends who were given additional charge of some Unix systems. They later told me that it helped them to quickly start using vi to edit text files on Unix.

Check out WP Engine, powerful WordPress hosting.

Creating online products for sale? Check out ConvertKit, email marketing for online creators.

Teachable: feature-packed course creation platform, with unlimited video, courses and students.

Track Conversions and Monitor Click Fraud with Improvely.

Posts about: Python * DLang * xtopdf

My ActiveState Code recipes

Follow me on:


No comments: