Showing posts with label text-file-indexing. Show all posts

Friday, February 13, 2015

Splitting a string on multiple different delimiters

By Vasudev Ram

I was recently working on some ideas related to my text file indexing program, which I had blogged about earlier, here:

A simple text file indexing program in Python

As part of that work, I was using Python's string split() method. I found that it has a limitation: the separator can be only one string (though that string can comprise more than one character).
Trying to work out a solution for that (i.e. the ability to split a string on any one of a set of separator / delimiter characters), I entered these commands interactively in the Python interpreter:

>>> print "".split.__doc__
S.split([sep [,maxsplit]]) -> list of strings

Return a list of the words in the string S, using sep as the
delimiter string. If maxsplit is given, at most maxsplit
splits are done. If sep is not specified or is None, any
whitespace string is a separator and empty strings are removed
from the result.

>>> s = "abc.def;ghi"
>>> s.split(".;")
['abc.def;ghi']
>>> s.split(".")
['abc', 'def;ghi']
>>> '---'.join(s.split("."))
'abc---def;ghi'
>>> '---'.join(s.split(".")).split(";")
['abc---def', 'ghi']
>>> "---".join('---'.join(s.split(".")).split(";"))
'abc---def---ghi'
>>> "---".join('---'.join(s.split(".")).split(";")).split('---')
['abc', 'def', 'ghi']
>>>

So you can see that by doing repeated manual split() and join() calls, I was able to split the original string the way I wanted, i.e. on both the period and the semicolon as delimiters. I'll work out a function or class to do it and then blog about it in a sequel to this post.

(Using regular expressions to match the delimiters, and extracting all but the matched parts, may be one way to do it, but I'll try another approach. There probably are many ways to go about it).
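
As a rough illustration of the two directions mentioned above, here is a minimal sketch. It is not the function promised for the sequel; the names split_on_any and split_on_any_no_regex are made up for this example, and both assume a non-empty string of single-character delimiters. The first uses re.split() with a character class. The second generalizes the manual session above by replacing every delimiter with the first one and then splitting once.

```python
import re

def split_on_any(s, delimiters):
    """Split s on any single character in delimiters (regex approach)."""
    # re.escape protects characters such as '.' that are special in a regex.
    pattern = "[" + re.escape(delimiters) + "]"
    return re.split(pattern, s)

def split_on_any_no_regex(s, delimiters):
    """Same result without regexes: map all delimiters to the first, split once."""
    first = delimiters[0]
    for ch in delimiters[1:]:
        s = s.replace(ch, first)
    return s.split(first)

print(split_on_any("abc.def;ghi", ".;"))           # ['abc', 'def', 'ghi']
print(split_on_any_no_regex("abc.def;ghi", ".;"))  # ['abc', 'def', 'ghi']
```

Like split() with an explicit separator, both versions return empty strings where two delimiters are adjacent.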

- Vasudev Ram - Dancing Bison Enterprises


Sunday, February 9, 2014

A simple text file indexing program in Python

By Vasudev Ram



Recently, something that I was working on made me think of creating a program to index text files, that is, to create an index file for a text file, something like the index of a book (*), in which, for words in the book, there is a list of page numbers where that word occurs. The difference here is that this program will create, for each word, a list of line numbers where the word occurs in the text file being processed.

(*) To be more specific, what I created was something like a back-of-the-book index, but for text files. I mention that because there are many types of index (Wikipedia), and not just for books. In fact, I was surprised to see the number of meanings or uses of the word index :-) Check the Wikipedia link in the previous sentence to see them. One type of index familiar to programmers, of course, is an array index (or list index, for Python).

Here is the program, called text_file_indexer.py, with a sample input, run and output shown below it. Comments in the code explain the key parts of the logic. Some improvements to the program are possible, of course. I may work on some of them over time. You can already customize the delimiter characters string that is used to remove those characters from around words.

"""
text_file_indexer.py
A program to index a text file.
Author: Vasudev Ram - www.dancingbison.com
Copyright 2014 Vasudev Ram
Given a text file somefile.txt, the program will read it completely, 
and while doing so, record the occurrences of each unique word, 
and the line numbers on which they occur. This information is 
then written to an index file somefile.idx, which is also a text 
file.
"""

import sys
import os
import string
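# debug1 is not a standard-library module; it must be importable
# (e.g. as debug1.py in the same directory) for this program to run.
from debug1 import debug1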

def index_text_file(txt_filename, idx_filename, 
    delimiter_chars=",.;:!?"):
    """
    Function to read txt_filename and create an index of the
    occurrences of words in it. The index is written to idx_filename.
    There is one index entry per line in the index file. An index entry 
    is of the form: word line_num line_num line_num ...
    where "word" is a word occurring in the text file, and the instances 
    of "line_num" are the line numbers on which that word occurs in the 
    text file. The lines in the index file are sorted by the leading word 
    on the line. The line numbers in an index entry are sorted in 
    ascending order. The argument delimiter_chars is a string of one or 
    more characters that may adjoin words in the input and are not
    wanted to be considered as part of the word. The function will remove 
    those delimiter characters from the edges of the words before the rest 
    of the processing.
    """
    try:
        txt_fil = open(txt_filename, "r")
        # Dictionary to hold words and the line numbers on which
        # they occur. Each key in the dictionary is a word and the
        # value corresponding to that key is a list of line numbers
        # on which that word occurs in txt_filename.

        word_occurrences = {}
        line_num = 0

        for lin in txt_fil:
            line_num += 1
            debug1("line_num", line_num)
            # Split the line into words delimited by whitespace.
            words = lin.split()
            debug1("words", words)
            # Remove unwanted delimiter characters adjoining words.
            words2 = [ word.strip(delimiter_chars) for word in words ]
            debug1("words2", words2)
            # Find and save the occurrences of each word in the line.
            for word in words2:
                if word in word_occurrences:
                    word_occurrences[word].append(line_num)
                else:
                    word_occurrences[word] = [ line_num ]

        debug1("Processed {} lines".format(line_num))

        if line_num < 1:
            print "No lines found in text file, no index file created."
            txt_fil.close()
            sys.exit(0)

        # Display results.
        word_keys = word_occurrences.keys()
        print "{} unique words found.".format(len(word_keys))
        debug1("Word_occurrences", word_occurrences)
        debug1("word_keys", word_keys)

        # Sort the words in the word_keys list.
        word_keys.sort()
        debug1("after sort, word_keys", word_keys)

        # Create the index file.
        idx_fil = open(idx_filename, "w")

        # Write the words and their line numbers to the index file.
        # Since we read the text file sequentially, there is no need 
        # to sort the line numbers associated with each word; they are 
        # already in sorted order.
        for word in word_keys:
            line_nums = word_occurrences[word]
            idx_fil.write(word + " ")
            for line_num in line_nums:
                idx_fil.write(str(line_num) + " ")
            idx_fil.write("\n")

        txt_fil.close()
        idx_fil.close()
    except IOError as ioe:
        sys.stderr.write("Caught IOError: " + repr(ioe) + "\n")
        sys.exit(1)
    except Exception as e:
        sys.stderr.write("Caught Exception: " + repr(e) + "\n")
        sys.exit(1)

def usage(sys_argv):
    sys.stderr.write("Usage: {} text_file.txt index_file.idx\n".format(
        sys_argv[0]))

def main():
    if len(sys.argv) != 3:
        usage(sys.argv)
        sys.exit(1)
    index_text_file(sys.argv[1], sys.argv[2])

if __name__ == "__main__":
    main()

# EOF
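
The program imports debug1 from a module that is not shown in this post, so the listing will not run without it. A minimal stand-in, assuming only what the call sites above imply (a label, optionally followed by a value), could be saved as debug1.py next to the program. The DEBUG flag, the helper name format_debug and the message format are illustrative assumptions, not the actual debug1 implementation.

```python
# debug1.py - hypothetical stand-in for the debug1 module imported by the program.
# The real module is not shown in the post; the behavior here is an assumption.
import sys

DEBUG = False  # set to True to see the debug messages on stderr

def format_debug(label, *values):
    """Build the message text: the label alone, or 'label: value ...'."""
    if not values:
        return str(label)
    return "{}: {}".format(label, " ".join(repr(v) for v in values))

def debug1(label, *values):
    """Write a debug message to stderr when DEBUG is on; otherwise do nothing."""
    if DEBUG:
        sys.stderr.write(format_debug(label, *values) + "\n")
```

With DEBUG left at False, the calls in the indexer become no-ops, so the index file and the "unique words found" message are the only output.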

Here is a sample input text file, file01.txt, that I tested the program with:

This file is a test of the text_file_indexer.py program.
The program indexes a text file.
The output of the program is another file called an index file.
The index file is like the index of a book.
For each word that occurs in the text file, there will be a line 
in the index file, starting with that word, and followed by all 
the line numbers in the text file on which that word occurs.

I ran the text file indexer program with the command:

python text_file_indexer.py file01.txt file01.idx

And here is the output of running the program on that text file, that is, the contents of the file file01.idx:

For 5 
The 2 3 4 
This 1 
a 1 2 4 5 
all 6 
an 3 
and 6 
another 3 
be 5 
book 4 
by 6 
called 3 
each 5 
file 1 2 3 3 4 5 6 7 
followed 6 
in 5 6 7 
index 3 4 4 6 
indexes 2 
is 1 3 4 
like 4 
line 5 7 
numbers 7 
occurs 5 7 
of 1 3 4 
on 7 
output 3 
program 1 2 3 
starting 6 
test 1 
text 2 5 7 
text_file_indexer.py 1 
that 5 6 7 
the 1 3 4 5 6 7 7 
there 5 
which 7 
will 5 
with 6 
word 5 6 7 
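
Some details of the output above follow directly from how the program works. A word that occurs twice on a line gets that line number twice (for example, "file 1 2 3 3 ..."). Capitalized words sort before lowercase ones. And text_file_indexer.py keeps its internal dots, because strip() only removes delimiter characters from the edges of a word. The sketch below reproduces the core logic on in-memory lines so these points can be checked. The function name index_lines is made up for this example and is not part of the program.

```python
def index_lines(lines, delimiter_chars=",.;:!?"):
    """Map each word to the list of 1-based line numbers on which it occurs."""
    occurrences = {}
    for line_num, line in enumerate(lines, 1):
        for word in line.split():
            # strip() removes the delimiter characters from the edges only.
            word = word.strip(delimiter_chars)
            occurrences.setdefault(word, []).append(line_num)
    return occurrences

lines = [
    "This file is a test of the text_file_indexer.py program.",
    "The program indexes a text file.",
    "The output of the program is another file called an index file.",
    "The index file is like the index of a book.",
    "For each word that occurs in the text file, there will be a line",
    "in the index file, starting with that word, and followed by all",
    "the line numbers in the text file on which that word occurs.",
]
index = index_lines(lines)
print(index["file"])       # [1, 2, 3, 3, 4, 5, 6, 7]
print(sorted(index)[:3])   # ['For', 'The', 'This']
```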

- Vasudev Ram - Python training and consulting
