jugad2 - Vasudev Ram on software innovation: Python-split-command

Saturday, March 26, 2016

A basic file compare utility in Python

Weighing dishes from the island of Thera, Minoan civilization, 2000–1500 BC

Image attribution: Norbert Nagel

Recently, I had written a simple version of the Unix split command in Python. In order to check whether a split command works correctly, you have to join the files it creates - the split files - back into a single file, and then compare that new file to the original input file. So you need a file comparison utility. While both Unix and Windows have such utilities (cmp and fc.exe respectively), I thought of writing a simple one in Python. I did that, and then tested it with a few pairs of input files.

Here is the code, in file_compare.py:

# file_compare.py
# A simple file comparison utility.
# Author: Vasudev Ram
# Copyright 2016 Vasudev Ram

import sys
import os
from os.path import exists, getsize

def out_write(msg):
    sys.stdout.write(msg)

def err_write(msg):
    sys.stderr.write(msg)

def usage():
    err_write("Usage: {} file_a file_b\n".format(sys.argv[0]))

def file_object_compare(in_fil_a, in_fil_b):
    '''Logic: Assume files are equal to start with.
    Read both files, character by character.
    Compare characters at corresponding byte offsets. 
    If any pair at the same offset don't match, the files 
    are unequal. If we reach the end of the files, and 
    there was no mismatch, the files are equal.  We do not
    check for one file being a strict subset of the other, 
    because we only enter this function if the files are 
    of the same size.'''

    files_are_equal = True
    pos = 0
    while True:
        ca = in_fil_a.read(1)
        if ca == '':
            break
        cb = in_fil_b.read(1)
        if cb == '':
            break
        if ca != cb:
            files_are_equal = False
            break
        pos += 1
        if pos % 10000 == 0:
            print pos, 

    if files_are_equal:
        return (True, None)
    else:
        return (False, "files differ at byte offset {}".format(pos))

def file_compare(in_filename_a, in_filename_b):
    '''Compare the files in_filename_a and in_filename_b.
    If their contents are the same, return (True, None).
    else return (False, "[reason]"), where [reason] 
    is the reason why they are different, as a string.
    Reasons could be: file sizes differ or file contents differ.'''

    if getsize(in_filename_a) != getsize(in_filename_b):
        return (False, "file sizes differ")
    else:
        in_fil_a = open(in_filename_a, "rb")
        in_fil_b = open(in_filename_b, "rb")
        result = file_object_compare(in_fil_a, in_fil_b)
        in_fil_a.close()
        in_fil_b.close()
        return result
        
def main():
    if len(sys.argv) != 3:
        usage()
        sys.exit(1)

    try:
        # Get the input filenames.
        in_filename_a, in_filename_b = sys.argv[1:3]
        # Check they exist.
        for in_filename in (in_filename_a, in_filename_b):
            if not exists(in_filename):
                err_write(
                    "Error: Input file '{}' not found.\n".format(in_filename))
                sys.exit(1)
        # Don't allow comparing a file with itself.
        if in_filename_a == in_filename_b:
            out_write("No sense comparing {} against itself.".format(in_filename_a))
            sys.exit(0)
        # Compare the files.
        result = file_compare(in_filename_a, in_filename_b)
        if result[0]:
            out_write("Files compare equal.")
        else:
            out_write("Files compare unequal: {}".format(result[1]))
        sys.exit(0)
    except IOError as ioe:
        sys.stderr.write("Caught IOError: {}\n".format(str(ioe)))
    except Exception as e:
        sys.stderr.write("Caught Exception: {}\n".format(str(e)))

if __name__ == '__main__':
    main()

And here are a few input files I ran it with (containing differences at progressive character positions), a few runs of the program, and the output of those runs:

$ type f0.txt
file 1

$ type f1.txt
file 1

$ type f2.txt
file 2

$ type f3.txt
file 3

$ type f4.txt
mile 1

$ type f5.txt
fale 1

$ type f6.txt
fire 1

$ python file_compare.py
Usage: file_compare.py file_a file_b

$ python file_compare.py a b
Error: Input file 'a' not found.

$ python file_compare.py f0.txt f1.txt
Files compare equal.

$ python file_compare.py f0.txt f2.txt
Files compare unequal: files differ at byte offset 5

$ python file_compare.py f1.txt f2.txt
Files compare unequal: files differ at byte offset 5

$ python file_compare.py f2.txt f2.txt
No sense comparing f2.txt against itself.

$ python file_compare.py f1.txt f3.txt
Files compare unequal: files differ at byte offset 5

$ python file_compare.py f1.txt f4.txt
Files compare unequal: files differ at byte offset 0

$ python file_compare.py f1.txt f5.txt
Files compare unequal: files differ at byte offset 1

$ python file_compare.py f1.txt f6.txt
Files compare unequal: files differ at byte offset 2

$ python file_compare.py f1.txt f7.txt
Error: Input file 'f7.txt' not found.

$ python file_compare.py f64MB f64MB2
Files compare equal.

Most of the files tested were small, but I also tested with some files of tens of lines, and the last pair of files tested was 64 MB each.

Note:

These two lines:

if pos % 10000 == 0:
            print pos,

are present in order to display a progress counter, for comparisons on large files. You can delete them if you don't want to monitor the progress of the comparison.

Currently I am using read(1), which means Python is reading the file character by character. There are potentially many levels of buffering that happen anyway, such as at the level of the C stdio library that underlies CPython's I/O, the OS, the hard disk controller, and even the CPU. But it may be possible to improve the performance of this program, by specifying some buffer when opening the files.

See:

Python open function
and
Default buffer size for a file

It's possible to write a much shorter version of this program, subject to certain limitations :-) Can you guess how? If you have an idea, mention it in the comments.

The image at the top is of weighing dishes from the island of Thera, Minoan civilization, 2000–1500 BC.

- Enjoy.

- Vasudev Ram - Online Python training and programming

Signup to hear about new products and services I create.

Posts about Python Posts about xtopdf

My ActiveState recipes

Share |

Tuesday, March 15, 2016

Unix split command in Python

By Vasudev Ram

Recently, there was an HN thread about the implementation (not just use) of text editors. Someone mentioned that some editors, including vim, have problems opening large files. Various people gave workarounds or solutions, including using vim and other ways.

I commented that you can use the Unix command bfs (for big file scanner), if you have it on your system, to open the file read-only and then move around in it, like you can in an editor.

I also said that the Unix commands split and csplit can be used to split a large file into smaller chunks, edit the chunks as needed, and then combine the chunks back into a single file using the cat commmand.

This made me think of writing, just for fun, a simple version [1] of the split command in Python. So I did that, and then tested it some [2]. Seems to be working okay so far.

[1] I have not implemented the full functionality of the POSIX split command, only a subset, for now. May enhance it with a few command-line options, or more functionality, later, e.g. with the ability to split binary files. I've also not implemented the default size of 1000 lines, or the ability to take input from standard input if no filename is specfied. (Both are easy.)

However, I am not sure whether the binary file splitting feature should be a part of split, or should be a separate command, considering the Unix philosophy of doing one thing and doing it well. Binary file splitting seems like it should be a separate task from text file splitting. Maybe it is a matter of opinion.

[2] I tested split.py with various valid and invalid values for the lines_per_file argument (such as -3, -2, -1, 0, 1, 2, 3, 10, 50, 100) on each of these input files:

in_file_0_lines.txt
in_file_1_line.txt
in_file_2_lines.txt
in_file_3_lines.txt
in_file_10_lines.txt
in_file_100_lines.txt

where the meaning of the filenames should be self-explanatory.

Of course, I also checked after each test run, that the output file(s) contained the right data.

(There may still be some bugs, of course. If you find any, I'd appreciate hearing about it.)

Here is the code for split.py:

import sys
import os

OUTFIL_PREFIX = "out_"

def make_out_filename(prefix, idx):
    '''Make a filename with a serial number suffix.'''
    return prefix + str(idx).zfill(4)

def split(in_filename, lines_per_file):
    '''Split the input file in_filename into output files of 
    lines_per_file lines each. Last file may have less lines.'''
    in_fil = open(in_filename, "r")
    outfil_idx = 1
    out_filename = make_out_filename(OUTFIL_PREFIX, outfil_idx)
    out_fil = open(out_filename, "w")
    # Using chain assignment feature of Python.
    line_count = tot_line_count = file_count = 0
    # Loop over the input and split it into multiple files.
    # A text file is an iterable sequence, from Python 2.2,
    # so the for line below works.
    for lin in in_fil:
        # Bump vars; change to next output file.
        if line_count >= lines_per_file:
            tot_line_count += line_count
            line_count = 0
            file_count += 1
            out_fil.close()
            outfil_idx += 1
            out_filename = make_out_filename(OUTFIL_PREFIX, outfil_idx)
            out_fil = open(out_filename, "w")
        line_count += 1
        out_fil.write(lin)
    in_fil.close()
    out_fil.close()
    sys.stderr.write("Output is in file(s) with prefix {}\n".format(OUTFIL_PREFIX))
        
def usage():
    sys.stderr.write(
    "Usage: {} in_filename lines_per_file\n".format(sys.argv[0]))

def main():

    if len(sys.argv) != 3:
        usage()
        sys.exit(1)

    try:
        # Get and validate in_filename.
        in_filename = sys.argv[1]
        # If input file does not exist, exit.
        if not os.path.exists(in_filename):
            sys.stderr.write("Error: Input file '{}' not found.\n".format(in_filename))
            sys.exit(1)
        # If input is empty, exit.
        if os.path.getsize(in_filename) == 0:
            sys.stderr.write("Error: Input file '{}' has no data.\n".format(in_filename))
            sys.exit(1)
        # Get and validate lines_per_file.
        lines_per_file = int(sys.argv[2])
        if lines_per_file <= 0:
            sys.stderr.write("Error: lines_per_file cannot be less than or equal to 0.\n")
            sys.exit(1)
        # If all checks pass, split the file.
        split(in_filename, lines_per_file) 
    except ValueError as ve:
        sys.stderr.write("Caught ValueError: {}\n".format(repr(ve)))
    except IOError as ioe:
        sys.stderr.write("Caught IOError: {}\n".format(repr(ioe)))
    except Exception as e:
        sys.stderr.write("Caught Exception: {}\n".format(repr(e)))
        raise

if __name__ == '__main__':
    main()

You can run split.py like this:

$ python split.py
Usage: split.py in_filename lines_per_file

which will give you the usage help. And like this to actually split text files, in this case, a 100-line text file into 10 files of 10 lines each:

$ python split.py in_file_100_lines.txt 10
Output is in file(s) with prefix out_

Here are a couple of runs with invalid values for either the input file or the lines_per_file argument:

$ python split.py in_file_100_lines.txt 0
Error: lines_per_file cannot be less than or equal to 0.

$ python split.py not-there.txt 0
Error: Input file 'not-there.txt' not found.

As an aside, thinking about whether to use 0 or 1 as initial value for some of the _count variables in the program, made me remember this topic:

Why programmers count from 0

See the first few hits for some good answers.

And finally, speaking of zero, check out this earlier post by me:

Bhaskaracharya and the man who found zero

- Enjoy.

- Vasudev Ram - Online Python training and programming

Signup to hear about new products and services I create.

Posts about Python Posts about xtopdf

My ActiveState recipes

Share |