Saturday, March 26, 2016

A basic file compare utility in Python

By Vasudev Ram


Weighing dishes from the island of Thera, Minoan civilization, 2000–1500 BC

Image attribution: Norbert Nagel

I recently wrote a simple version of the Unix split command in Python. To check that a split command works correctly, you have to join the files it creates (the split files) back into a single file, and then compare that new file to the original input file. So you need a file comparison utility. While both Unix and Windows have such utilities (cmp and fc.exe respectively), I thought of writing a simple one in Python. I did that, and then tested it with a few pairs of input files.

Here is the code, in file_compare.py:
# file_compare.py
# A simple file comparison utility.
# Author: Vasudev Ram
# Copyright 2016 Vasudev Ram

import sys
from os.path import exists, getsize

def out_write(msg):
    sys.stdout.write(msg)

def err_write(msg):
    sys.stderr.write(msg)

def usage():
    err_write("Usage: {} file_a file_b\n".format(sys.argv[0]))

def file_object_compare(in_fil_a, in_fil_b):
    '''Logic: Assume files are equal to start with.
    Read both files, byte by byte.
    Compare bytes at corresponding byte offsets.
    If any pair at the same offset doesn't match, the files
    are unequal. If we reach the end of the files, and
    there was no mismatch, the files are equal.  We do not
    check for one file being a proper prefix of the other,
    because we only enter this function if the files are
    of the same size.'''

    files_are_equal = True
    pos = 0
    while True:
        ca = in_fil_a.read(1)
        if ca == '':
            break
        cb = in_fil_b.read(1)
        if cb == '':
            break
        if ca != cb:
            files_are_equal = False
            break
        pos += 1
        if pos % 10000 == 0:
            print pos, 

    if files_are_equal:
        return (True, None)
    else:
        return (False, "files differ at byte offset {}".format(pos))

def file_compare(in_filename_a, in_filename_b):
    '''Compare the files in_filename_a and in_filename_b.
    If their contents are the same, return (True, None).
    else return (False, "[reason]"), where [reason] 
    is the reason why they are different, as a string.
    Reasons could be: file sizes differ or file contents differ.'''

    if getsize(in_filename_a) != getsize(in_filename_b):
        return (False, "file sizes differ")
    else:
        in_fil_a = open(in_filename_a, "rb")
        in_fil_b = open(in_filename_b, "rb")
        result = file_object_compare(in_fil_a, in_fil_b)
        in_fil_a.close()
        in_fil_b.close()
        return result
        
def main():
    if len(sys.argv) != 3:
        usage()
        sys.exit(1)

    try:
        # Get the input filenames.
        in_filename_a, in_filename_b = sys.argv[1:3]
        # Check they exist.
        for in_filename in (in_filename_a, in_filename_b):
            if not exists(in_filename):
                err_write(
                    "Error: Input file '{}' not found.\n".format(in_filename))
                sys.exit(1)
        # Don't allow comparing a file with itself.
        if in_filename_a == in_filename_b:
            out_write("No sense comparing {} against itself.".format(in_filename_a))
            sys.exit(0)
        # Compare the files.
        result = file_compare(in_filename_a, in_filename_b)
        if result[0]:
            out_write("Files compare equal.")
        else:
            out_write("Files compare unequal: {}".format(result[1]))
        sys.exit(0)
    except IOError as ioe:
        err_write("Caught IOError: {}\n".format(str(ioe)))
        sys.exit(1)
    except Exception as e:
        err_write("Caught Exception: {}\n".format(str(e)))
        sys.exit(1)

if __name__ == '__main__':
    main()
And here are a few input files I ran it with (containing differences at progressive character positions), a few runs of the program, and the output of those runs:
$ type f0.txt
file 1

$ type f1.txt
file 1

$ type f2.txt
file 2

$ type f3.txt
file 3

$ type f4.txt
mile 1

$ type f5.txt
fale 1

$ type f6.txt
fire 1

$ python file_compare.py
Usage: file_compare.py file_a file_b

$ python file_compare.py a b
Error: Input file 'a' not found.

$ python file_compare.py f0.txt f1.txt
Files compare equal.

$ python file_compare.py f0.txt f2.txt
Files compare unequal: files differ at byte offset 5

$ python file_compare.py f1.txt f2.txt
Files compare unequal: files differ at byte offset 5

$ python file_compare.py f2.txt f2.txt
No sense comparing f2.txt against itself.

$ python file_compare.py f1.txt f3.txt
Files compare unequal: files differ at byte offset 5

$ python file_compare.py f1.txt f4.txt
Files compare unequal: files differ at byte offset 0

$ python file_compare.py f1.txt f5.txt
Files compare unequal: files differ at byte offset 1

$ python file_compare.py f1.txt f6.txt
Files compare unequal: files differ at byte offset 2

$ python file_compare.py f1.txt f7.txt
Error: Input file 'f7.txt' not found.

$ python file_compare.py f64MB f64MB2
Files compare equal.
Most of the files tested were small, but I also tested with some files of tens of lines, and the last pair of files tested was 64 MB each.
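
The small test files above can also be created and checked from a script instead of by hand. Here is a minimal sketch of that idea, written for Python 3 (the listing above targets Python 2). The names `FIXTURES`, `make_fixtures`, `first_diff_offset` and `run_checks` are made up for this illustration, and the file contents assume that each test file ends with a single newline.

```python
import os
import tempfile

# Contents mirror the f0.txt .. f6.txt files shown above
# (assumption: each file ends with a single newline).
FIXTURES = {
    "f0.txt": b"file 1\n",
    "f1.txt": b"file 1\n",
    "f2.txt": b"file 2\n",
    "f3.txt": b"file 3\n",
    "f4.txt": b"mile 1\n",
    "f5.txt": b"fale 1\n",
    "f6.txt": b"fire 1\n",
}

def make_fixtures(directory):
    """Write the fixture files into directory and return {name: path}."""
    paths = {}
    for name, data in FIXTURES.items():
        path = os.path.join(directory, name)
        with open(path, "wb") as f:
            f.write(data)
        paths[name] = path
    return paths

def first_diff_offset(path_a, path_b):
    """Return the first byte offset at which two equal-sized files
    differ, or None if no byte differs."""
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        data_a, data_b = fa.read(), fb.read()
    for pos, (ba, bb) in enumerate(zip(data_a, data_b)):
        if ba != bb:
            return pos
    return None

def run_checks():
    """Compare f1.txt against some of the other fixtures.
    Returns {other_name: offset_or_None}."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        paths = make_fixtures(tmp_dir)
        return {
            other: first_diff_offset(paths["f1.txt"], paths[other])
            for other in ("f0.txt", "f2.txt", "f4.txt", "f5.txt", "f6.txt")
        }

if __name__ == "__main__":
    for name, offset in sorted(run_checks().items()):
        print("f1.txt vs {}: {}".format(name, offset))
```

The offsets it reports for f2.txt, f4.txt, f5.txt and f6.txt should match the ones in the runs above (5, 0, 1 and 2), which gives a quick cross-check of file_compare.py.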

Note:

These two lines:
        if pos % 10000 == 0:
            print pos,
are present in order to display a progress counter, for comparisons on large files. You can delete them if you don't want to monitor the progress of the comparison.

Currently I am using read(1), which means Python reads each file one byte at a time (the files are opened in binary mode). Many levels of buffering happen anyway: in the C stdio library that underlies CPython's I/O, in the OS, in the hard disk controller, and even in the CPU. But it may be possible to improve the performance of this program by specifying a buffer size when opening the files, or by reading more than one byte per call.

See:

Python open function
and
Default buffer size for a file
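
To make the read(1) point concrete, here is a rough sketch of a chunked variant of file_object_compare, written for Python 3. It is not a measured improvement, only an illustration of reading larger blocks per call. The names `chunked_compare`, `first_mismatch` and `CHUNK_SIZE` are hypothetical, the 64 KiB value is an untuned guess, and the code assumes regular files (or in-memory streams), where read(n) returns fewer than n bytes only at end of file.

```python
import io

CHUNK_SIZE = 64 * 1024  # illustrative value, not tuned

def first_mismatch(chunk_a, chunk_b):
    """Return the index of the first differing byte in the part the two
    chunks have in common, or None if that part is identical."""
    for i, (ba, bb) in enumerate(zip(chunk_a, chunk_b)):
        if ba != bb:
            return i
    return None

def chunked_compare(in_fil_a, in_fil_b, chunk_size=CHUNK_SIZE):
    """Compare two binary file objects chunk by chunk.
    Returns (True, None) or (False, reason), like file_object_compare."""
    pos = 0
    while True:
        chunk_a = in_fil_a.read(chunk_size)
        chunk_b = in_fil_b.read(chunk_size)
        if not chunk_a and not chunk_b:
            # Both files ended with no mismatch seen.
            return (True, None)
        if chunk_a != chunk_b:
            i = first_mismatch(chunk_a, chunk_b)
            if i is None:
                # One chunk is a prefix of the other: one file is shorter.
                i = min(len(chunk_a), len(chunk_b))
            return (False, "files differ at byte offset {}".format(pos + i))
        pos += len(chunk_a)

# Usage with in-memory streams standing in for files opened in "rb" mode.
demo = chunked_compare(io.BytesIO(b"file 1\n"), io.BytesIO(b"file 2\n"))
# demo == (False, "files differ at byte offset 5")
```

The byte-by-byte scan only runs inside the one chunk that differs, so for equal files the work is one bytes comparison per chunk.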

It's possible to write a much shorter version of this program, subject to certain limitations :-) Can you guess how? If you have an idea, mention it in the comments.
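
One possible answer (a guess, not necessarily the one intended here) is to read both files fully and compare the resulting bytes. The sketch below is written for Python 3, and `files_equal` is a made-up name. The limitations are that both files must fit in memory at once, and that you no longer learn the offset of the first difference.

```python
def files_equal(path_a, path_b):
    """Return True if the two files have identical contents.
    Reads each file fully into memory, so it is unsuitable for
    very large files."""
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        return fa.read() == fb.read()
```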

The image at the top is of weighing dishes from the island of Thera, Minoan civilization, 2000–1500 BC.

- Enjoy.

- Vasudev Ram - Online Python training and programming


4 comments:

newbery said...

How about `filecmp` in the std library?

filecmp.cmp(path1, path2, shallow=False)

Vasudev Ram said...

@newbery: Thanks for your comment. Good point.

The purpose of the post, though, was to show how basic file comparison could be done in code, without using a library that already does it for you. And to use a simple method, not necessarily the best algorithm (at least to start with). (The filecmp module does stuff like caching and doing a stat on the files being compared, if the shallow flag is 1 or True.) Also for learning for myself (and any readers who may not know it); I often end up learning something new (some nuance or the other, sometimes more fundamental things) even when writing small programs, which is, in a way, a good thing :) And I often learn things from the comments too ...

newbery said...

You're welcome :)

For what it's worth, the implementation in filecmp is probably worth studying for instructive purposes.

The caching may seem like an unnecessary feature for most use cases. However, imagine this being called repeatedly on a single file, watching for changes. The caching in the non-shallow case would make this much faster.

In the non-shallow case, the file stat is serving double duty. Returning some keys needed for the cache and the file size for the quick compare test (which is what `getsize` does under the hood anyway).

The `with` context manager construction for opening files is also generally considered better practice (for very good reasons) than explicitly opening and closing file handles.

Perhaps a more significant quibble is that it seems we would need to monkeypatch filecmp if we wanted to change the file read buffer size. Maybe would have been marginally nicer if it offered an optional `buffer` keyword argument.

Vasudev Ram said...

>filecmp is probably worth studying for instructive purposes.

Yes, likely, and I'll look at it, thanks.

>The caching may seem like an unnecessary feature for most use cases. However, imagine this being called repeatedly on a single file, watching for changes. The caching in the non-shallow case would make this much faster.

True. And the larger the files being compared, the greater the speed difference would be.

>In the non-shallow case, the file stat is serving double duty. Returning some keys needed for the cache and the file size for the quick compare test (which is what `getsize` does under the hood anyway).

True. I was aware that getsize calls stat to get the info from the inode (Unix) or equivalent Windows directory entry structure. Just meant that the stat info was also put in the cache. But it is for a good reason.

>The `with` context manager construction for opening files is also generally considered better practice (for very good reasons) than explicitly opening and closing file handles.

I do know what the `with` context manager does, and have used it before for files as well, in my xtopdf toolkit: http://jugad2.blogspot.in/2013/12/xtopdf-pdfwriter-now-has-context.html .

But I have a doubt about its benefit in some cases. Will write some test code to check things out, and then comment here again if that doubt persists.

>Perhaps a more significant quibble is that it seems we would need to monkeypatch filecmp if we wanted to change the file read buffer size. Maybe would have been marginally nicer if it offered an optional `buffer` keyword argument.

True. As a personal preference, I'm not much in favor of monkey patching, though, since it leads to code with surprises, and harder to understand, which is kind of against the Zen of Python. The Ruby people do a lot of that, I know. I used to do Ruby earlier.

Thanks again.