
Thursday, April 7, 2016

bsplit - binary file split utility in Python

By Vasudev Ram

A few days ago I wrote a post about a Unix-like file split utility that I wrote in Python:

Unix split command in Python

I mentioned that unlike the Unix split, I had written mine to only work on text files, because it might be preferable to do it that way (the "do one thing well" idea). I had also said I could write the binary file split as a separate tool. Here it is - bsplit.py:

import sys
import os

OUTFIL_PREFIX = "out_"

def error_exit(message, code=1):
    sys.stderr.write("Error:\n{}".format(str(message)))
    sys.exit(code)

def err_write(message):
    sys.stderr.write(message)

def make_out_filename(prefix, idx):
    '''Make a filename with a serial number suffix.'''
    return prefix + str(idx).zfill(4)

def bsplit(in_filename, bytes_per_file):
    '''Split the input file in_filename into output files of
    bytes_per_file bytes each. Last file may have fewer bytes.'''
    in_fil = open(in_filename, "rb")
    outfil_idx = 1
    out_filename = make_out_filename(OUTFIL_PREFIX, outfil_idx)
    out_fil = open(out_filename, "wb")
    byte_count = tot_byte_count = file_count = 0
    c = in_fil.read(1)
    # Loop over the input and split it into multiple files
    # of bytes_per_file bytes each (except possibly for the
    # last file, which may have fewer bytes).
    # read(1) returns an empty, falsy value at end of file, both as
    # a str (Python 2) and as bytes (Python 3), so test truthiness.
    while c:
        byte_count += 1
        out_fil.write(c)
        # Bump vars; change to next output file.
        if byte_count >= bytes_per_file:
            tot_byte_count += byte_count
            byte_count = 0
            file_count += 1
            out_fil.close()
            outfil_idx += 1
            out_filename = make_out_filename(OUTFIL_PREFIX, outfil_idx)
            out_fil = open(out_filename, "wb")
        c = in_fil.read(1)
    # Clean up.
    in_fil.close()
    if not out_fil.closed:
        out_fil.close()
    # Remove the last output file if nothing was written to it.
    if byte_count == 0:
        os.remove(out_filename)

def usage():
    err_write(
    "Usage: [ python ] {} in_filename bytes_per_file\n".format(
        sys.argv[0]))
    err_write(
    "splits in_filename into files with bytes_per_file bytes\n")

def main():

    if len(sys.argv) != 3:
        usage()
        sys.exit(1)

    try:
        # Do some checks on arguments.
        in_filename = sys.argv[1]
        if not os.path.exists(in_filename):
            error_exit(
                "Input file '{}' not found.\n".format(in_filename))
        if os.path.getsize(in_filename) == 0:
            error_exit(
                "Input file '{}' has no data.\n".format(in_filename))
        bytes_per_file = int(sys.argv[2])
        if bytes_per_file <= 0:
            error_exit(
                "bytes_per_file cannot be less than or equal to 0.\n")
        # If all checks pass, split the file.
        bsplit(in_filename, bytes_per_file)
    except ValueError as ve:
        error_exit(str(ve))
    except IOError as ioe:
        error_exit(str(ioe))
    except Exception as e:
        error_exit(str(e))

if __name__ == '__main__':
    main()

The program takes two command line arguments:

- the name of an input file to split
- the number of bytes per file, into which to split the input file
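bsplit.py reads its input one byte at a time, which is simple but may be slow on large files. Here is a minimal sketch of a block-reading variant, written for Python 3. It is not the code from this post: the function name split_in_blocks, the prefix and block_size parameters, and the 64 KB default are all assumptions made for illustration.

```python
def split_in_blocks(in_filename, bytes_per_file, prefix="out_",
                    block_size=64 * 1024):
    """Split in_filename into files of bytes_per_file bytes each,
    reading up to block_size bytes at a time.
    Returns the list of output filenames (hypothetical helper)."""
    out_filenames = []
    with open(in_filename, "rb") as in_fil:
        idx = 1
        block = in_fil.read(min(block_size, bytes_per_file))
        while block:
            name = prefix + str(idx).zfill(4)
            out_filenames.append(name)
            remaining = bytes_per_file
            with open(name, "wb") as out_fil:
                while block:
                    out_fil.write(block)
                    remaining -= len(block)
                    if remaining == 0:
                        break
                    # Never read past the end of the current piece.
                    block = in_fil.read(min(block_size, remaining))
            # Read ahead, so that no empty trailing file is created.
            block = in_fil.read(min(block_size, bytes_per_file))
            idx += 1
    return out_filenames
```

Because an output file is opened only after some data has been read for it, this version needs no cleanup step for an empty last file.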

I tested bsplit with various combinations of test input files and bytes_per_file values. It worked as expected. But if you find any issues, I'd be interested to know - please leave a comment.
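One way to automate such a check is to compare the sizes of the output files with what the arithmetic predicts, and to confirm that the pieces, joined in order, reproduce the input. The sketch below (Python 3) assumes the out_ prefix and 4-digit suffix used by bsplit.py; the helper names expected_sizes and check_split are hypothetical, and it reads whole files into memory, so it is meant for small test inputs only.

```python
import glob
import os

def expected_sizes(total_bytes, bytes_per_file):
    """Sizes of the pieces that a correct split should produce."""
    full, rest = divmod(total_bytes, bytes_per_file)
    return [bytes_per_file] * full + ([rest] if rest else [])

def check_split(in_filename, bytes_per_file, prefix="out_"):
    """Return True if the files named prefix + NNNN match in_filename."""
    # Zero-padded suffixes sort correctly as strings (up to 9999 files).
    pattern = glob.escape(prefix) + "[0-9][0-9][0-9][0-9]"
    pieces = sorted(glob.glob(pattern))
    sizes = [os.path.getsize(p) for p in pieces]
    if sizes != expected_sizes(os.path.getsize(in_filename), bytes_per_file):
        return False
    with open(in_filename, "rb") as f:
        original = f.read()
    joined = b""
    for p in pieces:
        with open(p, "rb") as f:
            joined += f.read()
    return joined == original
```

For a 10-byte input and bytes_per_file of 4, expected_sizes gives two full pieces and one 2-byte piece. Stale out_ files from an earlier run would make the check fail, so the output directory should be cleaned between runs.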

Some other recent posts related to the split / bsplit utilities:

A basic file compare utility in Python

Python one-liner to compare two files (conditions apply)

- Enjoy.

- Vasudev Ram - Online Python training and programming

Signup to hear about new products and services I create.

Posts about Python  Posts about xtopdf

My ActiveState recipes

Tuesday, March 15, 2016

Unix split command in Python

By Vasudev Ram

Recently, there was an HN thread about the implementation (not just use) of text editors. Someone mentioned that some editors, including vim, have problems opening large files. Various people gave workarounds or solutions, some using vim itself and some using other tools.

I commented that you can use the Unix command bfs (for big file scanner), if you have it on your system, to open the file read-only and then move around in it, like you can in an editor.

I also said that the Unix commands split and csplit can be used to split a large file into smaller chunks, edit the chunks as needed, and then combine the chunks back into a single file using the cat command.
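The recombining step can be done from Python too. The sketch below does roughly what cat does when given a list of chunk files and an output redirect. It assumes that the chunk names sort into their original order (true for the xaa, xab, ... names that Unix split produces by default); the function name join_files is made up for this example.

```python
import glob
import shutil

def join_files(pattern, out_filename):
    """Concatenate the files matching pattern, in sorted name order,
    into out_filename. Returns the list of files that were joined."""
    # Take the list before creating the output file, so that the
    # output is never picked up as one of its own inputs.
    names = sorted(glob.glob(pattern))
    with open(out_filename, "wb") as out_fil:
        for name in names:
            with open(name, "rb") as in_fil:
                shutil.copyfileobj(in_fil, out_fil)
    return names
```

For example, join_files("xa?", "combined.txt") would join xaa, xab and so on into combined.txt. Opening the files in binary mode means the bytes are copied unchanged, whatever the line endings are.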

This made me think of writing, just for fun, a simple version [1] of the split command in Python. So I did that, and then tested it some [2]. Seems to be working okay so far.

[1] I have not implemented the full functionality of the POSIX split command, only a subset, for now. May enhance it with a few command-line options, or more functionality, later, e.g. with the ability to split binary files. I've also not implemented the default size of 1000 lines, or the ability to take input from standard input if no filename is specified. (Both are easy.)
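As a rough idea of how those two missing features might look, here is a sketch using argparse. It is not part of split.py: the names parse_args and open_input are invented, and the -l option and the use of "-" for standard input follow the POSIX split conventions rather than anything in this post.

```python
import argparse
import sys

def parse_args(argv):
    """Parse split-like arguments (hypothetical interface)."""
    parser = argparse.ArgumentParser(
        description="Split a text file into pieces.")
    parser.add_argument("in_filename", nargs="?", default="-",
        help="input file; '-' or omitted means standard input")
    parser.add_argument("-l", "--lines", type=int, default=1000,
        help="lines per output file (default: 1000)")
    return parser.parse_args(argv)

def open_input(in_filename):
    """Return an open text stream: stdin for '-', else the named file."""
    if in_filename == "-":
        return sys.stdin
    return open(in_filename, "r")
```

With this, parse_args([]) gives 1000 lines per file and standard input, while parse_args(["-l", "10", "big.txt"]) overrides both. The existence and size checks on the input file would have to be skipped when reading from standard input.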

However, I am not sure whether the binary file splitting feature should be a part of split, or should be a separate command, considering the Unix philosophy of doing one thing and doing it well. Binary file splitting seems like it should be a separate task from text file splitting. Maybe it is a matter of opinion.

[2] I tested split.py with various valid and invalid values for the lines_per_file argument (such as -3, -2, -1, 0, 1, 2, 3, 10, 50, 100) on each of these input files:

in_file_0_lines.txt
in_file_1_line.txt
in_file_2_lines.txt
in_file_3_lines.txt
in_file_10_lines.txt
in_file_100_lines.txt

where the meaning of the filenames should be self-explanatory.

Of course, I also checked after each test run, that the output file(s) contained the right data.

(There may still be some bugs, of course. If you find any, I'd appreciate hearing about it.)

Here is the code for split.py:

import sys
import os

OUTFIL_PREFIX = "out_"

def make_out_filename(prefix, idx):
    '''Make a filename with a serial number suffix.'''
    return prefix + str(idx).zfill(4)

def split(in_filename, lines_per_file):
    '''Split the input file in_filename into output files of
    lines_per_file lines each. Last file may have fewer lines.'''
    in_fil = open(in_filename, "r")
    outfil_idx = 1
    out_filename = make_out_filename(OUTFIL_PREFIX, outfil_idx)
    out_fil = open(out_filename, "w")
    # Using chain assignment feature of Python.
    line_count = tot_line_count = file_count = 0
    # Loop over the input and split it into multiple files.
    # A text file is an iterable sequence, from Python 2.2,
    # so the for line below works.
    for lin in in_fil:
        # Bump vars; change to next output file.
        if line_count >= lines_per_file:
            tot_line_count += line_count
            line_count = 0
            file_count += 1
            out_fil.close()
            outfil_idx += 1
            out_filename = make_out_filename(OUTFIL_PREFIX, outfil_idx)
            out_fil = open(out_filename, "w")
        line_count += 1
        out_fil.write(lin)
    in_fil.close()
    out_fil.close()
    sys.stderr.write("Output is in file(s) with prefix {}\n".format(OUTFIL_PREFIX))
        
def usage():
    sys.stderr.write(
    "Usage: {} in_filename lines_per_file\n".format(sys.argv[0]))

def main():

    if len(sys.argv) != 3:
        usage()
        sys.exit(1)

    try:
        # Get and validate in_filename.
        in_filename = sys.argv[1]
        # If input file does not exist, exit.
        if not os.path.exists(in_filename):
            sys.stderr.write("Error: Input file '{}' not found.\n".format(in_filename))
            sys.exit(1)
        # If input is empty, exit.
        if os.path.getsize(in_filename) == 0:
            sys.stderr.write("Error: Input file '{}' has no data.\n".format(in_filename))
            sys.exit(1)
        # Get and validate lines_per_file.
        lines_per_file = int(sys.argv[2])
        if lines_per_file <= 0:
            sys.stderr.write("Error: lines_per_file cannot be less than or equal to 0.\n")
            sys.exit(1)
        # If all checks pass, split the file.
        split(in_filename, lines_per_file) 
    except ValueError as ve:
        sys.stderr.write("Caught ValueError: {}\n".format(repr(ve)))
    except IOError as ioe:
        sys.stderr.write("Caught IOError: {}\n".format(repr(ioe)))
    except Exception as e:
        sys.stderr.write("Caught Exception: {}\n".format(repr(e)))
        raise

if __name__ == '__main__':
    main()

You can run split.py like this:

$ python split.py
Usage: split.py in_filename lines_per_file

which will give you the usage help. And like this to actually split text files, in this case, a 100-line text file into 10 files of 10 lines each:

$ python split.py in_file_100_lines.txt 10
Output is in file(s) with prefix out_

Here are a couple of runs with invalid values for either the input file or the lines_per_file argument:

$ python split.py in_file_100_lines.txt 0
Error: lines_per_file cannot be less than or equal to 0.

$ python split.py not-there.txt 0
Error: Input file 'not-there.txt' not found.

As an aside, thinking about whether to use 0 or 1 as the initial value for some of the _count variables in the program made me remember this topic:

Why programmers count from 0

See the first few hits for some good answers.

And finally, speaking of zero, check out this earlier post by me:

Bhaskaracharya and the man who found zero

- Enjoy.

- Vasudev Ram - Online Python training and programming


Thursday, May 17, 2012

Glenford Myers, software testing guru, is worth reading even now


This blog post was a result of the frustration I have been feeling due to Android apps on my phone crashing, many times. Also see some of my recent tweets on the same.

Wonder when (if ever) people in general and software developers in particular are going to realize the importance of quality and that Quality Is Free (google the book of that name).

http://en.m.wikipedia.org/wiki/Glenford_Myers

I have used the principles of his book The Art of Software Testing to good advantage in projects I've led and worked on.

The self-test(ing:) quiz at the start of the 1st edition is mind-opening.

The concept of equivalence classes in testing is another good one.
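As a small illustration of the idea (my own sketch, not an example from the book): the inputs to a function are partitioned into classes that the code should treat alike, and one or two representatives of each class are tested, including values at the boundaries. The function parse_count below is a made-up function under test.

```python
def parse_count(text):
    """Hypothetical function under test: parse a positive integer."""
    value = int(text)  # raises ValueError for non-numeric text
    if value <= 0:
        raise ValueError("count must be positive")
    return value

# A few representatives per equivalence class, including boundaries.
EQUIVALENCE_CLASSES = {
    "non-numeric": ["abc", ""],
    "zero or negative": ["-5", "0"],
    "positive": ["1", "50"],
}

def run_cases():
    """Run every representative; record the result or 'rejected'."""
    results = {}
    for name, samples in EQUIVALENCE_CLASSES.items():
        outcomes = []
        for sample in samples:
            try:
                outcomes.append(parse_count(sample))
            except ValueError:
                outcomes.append("rejected")
        results[name] = outcomes
    return results
```

The point is that six inputs stand in for the infinitely many possible ones, because every member of a class is expected to behave like its representatives.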

Nowadays there doesn't even seem to be any discussion of software skills improvement topics in the current (popular) literature. Wondering if it is the economy or something else which is responsible for that ...

- Vasudev Ram
www.dancingbison.com