Thursday, April 7, 2016

bsplit - binary file split utility in Python

By Vasudev Ram

A few days ago I wrote a post about a Unix-like file split utility in Python:

Unix split command in Python

I mentioned that, unlike the Unix split command, I wrote mine to work only on text files, since it might be preferable to do it that way (the "do one thing well" idea). I also said I could write the binary file split as a separate tool. Here it is - bsplit.py:

import sys

import os

OUTFIL_PREFIX = "out_"

def error_exit(message, code=1):
    sys.stderr.write("Error:\n{}".format(str(message)))
    sys.exit(code)

def err_write(message):
    sys.stderr.write(message)

def make_out_filename(prefix, idx):
    '''Make a filename with a serial number suffix.'''
    return prefix + str(idx).zfill(4)

def bsplit(in_filename, bytes_per_file):
    '''Split the input file in_filename into output files of
    bytes_per_file bytes each. Last file may have less bytes.'''
    in_fil = open(in_filename, "rb")
    outfil_idx = 1
    out_filename = make_out_filename(OUTFIL_PREFIX, outfil_idx)
    out_fil = open(out_filename, "wb")
    byte_count = tot_byte_count = file_count = 0
    c = in_fil.read(1)
    # Loop over the input and split it into multiple files
    # of bytes_per_file bytes each (except possibly for the
    # last file, which may have less bytes).
    while c != '':
        byte_count += 1
        out_fil.write(c)
        # Bump vars; change to next output file.
        if byte_count >= bytes_per_file:
            tot_byte_count += byte_count
            byte_count = 0
            file_count += 1
            out_fil.close()
            outfil_idx += 1
            out_filename = make_out_filename(OUTFIL_PREFIX, outfil_idx)
            out_fil = open(out_filename, "wb")
        c = in_fil.read(1)
    # Clean up.
    in_fil.close()
    if not out_fil.closed:
        out_fil.close()
    if byte_count == 0:
        os.remove(out_filename)

def usage():
    err_write(
        "Usage: [ python ] {} in_filename bytes_per_file\n".format(
            sys.argv[0]))
    err_write(
        "splits in_filename into files with bytes_per_file bytes\n".format(
            sys.argv[0]))

def main():
    if len(sys.argv) != 3:
        usage()
        sys.exit(1)
    try:
        # Do some checks on arguments.
        in_filename = sys.argv[1]
        if not os.path.exists(in_filename):
            error_exit(
                "Input file '{}' not found.\n".format(in_filename))
        if os.path.getsize(in_filename) == 0:
            error_exit(
                "Input file '{}' has no data.\n".format(in_filename))
        bytes_per_file = int(sys.argv[2])
        if bytes_per_file <= 0:
            error_exit(
                "bytes_per_file cannot be less than or equal to 0.\n")
        # If all checks pass, split the file.
        bsplit(in_filename, bytes_per_file)
    except ValueError as ve:
        error_exit(str(ve))
    except IOError as ioe:
        error_exit(str(ioe))
    except Exception as e:
        error_exit(str(e))

if __name__ == '__main__':
    main()

The program takes two command line arguments:

- the name of the input file to split
- the number of bytes per file, into which to split the input file
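For example, on a hypothetical 10,000-byte input file named data.bin:

    python bsplit.py data.bin 3000

this would produce out_0001, out_0002 and out_0003 with 3,000 bytes each, and out_0004 with the remaining 1,000 bytes.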

I tested bsplit with various combinations of test input files and bytes_per_file values. It worked as expected. But if you find any issues, I'd be interested to know - please leave a comment.

Some other recent posts related to the split / bsplit utilities:

A basic file compare utility in Python

Python one-liner to compare two files (conditions apply)

- Enjoy.

- Vasudev Ram - Online Python training and programming


7 comments:

Vasudev Ram said...

BTW, if you saw the line formatting changing in the first few minutes after this was published, or saw lines a bit messed up - such as lines that should be separate joined together - I'm sorry for the inconvenience. It was not due to me; there seems to be some issue with Blogger (it has also happened to me a few times in the past). It is sometimes joining lines together. I tried adding br or p tags (which should not be needed, really - just hitting Enter should work), but surprisingly it sometimes even removes them. Irritating. Not sure whether Google is doing some maintenance on the software, or what. If so, they should first test in an environment separate from production.

Vasudev Ram said...

Forgot to mention earlier: Image at top of post is by Yours Truly.

Anonymous said...

Why do you read and write one byte at a time?

Bystroushaak said...

You should really learn to use argparse.

Vasudev Ram said...

Thanks for your comment.

A few thoughts:

1. You shouldn't assume that just because someone doesn't use a particular library in a particular blog post, they don't know it. I know about argparse, optparse, getopt, click, etc. - all different ways of parsing arguments, each of which has its pros and cons; some are deprecated, some are third-party, and so on.

2. This post was meant to show the basics of a command-line utility, with manual argument parsing, simple though it is.

Also, no single blog post can (or needs to) cover all possible options and ways of improving the program; it would become a huge essay if it did. Readers are expected to exercise some judgement when reading posts, and to decide whether to use the code as is or to research what changes they need for their own requirements. Of course I know that some people are going to blindly copy and paste code anyway - that has been going on for ages, and it will be an issue for them.

See recent NPM left-pad issue, etc.:

NPM and Left-Pad: Have We Forgotten How to Program?

Left-pad as a service (left-pad.io)

The more people use libraries blindly without knowing how they work (and I've seen a lot of that happen in real projects I've worked on), the more such fiascos are going to happen.

So, Caveat lector, etc.

See this article of mine about Developing a Linux command-line utility, which was on the IBM developerWorks site for 10 years. In it I mention that there are libraries for argument parsing (in C, in that case), but showed a manual method there too.

Vasudev Ram said...

@Anonymous:

>Why do you read and write one byte at a time?

I started with the simplest thing that could work, which was to read and write a byte at a time. Also, don't forget that there are many levels of buffering involved - some invisible to the programmer - e.g. at the Python stdlib level, the OS, the hard disk controller, the hard disk drive itself, and so on. So just adding a larger buffer at the Python program level is not necessarily going to give a speedup (though it may). The change would need to be measured on many data sets. (First rule of performance tuning: measure before you optimize.) There are other issues to consider, and most of all, the alternate versions would have to be timed before one could claim that a particular version (say, with 4K or 16K buffering) is faster. That would be a bigger project, and was not the goal of this post anyway.
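For instance, a crude first measurement could be as simple as this (just a sketch - the test file name and split size here are made up):

import time

# Time one run of bsplit on a hypothetical test file.
start = time.time()
bsplit("testfile.bin", 4096)
print("bsplit took {:.2f} seconds".format(time.time() - start))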

I did briefly consider the option of reading a larger buffer at a time, but decided against it for this first version at least, since the idea was to get something simple working first. As the programming proverb says:

"First make it work; then make it right; then make it fast."

Also see the related comments on my other recent posts, such as the ones on the text file version of the split utility and the file compare utility, linked to in the post above.

Vasudev Ram said...

Interestingly, no reader has mentioned these issues in the code:

1. This line:

err_write(
    "splits in_filename into files with bytes_per_file bytes\n".format(
        sys.argv[0]))

is wrong because there is no placeholder {} for the format arg, sys.argv[0], to be interpolated into.
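The simplest fix is to drop the needless .format call, since nothing in that string needs interpolation:

err_write(
    "splits in_filename into files with bytes_per_file bytes\n")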

2. I assign to, but do not use, the variables tot_byte_count and file_count in this program. And IIRC (without looking), in the previous (text file) split program, I did not use byte_count either (I do use it in this program). Having but not using those two variables was an oversight; I initially intended to print out their values at the end of the program, but then changed my mind and decided to more closely emulate (in this bit) the Unix style of providing only minimal output messages (sometimes none). But I forgot to delete the code that sets those unused variables. Using pylint would probably have caught this mistake.